Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Started by Xuneng Zhou · 5 months ago · 18 messages
#1 Xuneng Zhou
xunengzhou@gmail.com
2 attachment(s)

Hi hackers,

1) Problem and Background

During a performance run [1], I observed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
suggests replacing the current check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by the walsender:
/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/

Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when waits are long. I built a
POC patch that swaps polling for CVs, but a single global CV (or even
separate “flush” and “replay” CVs) isn’t ideal:
• The wake-up routines don’t know which LSN each waiter cares about,
so they would need to broadcast on every flush/replay.

• Caching the minimum outstanding target LSN could reduce spurious
wake-ups but won’t eliminate them when multiple backends wait for
different LSNs simultaneously.

• The walsender accepts some broadcast overhead via two CVs for
different waiters. A more precise approach would require a request
queue that maps waiters to target LSNs and issues targeted
wake-ups—adding complexity.

2) Proposal
I came across the thread “Implement waiting for WAL LSN replay:
reloaded” [2] by Alexander. The “Implement WAIT FOR” patch in that
thread provides well-established infrastructure for waiting on WAL
replay in backends. With modest adjustments, it could be generalized.

Main changes in patch “v1 Improve read_local_xlog_page_guts by
replacing polling with latch-based waiting”:
• Introduce WaitForLSNFlush, analogous to WaitForLSNReplay from the
“WAIT FOR” work.

• Replace the busy-wait in read_local_xlog_page_guts() with
WaitForLSNFlush and WaitForLSNReplay.

• Add wake-up calls in XLogFlush and XLogBackgroundFlush.

3) Edge Case: Timeline Switch During Wait
/*
* Check which timeline to get the record from.
*
* We have to do it each time through the loop because if we're in
* recovery as a cascading standby, the current timeline might've
* become historical. We can't rely on RecoveryInProgress() because in
* a standby configuration like
*
* A => B => C
*
* if we're a logical decoding session on C, and B gets promoted, our
* timeline will change while we remain in recovery.
*
* We can't just keep reading from the old timeline as the last WAL
* archive in the timeline will get renamed to .partial by
* StartupXLOG().
*/

read_local_xlog_page_guts() re-evaluates the active timeline on each
loop iteration because, on a cascading standby, the current timeline
can become historical. Once that happens, there’s no need to keep
waiting for that timeline. A timeline switch could therefore render an
in-progress wait unnecessary.

One option is to add a wake-up at the point where the timeline switch
occurs, so waiting processes exit promptly. The current approach
chooses not to do this, given that most waits are short and timeline
changes in cascading standbys are rare. Supporting timeline-switch
wake-ups would also require additional handling in both
WaitForLSNFlush and WaitForLSNReplay, increasing complexity.

Comments and suggestions are welcome.

[1]: /messages/by-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com
[2]: /messages/by-id/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com

Best,
Xuneng

Attachments:

v8-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 4487999a6c393e42619ae77e5e7f14c6cac9f235 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Wed, 27 Aug 2025 09:12:38 +0800
Subject: [PATCH v8] Implement WAIT FOR command

WAIT FOR is to be used on a standby and waits for a specific WAL
location to be replayed.  This is useful when the user makes some data
changes on the primary and needs a guarantee that these changes are
visible on the standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust way
to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this respect.
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 219 ++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 284 +++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  29 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  90 ++++
 src/include/commands/wait.h                   |  21 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 269 ++++++++++++
 src/tools/pgindent/typedefs.list              |   5 +-
 30 files changed, 1457 insertions(+), 15 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..ecaff5d5deb 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this:
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..433901baa82
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,219 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ TIMEOUT <replaceable class="parameter">timeout</replaceable> ] [ NO_THROW ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified.
+    If <parameter>NO_THROW</parameter> is specified, then the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and <parameter>timeout</parameter> is greater than zero,
+      the command waits until <parameter>lsn</parameter> is reached or
+      the specified <parameter>timeout</parameter> has elapsed.
+     </para>
+     <para>
+      The <parameter>timeout</parameter> can be given as an integer number
+      of milliseconds, or as a string literal containing an integer number
+      of milliseconds or a number with a unit
+      (see <xref linkend="config-setting-names-values"/>).
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>NO_THROW</literal></term>
+    <listitem>
+     <para>
+      Specifies not to throw an error in the case of a timeout or of
+      running on the primary.  In this case the result status can be
+      obtained from the return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> has been replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency when using an asynchronous replica for
+    reads and the primary for writes.  In that case, the <acronym>lsn</acronym>
+    of the last modification should be stored on the client application side
+    or the connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on the primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    the changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>lsn</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After
+   that, the changes made on the primary should be visible on the replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    It may also happen that the target <parameter>lsn</parameter> is not
+    reached within the timeout.  In that case an error is thrown.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The same example uses <command>WAIT FOR</command> with the
+   <parameter>NO_THROW</parameter> option.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..f5257dfa689 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6205,6 +6206,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f23ec8969c2..408454bb8b9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..2cc9312e836
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to shared
+ *		memory and waits on its latch until it is woken up by the startup
+ *		process, the timeout is reached, the standby is promoted, or the
+ *		postmaster dies.  Then, it clears its information from shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast-path check minWaitedLSN > replayLSN.  If this check is negative,
+ *		it checks waitersHeap and wakes up the backends whose awaited LSNs
+ *		have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitReplayLSN->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const		WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const		WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitReplayLSN->minWaitedLSN according to the current state of
+ * waitReplayLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  It should be enough to need a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes until we find an LSN not
+		 * yet replayed.  Record the process numbers to wake up, but to avoid
+		 * holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already replayed.
+		 * As these are the time-consuming operations, we do this outside of
+		 * WaitLSNLock.  This is actually fine because procLatch isn't ever
+		 * freed, so at worst we can set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in
+ * recovery, or if the replica was promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on the
+		 * primary.  However, it's possible that the standby was promoted
+		 * concurrently with the call, after the target LSN was replayed.  So,
+		 * we still check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might have
+	 * already been deleted by the startup process.  The 'inHeap' flag
+	 * prevents double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited on timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
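As a reviewer aid (not part of the patch), the timeout handling in WaitForLSNReplay() above reduces to a small standalone sketch: the remaining delay is recomputed from a fixed deadline on every latch wake-up, so spurious wake-ups shorten the next sleep instead of extending the total wait. All names below are invented for illustration.

```c
#include <assert.h>

/* Remaining milliseconds before "endtime", clamped at zero; this mirrors
 * the TimestampDifferenceMilliseconds() call in the loop above. */
static long
remaining_ms(long endtime, long now)
{
	return (endtime > now) ? endtime - now : 0;
}

/* Simulate the wait loop: each wake-up at wake_times[i] rechecks the
 * fixed deadline, so spurious wake-ups never extend the total timeout. */
static int
simulate_wait(long start, long timeout, const long *wake_times, int n)
{
	long		endtime = start + timeout;
	int			wakeups = 0;

	for (int i = 0; i < n; i++)
	{
		if (remaining_ms(endtime, wake_times[i]) == 0)
			break;				/* deadline passed: report timeout */
		wakeups++;
	}
	return wakeups;
}
```

With a 100 ms budget and wake-ups at 10/30/60/90/120 ms, the loop services four early wake-ups and stops at the fifth, which is the behavior the per-iteration delay recomputation guarantees.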
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..cfa42ad6f6c
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "nodes/print.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+	WaitStmtParamNone,
+	WaitStmtParamTimeout,
+	WaitStmtParamLSN
+}			WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt * stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	WaitStmtParam curParam = WaitStmtParamNone;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	bool		o_lsn = false;
+	bool		o_timeout = false;
+	bool		o_no_throw = false;
+
+	foreach_ptr(Node, option, stmt->options)
+	{
+		if (IsA(option, String))
+		{
+			String	   *str = castNode(String, option);
+			char	   *name = str_tolower(str->sval, strlen(str->sval),
+										   DEFAULT_COLLATION_OID);
+
+			if (curParam != WaitStmtParamNone)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected parameter after \"%s\"", name)));
+
+			if (strcmp(name, "lsn") == 0)
+			{
+				if (o_lsn)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "lsn")));
+				o_lsn = true;
+				curParam = WaitStmtParamLSN;
+			}
+			else if (strcmp(name, "timeout") == 0)
+			{
+				if (o_timeout)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "timeout")));
+				o_timeout = true;
+				curParam = WaitStmtParamTimeout;
+			}
+			else if (strcmp(name, "no_throw") == 0)
+			{
+				if (o_no_throw)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "no_throw")));
+				o_no_throw = true;
+				throw = false;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized parameter \"%s\"", name)));
+
+		}
+		else if (IsA(option, Integer))
+		{
+			Integer    *intVal = castNode(Integer, option);
+
+			if (curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected integer value")));
+
+			timeout = intVal->ival;
+
+			curParam = WaitStmtParamNone;
+		}
+		else if (IsA(option, A_Const))
+		{
+			A_Const    *constVal = castNode(A_Const, option);
+			String	   *str = &constVal->val.sval;
+
+			if (curParam != WaitStmtParamLSN &&
+				curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected string value")));
+
+			if (curParam == WaitStmtParamLSN)
+			{
+				lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+													  CStringGetDatum(str->sval)));
+			}
+			else if (curParam == WaitStmtParamTimeout)
+			{
+				const char *hintmsg;
+				double		val;
+
+				if (!parse_real(str->sval, &val, GUC_UNIT_MS, &hintmsg))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("invalid value for timeout option: \"%s\"",
+									str->sval),
+							 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+				}
+
+				/*
+				 * Get rid of any fractional part in the input.  This is so we
+				 * don't fail on just-out-of-range values that would round into range.
+				 */
+				val = rint(val);
+
+				/* Range check */
+				if (unlikely(isnan(val) || !FLOAT8_FITS_IN_INT64(val)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("timeout value is out of range for type bigint")));
+
+				timeout = (int64) val;
+			}
+
+			curParam = WaitStmtParamNone;
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("unexpected parameter type")));
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We must first make sure that
+	 * we don't hold a snapshot and that, correspondingly, our MyProc->xmin
+	 * is invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, implying a kind of self-deadlock.  This is the reason why
+	 * pg_wal_replay_wait() is a procedure, not a function.
+	 *
+	 * First, check that there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  That completes the
+	 * preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay(), throwing an appropriate
+	 * error if needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt * stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
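For readers skimming the option handling above: ExecWaitStmt() walks the option list as a one-token-lookahead state machine, where a keyword (String node) selects which parameter the following value token binds to. A hypothetical, self-contained sketch of the same idea (none of these names are from the patch):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum
{
	P_NONE,						/* no keyword pending */
	P_LSN,						/* next token is the LSN value */
	P_TIMEOUT					/* next token is the timeout value */
} Param;

/* One-token-lookahead parse of a flat token list such as
 * {"lsn", "0/12345", "timeout", "1000"}.  Returns 0 on success, or -1
 * for a value token with no preceding keyword (a syntax error). */
static int
parse_wait_options(const char **toks, int n, const char **lsn, long *timeout)
{
	Param		cur = P_NONE;

	for (int i = 0; i < n; i++)
	{
		if (strcmp(toks[i], "lsn") == 0)
			cur = P_LSN;
		else if (strcmp(toks[i], "timeout") == 0)
			cur = P_TIMEOUT;
		else if (cur == P_LSN)
		{
			*lsn = toks[i];
			cur = P_NONE;
		}
		else if (cur == P_TIMEOUT)
		{
			*timeout = strtol(toks[i], NULL, 10);
			cur = P_NONE;
		}
		else
			return -1;			/* value without a keyword */
	}
	return 0;
}
```

A value with no preceding keyword is rejected, which corresponds to the "unexpected string value" and "unexpected integer value" errors in the real code.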
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
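The refactoring above makes pairingheap_allocate() a thin wrapper over the new in-place initializer. A toy model of the same split (illustrative names only, not the real pairingheap API), showing why a heap header embedded in pre-allocated shared memory needs initialize-in-place rather than palloc:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for the pairing heap header. */
typedef int (*heap_cmp) (const void *a, const void *b, void *arg);

typedef struct toyheap
{
	heap_cmp	cmp;
	void	   *arg;
	void	   *root;
} toyheap;

/* In-place initializer: works on caller-provided storage, e.g. a field
 * of a shared memory struct that was carved out elsewhere. */
static void
toyheap_initialize(toyheap *h, heap_cmp cmp, void *arg)
{
	h->cmp = cmp;
	h->arg = arg;
	h->root = NULL;
}

/* Allocating variant now just delegates, mirroring the refactoring. */
static toyheap *
toyheap_allocate(heap_cmp cmp, void *arg)
{
	toyheap    *h = malloc(sizeof(toyheap));

	toyheap_initialize(h, cmp, arg);
	return h;
}

/* Trivial comparator, only for demonstration. */
static int
toyheap_cmp_stub(const void *a, const void *b, void *arg)
{
	(void) a;
	(void) b;
	(void) arg;
	return 0;
}
```

This is why the patch embeds a bare `pairingheap` (not a pointer) in WaitLSNState and initializes it with pairingheap_initialize() during shared memory setup.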
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index db43034b9db..164fd23017c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
+%type <node>	wait_option
+%type <list>	wait_option_list
+
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16402,6 +16406,25 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR wait_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
+
+wait_option_list:
+			wait_option						{ $$ = list_make1($1); }
+			| wait_option_list wait_option	{ $$ = lappend($1, $2); }
+			;
+
+wait_option: ColLabel						{ $$ = (Node *) makeString($1); }
+			 | NumericOnly					{ $$ = (Node *) $1; }
+			 | Sconst						{ $$ = (Node *) makeStringConst($1, @1); }
+		;
+
 
 /*
  * Aggregate decoration clauses
@@ -18050,6 +18073,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18707,6 +18731,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..a1cb9f2473e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up the wait-for-LSN state, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 4f4191b0ea6..880fa7807eb 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 5427da5bc1b..ee20a48b2c5 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -352,6 +353,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..72be2f76293
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+}			WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+}			WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path check of whether we need to wake up any waiters after
+	 * replaying a WAL record.  Reads are lock-less; updates require WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (least LSN
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState * waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
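The minWaitedLSN fast path can be sketched with C11 atomics in place of pg_atomic (purely illustrative; the sentinel and function names below are invented): the replay side performs one lock-free load and skips the wakeup machinery entirely when no waiter's target has been reached.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NO_WAITERS UINT64_MAX	/* sentinel for an empty waiters heap */

static _Atomic uint64_t min_waited_lsn = NO_WAITERS;

/* Called with the waiters lock held whenever the heap minimum changes. */
static void
update_min_waited(uint64_t new_min)
{
	atomic_store(&min_waited_lsn, new_min);
}

/* Replay-side fast path: a single lock-free load decides whether any
 * waiter could be satisfied by the just-replayed LSN. */
static int
need_wakeup(uint64_t replayed_lsn)
{
	return atomic_load(&min_waited_lsn) <= replayed_lsn;
}
```

The sentinel makes the common no-waiters case a compare against UINT64_MAX, so replay pays only one atomic read per record when nobody is waiting.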
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ef9e5f0c0be
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt * stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt * stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..b8d3fc009fb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+}			WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..da1cfeb1c52
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,269 @@
+# Checks waiting for LSN replay on a standby using the
+# WAIT FOR statement.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to the primary and
+# remember the primary's insert LSN, then wait for that LSN to be
+# replayed on the standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on the standby is at least as big as the LSN
+# observed on the primary above.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on the standby reflects the recent changes on the primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far in advance, so that WAL records issued by
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT -1000;",
+	stderr => \$stderr);
+ok($stderr =~ /timeout.*must not be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' UNKNOWN_PARAM;",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized parameter/,
+	"get error for unknown parameter");
+
+# Test duplicate LSN parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' LSN '${test_lsn}';",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate LSN parameter");
+
+# Test duplicate TIMEOUT parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT 1000 TIMEOUT 2000;",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' NO_THROW NO_THROW;",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;",
+	stderr => \$stderr);
+ok($stderr =~ /lsn.*must be specified/,
+	"get error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql('postgres', "WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT 'invalid';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid value for timeout option/,
+	"get error for invalid timeout format");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When a wait
+# finishes, a stored procedure logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit, the command may be sent to a session
+# that is already closed.  So \q is in the initial script; here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..f303f04d007 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -615,7 +615,6 @@ DatumTupleFields
 DbInfo
 DbInfoArr
 DbLocaleInfo
-DbOidName
 DeClonePtrType
 DeadLockState
 DeallocateStmt
@@ -2283,7 +2282,6 @@ PlannerParamItem
 Point
 Pointer
 PolicyInfo
-PolyNumAggState
 Pool
 PopulateArrayContext
 PopulateArrayState
@@ -4129,6 +4127,7 @@ tar_file
 td_entry
 teSection
 temp_tablespaces_extra
+test128
 test_re_flags
 test_regex_ctx
 test_shm_mq_header
@@ -4198,6 +4197,7 @@ varatt_expanded
 varattrib_1b
 varattrib_1b_e
 varattrib_4b
+vartag_external
 vbits
 verifier_context
 walrcv_alter_slot_fn
@@ -4326,7 +4326,6 @@ xmlGenericErrorFunc
 xmlNodePtr
 xmlNodeSetPtr
 xmlParserCtxtPtr
-xmlParserErrors
 xmlParserInputPtr
 xmlSaveCtxt
 xmlSaveCtxtPtr
-- 
2.49.0

Attachment: v1-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch
From 969aa46761f0b835e1a2e5d1ab2824c58a04afed Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Wed, 27 Aug 2025 23:16:56 +0800
Subject: [PATCH v1] Improve read_local_xlog_page_guts by replacing polling
 with latch-based waiting

Replace inefficient polling loops in read_local_xlog_page_guts with latch-based waiting
when WAL data is not yet available.  This eliminates CPU-intensive busy waiting and improves
responsiveness by waking processes immediately when their target LSN becomes available.
---
 src/backend/access/transam/xlog.c             | 18 +++++
 src/backend/access/transam/xlogutils.c        | 48 +++++++++++---
 src/backend/access/transam/xlogwait.c         | 66 +++++++++++++++++++
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/access/xlogwait.h                 |  1 +
 5 files changed, 126 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f5257dfa689..f4b9f3c9799 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2913,6 +2913,15 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(&XLogCtl->logFlushResult >= pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+		WaitLSNWakeup(&XLogCtl->logFlushResult);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3088,6 +3097,15 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(&XLogCtl->logFlushResult >= pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+		WaitLSNWakeup(&XLogCtl->logFlushResult);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 27ea52fdfee..b7ba3f6a737 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for xlog to become available, if necessary.
 	 */
 	while (1)
 	{
@@ -927,7 +923,6 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
-
 			if (loc <= read_upto)
 				break;
 
@@ -947,7 +942,44 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for LSN using appropriate method based on server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSNFlush(loc);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSNReplay(loc, 0);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting. This is the tricky case.
+						 * We're now a primary, so loop back and use flush
+						 * logic instead of replay logic.
+						 */
+						break;
+
+					default:
+						/* Shouldn't happen without timeout */
+						elog(ERROR, "unexpected wait result");
+				}
+			}
+
+			/*
+			 * Loop back to recheck everything.
+			 * Timeline might have changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 2cc9312e836..20c75e93cfa 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -386,3 +386,69 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 
 	return WAIT_LSN_RESULT_SUCCESS;
 }
+
+/*
+ * Wait for the given LSN to be flushed on the primary server.
+ * Returns once the target LSN has been flushed.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to waiters */
+	addLSNWaiter(targetLSN);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);	
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap. We might
+	 * already have been deleted by the waker process.  The 'inHeap' flag
+	 * prevents double deletion.
+	 */
+	deleteLSNWaiter();
+
+	return;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index ee20a48b2c5..b2266666c02 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 72be2f76293..742c47609ae 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -86,5 +86,6 @@ extern void WaitLSNShmemInit(void);
 extern void WaitLSNWakeup(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
-- 
2.49.0

#2Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#1)
1 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

I attached the wrong patch
(v1-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch). The
correct one is attached here.


On Wed, Aug 27, 2025 at 11:23 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi hackers,

During a performance run [1], I observed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
suggests replacing the current check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by the walsender:

1) Problem and Background
/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/

Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when waits are long. I built a
POC patch that swaps polling for CVs, but a single global CV (or even
separate “flush” and “replay” CVs) isn’t ideal:
• The wake-up routines don’t know which LSN each waiter cares about,
so they would need to broadcast on every flush/replay.

• Caching the minimum outstanding target LSN could reduce spurious
wake-ups but won’t eliminate them when multiple backends wait for
different LSNs simultaneously.

• The walsender accepts some broadcast overhead via two CVs for
different waiters. A more precise approach would require a request
queue that maps waiters to target LSNs and issues targeted
wake-ups—adding complexity.

2) Proposal
I came across the thread “Implement waiting for WAL LSN replay:
reloaded” [2] by Alexander. The “Implement WAIT FOR” patch in that
thread provides a well-established infrastructure for waiting on WAL
replay in backends. With modest adjustments, it could be generalized.

Main changes in patch v1 Improve read_local_xlog_page_guts by replacing polling
with latch-based waiting:
• Introduce WaitForLSNFlush, analogous to WaitForLSNReplay from the
“WAIT FOR” work.

• Replace the busy-wait in read_local_xlog_page_guts() with
WaitForLSNFlush and WaitForLSNReplay.

• Add wake-up calls in XLogFlush and XLogBackgroundFlush.

Edge Case: Timeline Switch During Wait
/*
* Check which timeline to get the record from.
*
* We have to do it each time through the loop because if we're in
* recovery as a cascading standby, the current timeline might've
* become historical. We can't rely on RecoveryInProgress() because in
* a standby configuration like
*
* A => B => C
*
* if we're a logical decoding session on C, and B gets promoted, our
* timeline will change while we remain in recovery.
*
* We can't just keep reading from the old timeline as the last WAL
* archive in the timeline will get renamed to .partial by
* StartupXLOG().

read_local_xlog_page_guts() re-evaluates the active timeline on each
loop iteration because, on a cascading standby, the current timeline
can become historical. Once that happens, there’s no need to keep
waiting for that timeline. A timeline switch could therefore render an
in-progress wait unnecessary.

One option is to add a wake-up at the point where the timeline switch
occurs, so waiting processes exit promptly. The current approach
chooses not to do this, given that most waits are short and timeline
changes in cascading standby are rare. Supporting timeline-switch
wake-ups would also require additional handling in both
WaitForLSNFlush and WaitForLSNReplay, increasing complexity.

Comments and suggestions are welcome.

[1] /messages/by-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com
[2] /messages/by-id/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com

Best,
Xuneng

Attachments:

Attachment: v2-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch
From c6275a297e2478acf68260ce9f21c45c1e6da223 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Wed, 27 Aug 2025 23:28:12 +0800
Subject: [PATCH v2] Improve read_local_xlog_page_guts by replacing polling
 with latch-based waiting

Replace inefficient polling loops in read_local_xlog_page_guts with latch-based waiting
when WAL data is not yet available.  This eliminates CPU-intensive busy waiting and improves
responsiveness by waking processes immediately when their target LSN becomes available.
---
 src/backend/access/transam/xlog.c             | 18 +++++
 src/backend/access/transam/xlogutils.c        | 48 +++++++++++---
 src/backend/access/transam/xlogwait.c         | 66 +++++++++++++++++++
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/access/xlogwait.h                 |  1 +
 5 files changed, 126 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f5257dfa689..4af8f22f166 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2913,6 +2913,15 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+		WaitLSNWakeup(LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3088,6 +3097,15 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+		WaitLSNWakeup(LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 27ea52fdfee..b7ba3f6a737 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for xlog to become available, if necessary.
 	 */
 	while (1)
 	{
@@ -927,7 +923,6 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
-
 			if (loc <= read_upto)
 				break;
 
@@ -947,7 +942,44 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for LSN using appropriate method based on server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSNFlush(loc);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSNReplay(loc, 0);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting. This is the tricky case.
+						 * We're now a primary, so loop back and use flush
+						 * logic instead of replay logic.
+						 */
+						break;
+
+					default:
+						/* Shouldn't happen without timeout */
+						elog(ERROR, "unexpected wait result");
+				}
+			}
+
+			/*
+			 * Loop back to recheck everything.
+			 * Timeline might have changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 2cc9312e836..20c75e93cfa 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -386,3 +386,69 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 
 	return WAIT_LSN_RESULT_SUCCESS;
 }
+
+/*
+ * Wait for the given LSN to be flushed on the primary server.
+ * Returns once the target LSN has been flushed.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to waiters */
+	addLSNWaiter(targetLSN);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);	
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap. We might
+	 * already have been deleted by the waker process.  The 'inHeap' flag
+	 * prevents double deletion.
+	 */
+	deleteLSNWaiter();
+
+	return;
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index ee20a48b2c5..b2266666c02 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 72be2f76293..742c47609ae 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -86,5 +86,6 @@ extern void WaitLSNShmemInit(void);
 extern void WaitLSNWakeup(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
-- 
2.49.0

#3Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#2)
3 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

Some changes in v3:
1) Update the note of xlogwait.c to reflect the extending of its use
for flush waiting and internal use for both flush and replay waiting.
2) Update the comment above logical_read_xlog_page which describes the
prior-change behavior of read_local_xlog_page.

Best,
Xuneng

Attachments:

Attachment: v8-0001-Implement-WAIT-FOR-command.patch
From 4487999a6c393e42619ae77e5e7f14c6cac9f235 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Wed, 27 Aug 2025 09:12:38 +0800
Subject: [PATCH v8] Implement WAIT FOR command

WAIT FOR is to be used on a standby and waits for a specific WAL
location to be replayed.  This command is useful when the user makes
some data changes on the primary and needs a guarantee to see these
changes on the standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement
this as a function.  Previous experience shows that stored procedures
also have limitations in this respect.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 219 ++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 284 +++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  29 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  90 ++++
 src/include/commands/wait.h                   |  21 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   7 +
 src/include/parser/kwlist.h                   |   1 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 269 ++++++++++++
 src/tools/pgindent/typedefs.list              |   5 +-
 30 files changed, 1457 insertions(+), 15 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..ecaff5d5deb 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to the synchronous
+    replication
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..433901baa82
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,219 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ TIMEOUT <replaceable class="parameter">timeout</replaceable> ] [ NO_THROW ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is raised
+    unless <literal>NO_THROW</literal> is specified.
+    If <literal>NO_THROW</literal> is specified, the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+    <listitem>
+     <para>
+      When specified and <parameter>timeout</parameter> is greater than zero,
+      the command waits until <parameter>lsn</parameter> is reached or
+      the specified <parameter>timeout</parameter> has elapsed.
+     </para>
+     <para>
+      The <parameter>timeout</parameter> can be given as an integer number
+      of milliseconds, or as a string literal containing an integer number
+      of milliseconds or a number with a unit
+      (see <xref linkend="config-setting-names-values"/>).
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>NO_THROW</literal></term>
+    <listitem>
+     <para>
+      Specifies that no error is thrown in the case of a timeout or of
+      running on the primary.  In this case the result status can be
+      obtained from the return value.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Return values</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the target
+      <parameter>lsn</parameter> was successfully reached.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> is replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an asynchronous replica for
+    reads and the primary for writes.  In that case, the
+    <acronym>lsn</acronym> of the last modification should be stored on the
+    client application side or the connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on the primary, it
+    will error out unless <literal>NO_THROW</literal> is specified.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from a standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command just
+    exits immediately.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for a
+    <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and obtain the <acronym>lsn</acronym>
+    of the changes just made.  This example uses
+    <function>pg_current_wal_insert_lsn</function> on the primary server to
+    obtain the <acronym>lsn</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <command>WAIT FOR</command> with the
+   <parameter>lsn</parameter> obtained from the primary.  After that, the
+   changes made on the primary are guaranteed to be visible on the replica.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+  </para>
+
+  <para>
+    It may also happen that the target <parameter>lsn</parameter> is not
+    reached within the timeout.  In that case an error is thrown.
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+   </programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <literal>NO_THROW</literal> option:
+
+   <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+   </programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up the LSN wait state, if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..f5257dfa689 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6205,6 +6206,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for the replay LSN.  They need to report that
+	 * recovery ended before reaching their target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f23ec8969c2..408454bb8b9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..2cc9312e836
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  A
+ *		waiter process publishes information about itself in shared memory
+ *		and waits on its latch until it is woken up by the startup process,
+ *		the timeout is reached, the standby is promoted, or the postmaster
+ *		dies.  Then, it removes its information from shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast path check minWaitedLSN > replayLSN.  If this check fails, it
+ *		scans waitersHeap and wakes up the backends whose awaited LSNs
+ *		have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for the waitLSNState->waitersHeap heap.  Waiting processes
+ * are ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const		WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const		WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up in WaitLSNWakeup(), allocated
+ * on the stack.  It should be enough to need only a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes until we find an LSN
+		 * not yet replayed.  Record the process numbers to wake up, but to avoid
+		 * holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose awaited LSNs have already been
+		 * replayed.  As this is a potentially time-consuming operation, we do
+		 * it outside of WaitLSNLock.  This is actually fine, because procLatch
+		 * is never freed, so at worst we can set the wrong process's (or no
+		 * process's) latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in
+ * recovery, or if the replica was promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this command was mistakenly called on the
+		 * primary.  However, it's possible that the standby was promoted
+		 * concurrently with the call, after the target LSN was replayed.
+		 * So, we still check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * the target LSN gets replayed before we finish adding ourselves.  The
+	 * recheck at the beginning of the loop below handles that race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but recheck whether the target LSN was
+			 * already replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already have been deleted by the startup process.  The 'inHeap' flag
+	 * prevents a double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited due to timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..cfa42ad6f6c
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or an LSN having been replayed on a replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "nodes/print.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+	WaitStmtParamNone,
+	WaitStmtParamTimeout,
+	WaitStmtParamLSN
+}			WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt * stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	WaitStmtParam curParam = WaitStmtParamNone;
+
+	/*
+	 * Process the list of parameters.
+	 */
+	bool		o_lsn = false;
+	bool		o_timeout = false;
+	bool		o_no_throw = false;
+
+	foreach_ptr(Node, option, stmt->options)
+	{
+		if (IsA(option, String))
+		{
+			String	   *str = castNode(String, option);
+			char	   *name = str_tolower(str->sval, strlen(str->sval),
+										   DEFAULT_COLLATION_OID);
+
+			if (curParam != WaitStmtParamNone)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected parameter after \"%s\"", name)));
+
+			if (strcmp(name, "lsn") == 0)
+			{
+				if (o_lsn)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "lsn")));
+				o_lsn = true;
+				curParam = WaitStmtParamLSN;
+			}
+			else if (strcmp(name, "timeout") == 0)
+			{
+				if (o_timeout)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "timeout")));
+				o_timeout = true;
+				curParam = WaitStmtParamTimeout;
+			}
+			else if (strcmp(name, "no_throw") == 0)
+			{
+				if (o_no_throw)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR),
+							 errmsg("parameter \"%s\" specified more than once", "no_throw")));
+				o_no_throw = true;
+				throw = false;
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized parameter \"%s\"", name)));
+
+		}
+		else if (IsA(option, Integer))
+		{
+			Integer    *intVal = castNode(Integer, option);
+
+			if (curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected integer value")));
+
+			timeout = intVal->ival;
+
+			curParam = WaitStmtParamNone;
+		}
+		else if (IsA(option, A_Const))
+		{
+			A_Const    *constVal = castNode(A_Const, option);
+			String	   *str = &constVal->val.sval;
+
+			if (curParam != WaitStmtParamLSN &&
+				curParam != WaitStmtParamTimeout)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unexpected string value")));
+
+			if (curParam == WaitStmtParamLSN)
+			{
+				lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+													  CStringGetDatum(str->sval)));
+			}
+			else if (curParam == WaitStmtParamTimeout)
+			{
+				const char *hintmsg;
+				double		result;
+
+				if (!parse_real(str->sval, &result, GUC_UNIT_MS, &hintmsg))
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+							 errmsg("invalid value for timeout option: \"%s\"",
+									str->sval),
+							 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+				}
+
+				/*
+				 * Round off any fractional part in the input, so we don't
+				 * fail on just-out-of-range values that would round into range.
+				 */
+				result = rint(result);
+
+				/* Range check */
+				if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("timeout value is out of range for type bigint")));
+
+				timeout = (int64) result;
+			}
+
+			curParam = WaitStmtParamNone;
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("unexpected parameter type")));
+	}
+
+	if (XLogRecPtrIsInvalid(lsn))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_PARAMETER),
+				 errmsg("\"lsn\" must be specified")));
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We must first ensure that we
+	 * don't hold a snapshot and, correspondingly, our MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a utility command, not a function.
+	 *
+	 * First, check that there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  With this, the
+	 * preparation is done.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("WAIT FOR must be called without an active or registered snapshot"),
+				 errdetail("Make sure WAIT FOR is not called within a transaction with an isolation level higher than READ COMMITTED, a procedure, or a function.")));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						(errcode(ERRCODE_QUERY_CANCELED),
+						 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+								LSN_FORMAT_ARGS(lsn),
+								LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+									   LSN_FORMAT_ARGS(lsn),
+									   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+				}
+				else
+				{
+					ereport(ERROR,
+							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							 errmsg("recovery is not in progress"),
+							 errhint("Waiting for the replay LSN can only be executed during recovery.")));
+				}
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt * stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index db43034b9db..164fd23017c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
+%type <node>	wait_option
+%type <list>	wait_option_list
+
 
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16402,6 +16406,25 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+WaitStmt:
+			WAIT FOR wait_option_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->options = $3;
+					$$ = (Node *)n;
+				}
+			;
+
+wait_option_list:
+			wait_option						{ $$ = list_make1($1); }
+			| wait_option_list wait_option	{ $$ = lappend($1, $2); }
+			;
+
+wait_option: ColLabel						{ $$ = (Node *) makeString($1); }
+			 | NumericOnly					{ $$ = (Node *) $1; }
+			 | Sconst						{ $$ = (Node *) makeStringConst($1, @1); }
+
+		;
 
 /*
  * Aggregate decoration clauses
@@ -18050,6 +18073,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18707,6 +18731,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..a1cb9f2473e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Clean up any pending wait for an LSN.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 4f4191b0ea6..880fa7807eb 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt((WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 5427da5bc1b..ee20a48b2c5 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -352,6 +353,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..72be2f76293
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+}			WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+}			WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used as a
+	 * fast-path check of whether any waiters need waking after replaying
+	 * a WAL record.  Can be read lock-free; updates protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN value (the least
+	 * LSN is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState * waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ef9e5f0c0be
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt * stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt * stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..b8d3fc009fb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	List	   *options;
+}			WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..da1cfeb1c52
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,269 @@
+# Checks waiting for LSN replay on a standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to the
+# primary and remember the primary's insert LSN, then wait for that LSN to
+# be replayed on the standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on the standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on the standby reflects the recent changes on the primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough in advance that WAL records issued by
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT -1000;",
+	stderr => \$stderr);
+ok($stderr =~ /timeout.*must not be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' UNKNOWN_PARAM;",
+	stderr => \$stderr);
+ok($stderr =~ /unrecognized parameter/,
+	"get error for unknown parameter");
+
+# Test duplicate LSN parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' LSN '${test_lsn}';",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate LSN parameter");
+
+# Test duplicate TIMEOUT parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT 1000 TIMEOUT 2000;",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' NO_THROW NO_THROW;",
+	stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;",
+	stderr => \$stderr);
+ok($stderr =~ /lsn.*must be specified/,
+	"get error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql('postgres', "WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT 'invalid';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid value for timeout option/,
+	"get error for invalid timeout format");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, the log_count() function logs whether as many rows are visible
+# as expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR lsn '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that standby promotion terminates waiting on an LSN.  Start
+# waiting for an unreachable LSN, then promote.  Check the log for the
+# relevant error message.  Also, check that waiting for an already-replayed
+# LSN doesn't cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby will be promoted at least to the primary insert LSN
+# we have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit, the command might reach an
+# already-closed session.  So \q is in the initial script; here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..f303f04d007 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -615,7 +615,6 @@ DatumTupleFields
 DbInfo
 DbInfoArr
 DbLocaleInfo
-DbOidName
 DeClonePtrType
 DeadLockState
 DeallocateStmt
@@ -2283,7 +2282,6 @@ PlannerParamItem
 Point
 Pointer
 PolicyInfo
-PolyNumAggState
 Pool
 PopulateArrayContext
 PopulateArrayState
@@ -4129,6 +4127,7 @@ tar_file
 td_entry
 teSection
 temp_tablespaces_extra
+test128
 test_re_flags
 test_regex_ctx
 test_shm_mq_header
@@ -4198,6 +4197,7 @@ varatt_expanded
 varattrib_1b
 varattrib_1b_e
 varattrib_4b
+vartag_external
 vbits
 verifier_context
 walrcv_alter_slot_fn
@@ -4326,7 +4326,6 @@ xmlGenericErrorFunc
 xmlNodePtr
 xmlNodeSetPtr
 xmlParserCtxtPtr
-xmlParserErrors
 xmlParserInputPtr
 xmlSaveCtxt
 xmlSaveCtxtPtr
-- 
2.49.0

v3-0000-cover-letter.patchapplication/octet-stream; name=v3-0000-cover-letter.patchDownload
From ccdf02bfdcca1807d9fe6bd1e39b0b185f81e5e6 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Thu, 28 Aug 2025 15:40:48 +0800
Subject: [PATCH v3 0/2] Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

This patch depends on:
  [PATCH v8] Implement WAIT FOR command
  https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO%2BBBjcirozJ6nYbOW8Q%40mail.gmail.com

Summary:
--------
This patch replaces the polling loop in read_local_xlog_page_guts()
with latch-based infrastructure, building on the WAIT FOR command
introduced by Kartyshov Ivan and Alexander Korotkov in the above patch.
The polling loop was inefficient during long waits; this version
integrates latches for more efficient wakeups.

Credit:
-------
This work builds on the infrastructure by Kartyshov Ivan and Alexander
Korotkov. Credit goes to them for the foundational patch.

Testing:
--------
- Passes `make check-world`
- Shows reduced CPU usage when waiting for WAL in performance tests.

Application:
------------
To apply:
  1. First apply v8-0001-Implement-WAIT-FOR-command.patch
  2. Then apply this patch series

Thanks,
Xuneng

alterego665 (1):
  Improve read_local_xlog_page_guts by replacing polling with
    latch-based waiting

 src/backend/access/transam/xlog.c             | 18 ++++
 src/backend/access/transam/xlogutils.c        | 48 ++++++++--
 src/backend/access/transam/xlogwait.c         | 93 ++++++++++++++++---
 src/backend/replication/walsender.c           |  4 -
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/access/xlogwait.h                 |  1 +
 6 files changed, 142 insertions(+), 23 deletions(-)

-- 
2.49.0

v3-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patchapplication/octet-stream; name=v3-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patchDownload
From ccdf02bfdcca1807d9fe6bd1e39b0b185f81e5e6 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Thu, 28 Aug 2025 15:38:47 +0800
Subject: [PATCH v3 2/2] Improve read_local_xlog_page_guts by replacing polling
 with latch-based waiting

Replace the inefficient polling loop in read_local_xlog_page_guts()
with latch-based waiting when WAL data is not yet available.  This
eliminates CPU-intensive busy waiting and improves responsiveness by
waking processes immediately when their target LSN becomes available.
---
 src/backend/access/transam/xlog.c             | 18 ++++
 src/backend/access/transam/xlogutils.c        | 48 ++++++++--
 src/backend/access/transam/xlogwait.c         | 93 ++++++++++++++++---
 src/backend/replication/walsender.c           |  4 -
 .../utils/activity/wait_event_names.txt       |  1 +
 src/include/access/xlogwait.h                 |  1 +
 6 files changed, 142 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f5257dfa689..4af8f22f166 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2913,6 +2913,15 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+		WaitLSNWakeup(LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3088,6 +3097,15 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+		WaitLSNWakeup(LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 27ea52fdfee..b7ba3f6a737 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for xlog to become available if necessary.
 	 */
 	while (1)
 	{
@@ -927,7 +923,6 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
-
 			if (loc <= read_upto)
 				break;
 
@@ -947,7 +942,44 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for LSN using appropriate method based on server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSNFlush(loc);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSNReplay(loc, 0);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting. This is the tricky case.
+						 * We're now a primary, so loop back and use flush
+						 * logic instead of replay logic.
+						 */
+						break;
+
+					default:
+						/* Shouldn't happen, since no timeout was given */
+						elog(ERROR, "unexpected WaitForLSNReplay() result");
+				}
+			}
+
+			/*
+			 * Loop back to recheck everything.
+			 * Timeline might have changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 2cc9312e836..70241bd384f 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -1,8 +1,8 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.c
- *	  Implements waiting for the given replay LSN, which is used in
- *	  WAIT FOR lsn '...'
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *	  Used by WAIT FOR lsn '...' and internal WAL reading operations.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -10,10 +10,11 @@
  *	  src/backend/access/transam/xlogwait.c
  *
  * NOTES
- *		This file implements waiting for the replay of the given LSN on a
- *		physical standby.  The core idea is very small: every backend that
- *		wants to wait publishes the LSN it needs to the shared memory, and
- *		the startup process wakes it once that LSN has been replayed.
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
@@ -23,14 +24,18 @@
  *
  *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
  *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch before it wakens up by a startup
+ *		memory and waits on the latch until it is woken up by the appropriate
  *		process, timeout is reached, standby is promoted, or the postmaster
  *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		After replaying a WAL record, the startup process first performs a
- *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
- *		it checks waitersHeap and wakes up the backend whose awaited LSNs
- *		are reached.
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast path check minWaitedLSN > replayLSN.  If this
+ *		check is negative, it checks waitersHeap and wakes up the backends
+ *		whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
  *
  *-------------------------------------------------------------------------
  */
@@ -386,3 +391,69 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 
 	return WAIT_LSN_RESULT_SUCCESS;
 }
+
+/*
+ * Wait for the given LSN to be flushed on a primary server.  Returns
+ * once the flush position has reached or passed targetLSN.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to waiters */
+	addLSNWaiter(targetLSN);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already have been deleted by the waker process; the 'inHeap' flag
+	 * prevents double deletion.
+	 */
+	deleteLSNWaiter();
+
+	return;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0855bae3535..5fb74088fcd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1021,10 +1021,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index ee20a48b2c5..b2266666c02 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 72be2f76293..742c47609ae 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -86,5 +86,6 @@ extern void WaitLSNShmemInit(void);
 extern void WaitLSNWakeup(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
-- 
2.49.0

#4Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#3)
3 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

On Thu, Aug 28, 2025 at 4:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

Some changes in v3:
1) Update the notes in xlogwait.c to reflect its extended use for
flush waiting, and its internal use for both flush and replay waiting.
2) Update the comment above logical_read_xlog_page, which described
the pre-change behavior of read_local_xlog_page.

In an off-list discussion, Alexander pointed out potential issues with
the current single-heap design for replay and flush when promotion
occurs concurrently with WAIT FOR. The following is a simple example
illustrating the problem:

During promotion, there's a window where we can have mixed waiter
types in the same heap:

T1: Process A calls read_local_xlog_page_guts on standby
T2: RecoveryInProgress() = TRUE, adds to heap as replay waiter
T3: Promotion begins
T4: EndRecovery() calls WaitLSNWakeup(InvalidXLogRecPtr)
T5: SharedRecoveryState = RECOVERY_STATE_DONE
T6: Process B calls read_local_xlog_page_guts
T7: RecoveryInProgress() = FALSE, adds to SAME heap as flush waiter

The problem is that replay LSNs and flush LSNs represent different
positions in the WAL stream. Having both types in the same heap can
lead to:
- Incorrect wakeup logic (comparing incomparable LSNs)
- Processes waiting forever
- Wrong waiters being woken up

To avoid this problem, patch v4 is updated to use two separate heaps
for flush and replay, as Alexander suggested earlier. It also
introduces a separate minimum-LSN tracking field for flush waiters.

Best,
Xuneng

Attachments:

v4-0000-cover-letter.patch (application/octet-stream)
From ccdf02bfdcca1807d9fe6bd1e39b0b185f81e5e6 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Thu, 28 Aug 2025 15:40:48 +0800
Subject: [PATCH v4 0/2] Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

This patch depends on:
  [PATCH v11] Implement WAIT FOR command
  https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO%2BBBjcirozJ6nYbOW8Q%40mail.gmail.com

Summary:
--------
This patch replaces the polling loop in read_local_xlog_page_guts()
with latch-based infrastructure, building on the WAIT FOR command
introduced by Kartyshov Ivan and Alexander Korotkov in the above patch.
The polling loop was inefficient during long waits; this version
integrates latches for more efficient wakeups.

Application:
------------
To apply:
  1. First apply v11-0001-Implement-WAIT-FOR-command.patch
  2. Then apply v4-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patch

Thanks,
Xuneng
v11-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 0ee9a9275cd811f70a49560e0715556820fb81be Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sat, 27 Sep 2025 23:26:22 +0800
Subject: [PATCH v11] Implement WAIT FOR command

WAIT FOR is to be used on a standby and specifies waiting for
a specific WAL location to be replayed.  This command is useful when
the user makes some data changes on the primary and needs a guarantee
that these changes are visible on the standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this aspect.
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 234 +++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 212 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  33 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  93 +++++
 src/include/commands/wait.h                   |  22 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   8 +
 src/include/parser/kwlist.h                   |   2 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 293 +++++++++++++
 src/tools/pgindent/typedefs.list              |   5 +
 30 files changed, 1435 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <parameter>NO_THROW</parameter> is specified, then the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number
+          of milliseconds, or as a string literal containing an integer number
+          of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specify to not throw an error in the case of timeout or
+          running on the primary.  In this case the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> is replayed on the standby.
+    That is, after this command completes, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case, the <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on a primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+    However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use <command>WAIT FOR</command> command to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from primary.  After that the
+   changes made on primary should be guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example using <command>WAIT FOR</command> with the
+   <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 109713315c0..36b8ac6b855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6222,6 +6223,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..824b0942b34 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch before it wakens up by a startup
+ *		process, timeout is reached, standby is promoted, or the postmaster
+ *		dies.  Then, it cleans information about itself in the shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs a
+ *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
+ *		it checks waitersHeap and wakes up the backend whose awaited LSNs
+ *		are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitReplayLSN->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitReplayLSN->minWaitedLSN according to the current state of
+ * waitReplayLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  It should suffice for a single wakeup pass in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the pairing heap of waiting processes until we find an
+		 * LSN not yet replayed.  Record the process numbers to wake up, but
+		 * to avoid holding the lock for too long, send the wakeups only after
+		 * releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose awaited LSNs have already been
+		 * replayed.  As this can be time-consuming, we do it outside of
+		 * WaitLSNLock.  This is safe because procLatch is never freed, so
+		 * at worst we might set the wrong process's (or no process's)
+		 * latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our entry from the shared memory waiters heap, if present.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch until the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in
+ * recovery, or if the replica was promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this was most likely called mistakenly on a
+		 * primary.  However, the standby may have been promoted concurrently
+		 * with the call, after the target LSN was replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  The target LSN might
+	 * get replayed before we are added.  The recheck at the beginning of the
+	 * loop below closes that race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but recheck whether the target LSN was
+			 * already replayed.  See the comment on deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might have
+	 * already been deleted by the startup process.  The 'inHeap' flag
+	 * prevents double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited due to timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or an LSN having been replayed on a replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      timeout_ms;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &timeout_ms, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input.  This is so we
+			 * don't fail on just-out-of-range values that would round into
+			 * range.
+			 */
+			timeout_ms = rint(timeout_ms);
+
+			/* Range check */
+			if (unlikely(isnan(timeout_ms) || !FLOAT8_FITS_IN_INT64(timeout_ms)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (timeout_ms < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) timeout_ms;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for LSN replay.  We must first make sure that we
+	 * don't hold a snapshot, and that correspondingly our MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, implying a kind of self-deadlock.  This is the reason why
+	 * WAIT FOR is a command, not a procedure or function.
+	 *
+	 * First, check that there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot, if any.  That completes the
+	 * preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must only be called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or procedure, or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Target LSN was replayed in time */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..fd95f24fa74 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -319,6 +319,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -671,7 +672,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -741,7 +741,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -785,7 +785,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1113,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16403,6 +16404,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			opt_with '(' utility_option_list ')'	{ $$ = $3; }
+			| /*EMPTY*/							    { $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17882,6 +17903,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18051,6 +18073,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18497,6 +18520,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18708,6 +18732,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +356,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/*
+	 * A pairing heap node for participation in
+	 * waitLSNState->waitersHeap
+	 */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for a
+	 * fast-path check of whether to wake up any waiters after replaying a
+	 * WAL record.  Read lock-lessly; updates are protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN values (the least
+	 * LSN is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f1706df58fd..997c72ab858 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4363,4 +4363,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..69a81e21fbb 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -269,6 +269,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -494,6 +495,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for LSN replay on a standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough in the future that WAL records issued
+# by concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than REPEATABLE READ"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for an already-replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we sent \q via $psql_session->quit, the command could arrive at a session
+# that is already closed, so here we only finish the IPC::Run handle.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5a80b4359f..ac0252936be 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3257,7 +3257,12 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
+WaitStmtParam
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.51.0

Attachment: v4-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patch (application/octet-stream)
From eb30d509886b4ff1a908b11d23fa46a5ad751e8b Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 28 Sep 2025 18:44:54 +0800
Subject: [PATCH v4 2/2] Improve read_local_xlog_page_guts by replacing polling
  with latch-based waiting

Replace inefficient polling loops in read_local_xlog_page_guts with latch-based waiting
when WAL data is not yet available.  This eliminates CPU-intensive busy waiting and improves
responsiveness by waking processes immediately when their target LSN becomes available.
---
 src/backend/access/transam/xlog.c             |  20 +-
 src/backend/access/transam/xlogrecovery.c     |   4 +-
 src/backend/access/transam/xlogutils.c        |  48 ++-
 src/backend/access/transam/xlogwait.c         | 322 +++++++++++++-----
 src/backend/replication/walsender.c           |   4 -
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/access/xlogwait.h                 |  58 ++--
 7 files changed, 342 insertions(+), 115 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 36b8ac6b855..76c5ad7ae26 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2913,6 +2913,15 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, walk the
+	 * shared-memory waiters heap and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3095,6 +3104,15 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, walk the
+	 * shared-memory waiters heap and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6227,7 +6245,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(InvalidXLogRecPtr);
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 824b0942b34..1859d2084e8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
-				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 38176d9688e..0ea02a45c6b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for xlog to become available, if necessary.
 	 */
 	while (1)
 	{
@@ -927,7 +923,6 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
-
 			if (loc <= read_upto)
 				break;
 
@@ -947,7 +942,44 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for LSN using appropriate method based on server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSNFlush(loc);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSNReplay(loc, 0);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting. This is the tricky case.
+						 * We're now a primary, so loop back and use flush
+						 * logic instead of replay logic.
+						 */
+						break;
+
+					default:
+						/* Shouldn't happen when no timeout was given */
+						elog(ERROR, "unexpected wait result");
+				}
+			}
+
+			/*
+			 * Loop back to recheck everything.
+			 * Timeline might have changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 4d831fbfa74..e0ac2620bd5 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -1,8 +1,8 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.c
- *	  Implements waiting for the given replay LSN, which is used in
- *	  WAIT FOR lsn '...'
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *	  Used by WAIT FOR LSN '...' and internal WAL reading operations.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -10,10 +10,11 @@
  *	  src/backend/access/transam/xlogwait.c
  *
  * NOTES
- *		This file implements waiting for the replay of the given LSN on a
- *		physical standby.  The core idea is very small: every backend that
- *		wants to wait publishes the LSN it needs to the shared memory, and
- *		the startup process wakes it once that LSN has been replayed.
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
@@ -23,14 +24,18 @@
  *
  *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
  *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch before it wakens up by a startup
+ *		memory and waits on the latch before it wakens up by the appropriate
  *		process, timeout is reached, standby is promoted, or the postmaster
  *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		After replaying a WAL record, the startup process first performs a
- *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
- *		it checks waitersHeap and wakes up the backend whose awaited LSNs
- *		are reached.
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast path check minWaitedReplayLSN > replayLSN.  If this
+ *		check is negative, it checks waitersHeap and wakes up the backend
+ *		whose awaited LSNs are reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
  *
  *-------------------------------------------------------------------------
  */
@@ -81,22 +86,46 @@ WaitLSNShmemInit(void)
 														  &found);
 	if (!found)
 	{
-		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
-		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
 		memset(&waitLSNState->procInfos, 0,
 			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
 	}
 }
 
 /*
- * Comparison function for waitReplayLSN->waitersHeap heap.  Waiting processes are
- * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ * Comparison function for waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ * This function works for both replay and flush heaps.
  */
 static int
 waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 {
-	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
-	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+	const WaitLSNProcInfo *aproc;
+	const WaitLSNProcInfo *bproc;
+
+	/*
+	 * Determine which heap the nodes belong to.  The two heap nodes are
+	 * embedded at different offsets within WaitLSNProcInfo, so we use
+	 * the arg parameter to tell them apart.
+	 */
+	if ((uintptr_t)arg == WAIT_LSN_REPLAY)
+	{
+		aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+		bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+	}
+	else
+	{
+		aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+		bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+	}
 
 	if (aproc->waitLSN < bproc->waitLSN)
 		return 1;
@@ -107,65 +136,88 @@ waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 }
 
 /*
- * Update waitReplayLSN->minWaitedLSN according to the current state of
- * waitReplayLSN->waitersHeap.
+ * Update minimum waited LSN for the specified operation type
  */
 static void
-updateMinWaitedLSN(void)
+updateMinWaitedLSN(WaitLSNOperation operation)
 {
-	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
 
-	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	if (operation == WAIT_LSN_REPLAY)
 	{
-		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
-
-		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
 	}
-
-	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
 }
 
 /*
- * Put the current process into the heap of LSN waiters.
+ * Add the current process to the appropriate waiters heap for the operation type.
  */
 static void
-addLSNWaiter(XLogRecPtr lsn)
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
 {
 	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
 
 	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
-	Assert(!procInfo->inHeap);
-
 	procInfo->procno = MyProcNumber;
 	procInfo->waitLSN = lsn;
 
-	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
-	procInfo->inHeap = true;
-	updateMinWaitedLSN();
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
 
 	LWLockRelease(WaitLSNLock);
 }
 
 /*
- * Remove the current process from the heap of LSN waiters if it's there.
+ * Remove the current process from the appropriate waiters heap, if it's there.
  */
 static void
-deleteLSNWaiter(void)
+deleteLSNWaiter(WaitLSNOperation operation)
 {
 	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
 
 	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
-	if (!procInfo->inHeap)
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
 	{
-		LWLockRelease(WaitLSNLock);
-		return;
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
 	}
-
-	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
-	procInfo->inHeap = false;
-	updateMinWaitedLSN();
 
 	LWLockRelease(WaitLSNLock);
 }
@@ -177,7 +229,7 @@ deleteLSNWaiter(void)
 #define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
 
 /*
- * Remove waiters whose LSN has been replayed from the heap and set their
+ * Remove waiters whose LSN has been reached from the heap and set their
  * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
  * and set latches for all waiters.
  *
@@ -188,12 +240,18 @@ deleteLSNWaiter(void)
  * if there are more waiters, this function will loop to process them in
  * multiple chunks.
  */
-void
-WaitLSNWakeup(XLogRecPtr currentLSN)
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
 {
-	int			i;
-	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
-	int			numWakeUpProcs;
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
 
 	do
 	{
@@ -201,35 +259,42 @@ WaitLSNWakeup(XLogRecPtr currentLSN)
 		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 		/*
-		 * Iterate the pairing heap of waiting processes till we find LSN not
-		 * yet replayed.  Record the process numbers to wake up, but to avoid
-		 * holding the lock for too long, send the wakeups only after
-		 * releasing the lock.
+		 * Iterate the waiters heap until we find an LSN not yet reached.
+		 * Record process numbers to wake up, but send wakeups only after releasing the lock.
 		 */
-		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		while (!pairingheap_is_empty(heap))
 		{
-			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
-			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
 
-			if (!XLogRecPtrIsInvalid(currentLSN) &&
-				procInfo->waitLSN > currentLSN)
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
 				break;
 
 			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
 			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
-			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
-			procInfo->inHeap = false;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
 
 			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
 				break;
 		}
 
-		updateMinWaitedLSN();
-
+		updateMinWaitedLSN(operation);
 		LWLockRelease(WaitLSNLock);
 
 		/*
-		 * Set latches for processes, whose waited LSNs are already replayed.
+		 * Set latches for processes, whose waited LSNs are already reached.
 		 * As the time consuming operations, we do this outside of
 		 * WaitLSNLock. This is  actually fine because procLatch isn't ever
 		 * freed, so we just can potentially set the wrong process' (or no
@@ -238,25 +303,54 @@ WaitLSNWakeup(XLogRecPtr currentLSN)
 		for (i = 0; i < numWakeUpProcs; i++)
 			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
 
-		/* Need to recheck if there were more waiters than static array size. */
-	}
-	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
 }
 
 /*
- * Delete our item from shmem array if any.
+ * Clean up this process's LSN waiter entries, if any.
  */
 void
 WaitLSNCleanup(void)
 {
-	/*
-	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
-	 * flag is set to true only by the process itself.  So, it's only possible
-	 * to get a false positive.  But that will be eliminated by a recheck
-	 * inside deleteLSNWaiter().
-	 */
-	if (waitLSNState->procInfos[MyProcNumber].inHeap)
-		deleteLSNWaiter();
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
 }
 
 /*
@@ -308,11 +402,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	}
 
 	/*
-	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * Add our process to the replay waiters heap.  It might happen that
 	 * target LSN gets replayed before we do.  Another check at the beginning
 	 * of the loop below prevents the race condition.
 	 */
-	addLSNWaiter(targetLSN);
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
 
 	for (;;)
 	{
@@ -326,7 +420,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 			 * Recovery was ended, but recheck if target LSN was already
 			 * replayed.  See the comment regarding deleteLSNWaiter() below.
 			 */
-			deleteLSNWaiter();
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
 			currentLSN = GetXLogReplayRecPtr(NULL);
 			if (PromoteIsTriggered() && targetLSN <= currentLSN)
 				return WAIT_LSN_RESULT_SUCCESS;
@@ -372,11 +466,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	}
 
 	/*
-	 * Delete our process from the shared memory pairing heap.  We might
-	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already be deleted by the startup process.  The 'inReplayHeap' flag prevents
 	 * us from the double deletion.
 	 */
-	deleteLSNWaiter();
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
 
 	/*
 	 * If we didn't reach the target LSN, we must be exited by timeout.
@@ -386,3 +480,69 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 
 	return WAIT_LSN_RESULT_SUCCESS;
 }
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap.  We might already
+	 * be deleted by the waker process.  The 'inFlushHeap' flag prevents us
+	 * from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 59822f22b8d..9955e829190 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1022,10 +1022,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index eb77924c4be..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index df8202528b9..f9c303a8c7f 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -30,49 +30,67 @@ typedef enum
 										 * wait */
 } WaitLSNResult;
 
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
- * about the single process, which may wait for LSN replay.  An item of
- * waitLSN->procInfos array.
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
  */
 typedef struct WaitLSNProcInfo
 {
 	/* LSN, which this process is waiting for */
 	XLogRecPtr	waitLSN;
 
-	/* Process to wake up once the waitLSN is replayed */
+	/* Process to wake up once the waitLSN is reached */
 	ProcNumber	procno;
 
-	/*
-	 * A flag indicating that this item is present in
-	 * waitReplayLSNState->waitersHeap
-	 */
-	bool		inHeap;
+	/* Flags indicating membership in the replay and flush waiters heaps */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
 
-	/*
-	 * A pairing heap node for participation in
-	 * waitReplayLSNState->waitersHeap
-	 */
-	pairingheap_node phNode;
+	/* Pairing heap nodes for the replay and flush waiters heaps */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
 } WaitLSNProcInfo;
 
 /*
- * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
  */
 typedef struct WaitLSNState
 {
 	/*
-	 * The minimum LSN value some process is waiting for.  Used for the
+	 * The minimum replay LSN value some process is waiting for.  Used for the
 	 * fast-path checking if we need to wake up any waiters after replaying a
 	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
 	 */
-	pg_atomic_uint64 minWaitedLSN;
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
 
 	/*
-	 * A pairing heap of waiting processes order by LSN values (least LSN is
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
 	 * on top).  Protected by WaitLSNLock.
 	 */
-	pairingheap waitersHeap;
+	pairingheap flushWaitersHeap;
 
 	/*
 	 * An array with per-process information, indexed by the process number.
@@ -86,8 +104,10 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
-extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
-- 
2.51.0

#5Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#4)
3 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

On Sun, Sep 28, 2025 at 9:47 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Thu, Aug 28, 2025 at 4:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

Some changes in v3:
1) Update the header note of xlogwait.c to reflect its extended use for
flush waiting, covering internal use for both flush and replay waiting.
2) Update the comment above logical_read_xlog_page, which described the
pre-patch behavior of read_local_xlog_page.

In an off-list discussion, Alexander pointed out potential issues with
the current single-heap design for replay and flush when promotion
occurs concurrently with WAIT FOR. The following is a simple example
illustrating the problem:

During promotion, there's a window where we can have mixed waiter
types in the same heap:

T1: Process A calls read_local_xlog_page_guts on standby
T2: RecoveryInProgress() = TRUE, adds to heap as replay waiter
T3: Promotion begins
T4: EndRecovery() calls WaitLSNWakeup(InvalidXLogRecPtr)
T5: SharedRecoveryState = RECOVERY_STATE_DONE
T6: Process B calls read_local_xlog_page_guts
T7: RecoveryInProgress() = FALSE, adds to SAME heap as flush waiter

The problem is that replay LSNs and flush LSNs represent different
positions in the WAL stream. Having both types in the same heap can
lead to:
- Incorrect wakeup logic (comparing incomparable LSNs)
- Processes waiting forever
- Wrong waiters being woken up

To avoid this problem, patch v4 uses two separate heaps for flush and
replay, as Alexander suggested earlier. It also introduces a separate
minimum-LSN tracking field for flush waiters.

v5-0002 separates the waitlsn_cmp() comparator function into two distinct
functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay
and flush heaps, respectively.

Best,
Xuneng

Attachments:

v5-0000-cover-letter.patch (application/octet-stream)
From ccdf02bfdcca1807d9fe6bd1e39b0b185f81e5e6 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Thu, 28 Aug 2025 15:40:48 +0800
Subject: [PATCH v5 0/2] Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

This patch depends on:
  [PATCH v11] Implement WAIT FOR command
  https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO%2BBBjcirozJ6nYbOW8Q%40mail.gmail.com

Summary:
--------
This patch replaces the polling loop in read_local_xlog_page_guts()
with latch-based infrastructure, building on the WAIT FOR command
introduced by Kartyshov Ivan and Alexander Korotkov in the above patch.
The polling loop was inefficient during long waits; this version
integrates latches for more efficient wakeups.

Application:
------------
To apply:
  1. First apply v11-0001-Implement-WAIT-FOR-command.patch
  2. Then apply v5-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patch

Thanks,
Xuneng
v11-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 0ee9a9275cd811f70a49560e0715556820fb81be Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sat, 27 Sep 2025 23:26:22 +0800
Subject: [PATCH v11] Implement WAIT FOR command

WAIT FOR is to be used on a standby and specifies waiting for
a specific WAL location to be replayed.  This is useful when
the user makes data changes on the primary and needs a guarantee that
these changes are visible on the standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitations in this aspect.
---
 doc/src/sgml/high-availability.sgml           |  54 +++
 doc/src/sgml/ref/allfiles.sgml                |   1 +
 doc/src/sgml/ref/wait_for.sgml                | 234 +++++++++++
 doc/src/sgml/reference.sgml                   |   1 +
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogwait.c         | 388 ++++++++++++++++++
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/wait.c                   | 212 ++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/parser/gram.y                     |  33 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |  12 +-
 src/backend/tcop/utility.c                    |  22 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/access/xlogwait.h                 |  93 +++++
 src/include/commands/wait.h                   |  22 +
 src/include/lib/pairingheap.h                 |   3 +
 src/include/nodes/parsenodes.h                |   8 +
 src/include/parser/kwlist.h                   |   2 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/tcop/cmdtaglist.h                 |   1 +
 src/test/recovery/meson.build                 |   3 +-
 src/test/recovery/t/049_wait_for_lsn.pl       | 293 +++++++++++++
 src/tools/pgindent/typedefs.list              |   5 +
 30 files changed, 1435 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/wait_for.sgml
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/backend/commands/wait.c
 create mode 100644 src/include/access/xlogwait.h
 create mode 100644 src/include/commands/wait.h
 create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="read-your-writes-consistency">
+   <title>Read-Your-Writes Consistency</title>
+
+   <para>
+    In asynchronous replication, there is always a short window where changes
+    on the primary may not yet be visible on the standby due to replication
+    lag. This can lead to inconsistencies when an application writes data on
+    the primary and then immediately issues a read query on the standby.
+    However, it is possible to address this without switching to synchronous
+    replication.
+   </para>
+
+   <para>
+    To address this, PostgreSQL offers a mechanism for read-your-writes
+    consistency. The key idea is to ensure that a client sees its own writes
+    by synchronizing the WAL replay on the standby with the known point of
+    change on the primary.
+   </para>
+
+   <para>
+    This is achieved by the following steps.  After performing write
+    operations, the application retrieves the current WAL location using a
+    function call like this.
+
+    <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+    </programlisting>
+   </para>
+
+   <para>
+    The <acronym>LSN</acronym> obtained from the primary is then communicated
+    to the standby server. This can be managed at the application level or
+    via the connection pooler.  On the standby, the application issues the
+    <xref linkend="sql-wait-for"/> command to block further processing until
+    the standby's WAL replay process reaches (or exceeds) the specified
+    <acronym>LSN</acronym>.
+
+    <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+    </programlisting>
+    Once the command returns a status of success, it guarantees that all
+    changes up to the provided <acronym>LSN</acronym> have been applied,
+    ensuring that subsequent read queries will reflect the latest updates.
+   </para>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous Archiving in Standby</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY waitFor            SYSTEM "wait_for.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+    TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+    NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+    Waits until recovery replays <parameter>lsn</parameter>.
+    If no <parameter>timeout</parameter> is specified or it is set to
+    zero, this command waits indefinitely for the
+    <parameter>lsn</parameter>.
+    On timeout, or if the server is promoted before
+    <parameter>lsn</parameter> is reached, an error is emitted,
+    unless <literal>NO_THROW</literal> is specified in the WITH clause.
+    If <parameter>NO_THROW</parameter> is specified, then the command
+    doesn't throw errors.
+  </para>
+
+  <para>
+    The possible return values are <literal>success</literal>,
+    <literal>timeout</literal>, and <literal>not in recovery</literal>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">lsn</replaceable></term>
+    <listitem>
+     <para>
+      Specifies the target <acronym>LSN</acronym> to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+    <listitem>
+     <para>
+      This clause specifies optional parameters for the wait operation.
+      The following parameters are supported:
+
+      <variablelist>
+       <varlistentry>
+        <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+        <listitem>
+         <para>
+          When specified and <parameter>timeout</parameter> is greater than zero,
+          the command waits until <parameter>lsn</parameter> is reached or
+          the specified <parameter>timeout</parameter> has elapsed.
+         </para>
+         <para>
+          The <parameter>timeout</parameter> can be given as an integer number of
+          milliseconds.  Alternatively, it can be given as a string literal
+          containing an integer number of milliseconds or a number with a unit
+          (see <xref linkend="config-setting-names-values"/>).
+         </para>
+        </listitem>
+       </varlistentry>
+
+       <varlistentry>
+        <term><literal>NO_THROW</literal></term>
+        <listitem>
+         <para>
+          Specifies not to throw an error in case of a timeout or when
+          running on the primary.  In this case the result status can be
+          obtained from the return value.
+         </para>
+        </listitem>
+       </varlistentry>
+      </variablelist>
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Outputs</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>success</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that we have successfully reached
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>timeout</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the timeout happened before reaching
+      the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>not in recovery</literal></term>
+    <listitem>
+     <para>
+      This return value denotes that the database server is not in a recovery
+      state.  This might mean either the database server was not in recovery
+      at the moment of receiving the command, or it was promoted before
+      reaching the target <parameter>lsn</parameter>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+    The <command>WAIT FOR</command> command waits until
+    <parameter>lsn</parameter> is replayed on the standby.
+    That is, after this command's execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater than or
+    equal to the <parameter>lsn</parameter> value.  This is useful to achieve
+    read-your-writes consistency while using an asynchronous replica for
+    reads and the primary for writes.  In that case, the <acronym>LSN</acronym>
+    of the last modification should be stored on the client application side
+    or the connection pooler side.
+  </para>
+
+  <para>
+    The <command>WAIT FOR</command> command should be called on a standby.
+    If a user runs <command>WAIT FOR</command> on a primary, it
+    will error out unless <parameter>NO_THROW</parameter> is specified in the
+    WITH clause.  However, if <command>WAIT FOR</command> is
+    called on a primary promoted from standby and <literal>lsn</literal>
+    was already replayed, then the <command>WAIT FOR</command> command
+    exits immediately.
+  </para>
+
+</refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+    You can use the <command>WAIT FOR</command> command to wait for
+    a <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+    the changes were made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on the primary server to get the <acronym>LSN</acronym>, given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+   Then an application could run <command>WAIT FOR</command>
+   with the <parameter>lsn</parameter> obtained from the primary.  After that,
+   the changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+  </para>
+
+  <para>
+    If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+  </para>
+
+  <para>
+   The same example uses <command>WAIT FOR</command> with the
+   <parameter>NO_THROW</parameter> option.
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &waitFor;
 
  </reference>
 
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 109713315c0..36b8ac6b855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -6222,6 +6223,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeup(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..824b0942b34 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for the replay of the given LSN on a
+ *		physical standby.  The core idea is very small: every backend that
+ *		wants to wait publishes the LSN it needs to the shared memory, and
+ *		the startup process wakes it once that LSN has been replayed.
+ *
+ *		The shared memory used by this module comprises procInfos, a
+ *		per-backend array holding the awaited LSN for each backend
+ *		process.  The elements of that array are organized into a pairing
+ *		heap, waitersHeap, which allows very fast lookup of the least
+ *		awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself in shared memory
+ *		and waits on its latch until it is woken up by the startup process,
+ *		the timeout is reached, the standby is promoted, or the postmaster
+ *		dies.  Then, it cleans up its information in shared memory.
+ *
+ *		After replaying a WAL record, the startup process first performs the
+ *		fast-path check minWaitedLSN > replayLSN.  If that check fails, it
+ *		scans waitersHeap and wakes up the backends whose awaited LSNs have
+ *		been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for the waitLSNState->waitersHeap heap.  Waiting
+ * processes are ordered by LSN, so the waiter with the smallest LSN is on top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack.  It should be large enough for a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+	int			i;
+	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int			numWakeUpProcs;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate over the pairing heap of waiting processes until we find
+		 * an LSN not yet replayed.  Record the process numbers to wake up,
+		 * but to avoid holding the lock for too long, send the wakeups only
+		 * after releasing the lock.
+		 */
+		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) &&
+				procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+			procInfo->inHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN();
+
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs have already been
+		 * replayed.  As this can be time-consuming, we do it outside of
+		 * WaitLSNLock.  This is actually fine, because procLatch is never
+		 * freed, so at worst we might set the latch of the wrong process
+		 * (or of no process).
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+		/* Need to recheck if there were more waiters than static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our entry from the shared memory waiters heap, if present.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch until the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in
+ * recovery, or if the replica was promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this function was mistakenly called on the
+		 * primary.  However, the standby might have been promoted
+		 * concurrently with the call, after the target LSN was replayed.
+		 * So, we still check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  The target LSN might
+	 * get replayed before we do so.  The recheck at the beginning of the
+	 * loop below closes that race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but recheck whether the target LSN was
+			 * already replayed.  See the comment on deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter();
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might have
+	 * already been deleted by the startup process.  The 'inHeap' flag
+	 * prevents a double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited due to the timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
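As a side note for reviewers, the chunked wake-up behavior of WaitLSNWakeup() above can be modeled outside the server. The sketch below is not PostgreSQL code (all names are invented, the pairing heap is replaced by a sorted array, and latch setting by a counter); it only demonstrates that looping while a chunk comes back full drains every eligible waiter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Chunk size, mirroring WAKEUP_PROC_STATIC_ARRAY_SIZE in the patch. */
#define MODEL_CHUNK_SIZE 16

/*
 * Collect up to chunk_size waiters whose LSN is <= current_lsn, starting
 * at index 'start' of an array sorted ascending by waited LSN (the order
 * in which the pairing heap yields its top element).  Returns how many
 * waiters were collected in this "locked" pass.
 */
static size_t
collect_wakeups(const uint64_t *waiters, size_t nwaiters, size_t start,
				uint64_t current_lsn, size_t chunk_size)
{
	size_t		n = 0;

	while (start + n < nwaiters && waiters[start + n] <= current_lsn)
	{
		n++;
		if (n == chunk_size)
			break;				/* chunk full: caller must loop again */
	}
	return n;
}

/*
 * Drive the loop the way WaitLSNWakeup() does: keep taking chunks while
 * a chunk comes back full, since more eligible waiters may remain.
 */
static size_t
total_wakeups(const uint64_t *waiters, size_t nwaiters,
			  uint64_t current_lsn, size_t chunk_size)
{
	size_t		total = 0;
	size_t		got;

	do
	{
		got = collect_wakeups(waiters, nwaiters, total, current_lsn,
							  chunk_size);
		total += got;			/* here the real code sets the latches */
	} while (got == chunk_size);
	return total;
}
```

With 40 waiters at LSNs 1..40 and current_lsn = 35, the loop runs three times (16 + 16 + 3 wake-ups), matching the do/while exit condition in the patch.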
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or an LSN having been replayed on a replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+	XLogRecPtr	lsn;
+	int64		timeout = 0;
+	WaitLSNResult waitLSNResult;
+	bool		throw = true;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	const char *result = "<unset>";
+	bool		timeout_specified = false;
+	bool		no_throw_specified = false;
+
+	/* Parse and validate the mandatory LSN */
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+										  CStringGetDatum(stmt->lsn_literal)));
+
+	foreach_node(DefElem, defel, stmt->options)
+	{
+		if (strcmp(defel->defname, "timeout") == 0)
+		{
+			char       *timeout_str;
+			const char *hintmsg;
+			double      result;
+
+			if (timeout_specified)
+				errorConflictingDefElem(defel, pstate);
+			timeout_specified = true;
+
+			timeout_str = defGetString(defel);
+
+			if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+			{
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("invalid timeout value: \"%s\"", timeout_str),
+						hintmsg ? errhint("%s", _(hintmsg)) : 0);
+			}
+
+			/*
+			 * Get rid of any fractional part in the input. This is so we
+			 * don't fail on just-out-of-range values that would round
+			 * into range.
+			 */
+			result = rint(result);
+
+			/* Range check */
+			if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+				ereport(ERROR,
+						errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						errmsg("timeout value is out of range"));
+
+			if (result < 0)
+				ereport(ERROR,
+						errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						errmsg("timeout cannot be negative"));
+
+			timeout = (int64) result;
+		}
+		else if (strcmp(defel->defname, "no_throw") == 0)
+		{
+			if (no_throw_specified)
+				errorConflictingDefElem(defel, pstate);
+
+			no_throw_specified = true;
+
+			throw = !defGetBoolean(defel);
+		}
+		else
+		{
+			ereport(ERROR,
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("option \"%s\" not recognized",
+							defel->defname),
+					parser_errposition(pstate, defel->location));
+		}
+	}
+
+	/*
+	 * We are going to wait for the LSN replay.  We must first ensure that we
+	 * hold no snapshot and, correspondingly, that our MyProc->xmin is
+	 * invalid.  Otherwise, our snapshot could prevent the replay of WAL
+	 * records, resulting in a kind of self-deadlock.  This is why WAIT FOR
+	 * is a command rather than a procedure or function.
+	 *
+	 * First, check whether there is an active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() tolerates that.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * Second, invalidate the catalog snapshot if any.  That completes the
+	 * preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered snapshot. */
+	if (HaveRegisteredOrActiveSnapshot())
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("WAIT FOR must only be called without an active or registered snapshot"),
+				errdetail("WAIT FOR cannot be executed from a function or procedure, or within a transaction with an isolation level higher than READ COMMITTED."));
+
+	/*
+	 * As a result, we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+	/*
+	 * Process the result of WaitForLSNReplay().  Throw appropriate error if
+	 * needed.
+	 */
+	switch (waitLSNResult)
+	{
+		case WAIT_LSN_RESULT_SUCCESS:
+			/* Nothing to do on success */
+			result = "success";
+			break;
+
+		case WAIT_LSN_RESULT_TIMEOUT:
+			if (throw)
+				ereport(ERROR,
+						errcode(ERRCODE_QUERY_CANCELED),
+						errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+							   LSN_FORMAT_ARGS(lsn),
+							   LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+			else
+				result = "timeout";
+			break;
+
+		case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+			if (throw)
+			{
+				if (PromoteIsTriggered())
+				{
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+									  LSN_FORMAT_ARGS(lsn),
+									  LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+				}
+				else
+					ereport(ERROR,
+							errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+							errmsg("recovery is not in progress"),
+							errhint("Waiting for the replay LSN can only be executed during recovery."));
+			}
+			else
+				result = "not in recovery";
+			break;
+	}
+
+	/* need a tuple descriptor representing a single TEXT column */
+	tupdesc = WaitStmtResultDesc(stmt);
+
+	/* prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+	/* Send it */
+	do_text_output_oneline(tstate, result);
+
+	end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+	TupleDesc	tupdesc;
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+					   TEXTOID, -1, 0);
+	return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..fd95f24fa74 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -319,6 +319,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <boolean>		opt_concurrently
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
+%type <list>		opt_wait_with_clause
 %type <list>		utility_option_list
 %type <defelt>		utility_option_elem
 %type <str>			utility_option_name
@@ -671,7 +672,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 				json_object_constructor_null_clause_opt
 				json_array_constructor_null_clause_opt
 
-
 /*
  * Non-keyword token types.  These are hard-wired into the "flex" lexer.
  * They must be listed first so that their numeric codes do not depend on
@@ -741,7 +741,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -785,7 +785,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1113,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -16403,6 +16404,26 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+			WAIT FOR LSN_P Sconst opt_wait_with_clause
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn_literal = $4;
+					n->options = $5;
+					$$ = (Node *) n;
+				}
+			;
+
+opt_wait_with_clause:
+			opt_with '(' utility_option_list ')'	{ $$ = $3; }
+			| /*EMPTY*/							    { $$ = NIL; }
+			;
 
 /*
  * Aggregate decoration clauses
@@ -17882,6 +17903,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18051,6 +18073,7 @@ unreserved_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -18497,6 +18520,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN_P
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18708,6 +18732,7 @@ bare_label_keyword:
 			| VIEWS
 			| VIRTUAL
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+	 * stack from under us, so don't complain if it's now empty.  Otherwise,
+	 * our snapshot should be the top one; pop it.  Note that this could be a
+	 * different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
 		IsA(utilityStmt, ListenStmt) ||
 		IsA(utilityStmt, NotifyStmt) ||
 		IsA(utilityStmt, UnlistenStmt) ||
-		IsA(utilityStmt, CheckPointStmt))
+		IsA(utilityStmt, CheckPointStmt) ||
+		IsA(utilityStmt, WaitStmt))
 		return false;
 
 	return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
 		case T_VariableSetStmt:
+		case T_WaitStmt:
 			{
 				/*
 				 * These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+			}
+			break;
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
 		case T_VariableShowStmt:
 			return true;
 
+		case T_WaitStmt:
+			return true;
+
 		default:
 			return false;
 	}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
 				return GetPGVariableResultDesc(n->name);
 			}
 
+		case T_WaitStmt:
+			return WaitStmtResultDesc((WaitStmt *) parsetree);
+
 		default:
 			return NULL;
 	}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
 			}
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_DDL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 			/* already-planned queries */
 		case T_PlannedStmt:
 			{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +356,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is replayed */
+	ProcNumber	procno;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+
+	/*
+	 * A pairing heap node for participation in
+	 * waitLSNState->waitersHeap
+	 */
+	pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path check of whether to wake any waiters after replaying a WAL
+	 * record.  Can be read lock-free; updates are protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes ordered by LSN value (the least
+	 * LSN is on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif							/* XLOG_WAIT_H */
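One non-obvious point about the header above: pairingheap keeps the node that compares greater on top, so the comparator for waitersHeap must be inverted to obtain a min-heap by LSN. A tiny standalone model (invented names, plain uint64_t values instead of heap nodes) of that inversion:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of waitlsn_cmp(): return 1 when a's LSN is smaller, so the
 * smallest waited LSN "wins" and sits on top of the (max-)pairing heap.
 */
static int
model_waitlsn_cmp(uint64_t a_lsn, uint64_t b_lsn)
{
	if (a_lsn < b_lsn)
		return 1;
	else if (a_lsn > b_lsn)
		return -1;
	return 0;
}

/*
 * Pick the "top" element of an array under this comparator: the one
 * that compares greater than every other.  With the inverted comparator
 * this is the minimum LSN, i.e. what updateMinWaitedLSN() publishes.
 */
static uint64_t
model_heap_top(const uint64_t *lsns, int n)
{
	uint64_t	top = lsns[0];

	for (int i = 1; i < n; i++)
		if (model_waitlsn_cmp(lsns[i], top) > 0)
			top = lsns[i];
	return top;
}
```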
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif							/* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f1706df58fd..997c72ab858 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4363,4 +4363,12 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn_literal;	/* LSN string from grammar */
+	List	   *options;		/* List of DefElem nodes */
+} WaitStmt;
+
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..69a81e21fbb 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -269,6 +269,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -494,6 +495,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
       't/045_archive_restartpoint.pl',
       't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
-      't/048_vacuum_horizon_floor.pl'
+      't/048_vacuum_horizon_floor.pl',
+      't/049_wait_for_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for LSN replay on a standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on the standby is at least as big as the LSN
+# we observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+	"standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}';
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+	"standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout.  The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level of REPEATABLE READ or higher"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /WAIT FOR must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+	stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+	"get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+	stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+	"get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN 'invalid_lsn';",
+	stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+	"get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+	stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+	"get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+	"WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+	'postgres',
+	"WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+	stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+	"get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions, each waiting for a corresponding insertion.  When the wait
+# finishes, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		WAIT FOR LSN '${lsn}';
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby will be promoted at least at the primary insert LSN
+# we have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+	"WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q via $psql_session->quit, the command may reach a session
+# that is already closed.  So \q is in the initial script; here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5a80b4359f..ac0252936be 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3257,7 +3257,12 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
 WaitPMResult
+WaitStmt
+WaitStmtParam
 WalCloseMethod
 WalCompression
 WalInsertClass
-- 
2.51.0

v5-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patchapplication/octet-stream; name=v5-0002-Improve-read_local_xlog_page_guts-by-replacing-po.patchDownload
From a8c9055ecd252166f009c9b94120deb636d1c4e0 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 28 Sep 2025 18:44:54 +0800
Subject: [PATCH v5] Improve read_local_xlog_page_guts by replacing polling 
 with latch-based waiting

Replace the inefficient polling loop in read_local_xlog_page_guts() with
latch-based waiting when WAL data is not yet available.  This eliminates
CPU-intensive busy waiting and improves responsiveness by waking processes
immediately once their target LSN becomes available.
---
 src/backend/access/transam/xlog.c             |  20 +-
 src/backend/access/transam/xlogrecovery.c     |   4 +-
 src/backend/access/transam/xlogutils.c        |  48 ++-
 src/backend/access/transam/xlogwait.c         | 329 +++++++++++++-----
 src/backend/replication/walsender.c           |   4 -
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/access/xlogwait.h                 |  58 ++-
 7 files changed, 347 insertions(+), 117 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 36b8ac6b855..76c5ad7ae26 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2913,6 +2913,15 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, walk the
+	 * shared-memory heap of waiters and set their latches to notify
+	 * them.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3095,6 +3104,15 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for, walk the
+	 * shared-memory heap of waiters and set their latches to notify
+	 * them.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6227,7 +6245,7 @@ StartupXLOG(void)
 	 * Wake up all waiters for replay LSN.  They need to report an error that
 	 * recovery was ended before reaching the target LSN.
 	 */
-	WaitLSNWakeup(InvalidXLogRecPtr);
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
 
 	/*
 	 * Shutdown the recovery environment.  This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 824b0942b34..1859d2084e8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
 			 */
 			if (waitLSNState &&
 				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
-				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
-				WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
 
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 38176d9688e..0ea02a45c6b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for xlog to be available if necessary.
 	 */
 	while (1)
 	{
@@ -927,7 +923,6 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
-
 			if (loc <= read_upto)
 				break;
 
@@ -947,7 +942,44 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for LSN using appropriate method based on server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSNFlush(loc);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSNReplay(loc, 0);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting. This is the tricky case.
+						 * We're now a primary, so loop back and use flush
+						 * logic instead of replay logic.
+						 */
+						break;
+
+					default:
+						/* Shouldn't happen without timeout */
+						elog(ERROR, "unexpected wait result: %d", (int) result);
+				}
+			}
+
+			/*
+			 * Loop back to recheck everything.
+			 * Timeline might have changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 4d831fbfa74..3916a9163d5 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -1,8 +1,8 @@
 /*-------------------------------------------------------------------------
  *
  * xlogwait.c
- *	  Implements waiting for the given replay LSN, which is used in
- *	  WAIT FOR lsn '...'
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *	  Used by WAIT FOR lsn '...' and internal WAL reading operations.
  *
  * Copyright (c) 2025, PostgreSQL Global Development Group
  *
@@ -10,10 +10,11 @@
  *	  src/backend/access/transam/xlogwait.c
  *
  * NOTES
- *		This file implements waiting for the replay of the given LSN on a
- *		physical standby.  The core idea is very small: every backend that
- *		wants to wait publishes the LSN it needs to the shared memory, and
- *		the startup process wakes it once that LSN has been replayed.
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs in
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
  *
  *		The shared memory used by this module comprises a procInfos
  *		per-backend array with the information of the awaited LSN for each
@@ -23,14 +24,18 @@
  *
  *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
  *		waiter process publishes information about itself to the shared
- *		memory and waits on the latch before it wakens up by a startup
+ *		memory and waits on the latch until it is woken up by the appropriate
  *		process, timeout is reached, standby is promoted, or the postmaster
  *		dies.  Then, it cleans information about itself in the shared memory.
  *
- *		After replaying a WAL record, the startup process first performs a
- *		fast path check minWaitedLSN > replayLSN.  If this check is negative,
- *		it checks waitersHeap and wakes up the backend whose awaited LSNs
- *		are reached.
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast path check minWaitedLSN > replayLSN.  If this
+ *		check is negative, it checks waitersHeap and wakes up the backend
+ *		whose awaited LSNs are reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
  *
  *-------------------------------------------------------------------------
  */
@@ -53,8 +58,10 @@
 #include "utils/snapmgr.h"
 
 
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
 
-static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
 						void *arg);
 
 struct WaitLSNState *waitLSNState = NULL;
@@ -81,22 +88,29 @@ WaitLSNShmemInit(void)
 														  &found);
 	if (!found)
 	{
-		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
-		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
 		memset(&waitLSNState->procInfos, 0,
 			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
 	}
 }
 
 /*
- * Comparison function for waitReplayLSN->waitersHeap heap.  Waiting processes are
- * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ * Comparison function for the replay waiters heap.  Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
  */
 static int
-waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 {
-	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
-	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
 
 	if (aproc->waitLSN < bproc->waitLSN)
 		return 1;
@@ -107,65 +121,106 @@ waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 }
 
 /*
- * Update waitReplayLSN->minWaitedLSN according to the current state of
- * waitReplayLSN->waitersHeap.
+ * Comparison function for the flush waiters heap.  Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update the minimum waited LSN for the specified operation type.
  */
 static void
-updateMinWaitedLSN(void)
+updateMinWaitedLSN(WaitLSNOperation operation)
 {
-	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
 
-	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	if (operation == WAIT_LSN_REPLAY)
 	{
-		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
-
-		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
 	}
-
-	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
 }
 
 /*
- * Put the current process into the heap of LSN waiters.
+ * Add the current process to the appropriate waiters heap for the operation type.
  */
 static void
-addLSNWaiter(XLogRecPtr lsn)
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
 {
 	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
 
 	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
-	Assert(!procInfo->inHeap);
-
 	procInfo->procno = MyProcNumber;
 	procInfo->waitLSN = lsn;
 
-	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
-	procInfo->inHeap = true;
-	updateMinWaitedLSN();
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
 
 	LWLockRelease(WaitLSNLock);
 }
 
 /*
- * Remove the current process from the heap of LSN waiters if it's there.
+ * Remove the current process from the appropriate waiters heap for the operation type.
  */
 static void
-deleteLSNWaiter(void)
+deleteLSNWaiter(WaitLSNOperation operation)
 {
 	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
 
 	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
-	if (!procInfo->inHeap)
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
 	{
-		LWLockRelease(WaitLSNLock);
-		return;
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
 	}
-
-	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
-	procInfo->inHeap = false;
-	updateMinWaitedLSN();
 
 	LWLockRelease(WaitLSNLock);
 }
@@ -177,7 +232,7 @@ deleteLSNWaiter(void)
 #define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
 
 /*
- * Remove waiters whose LSN has been replayed from the heap and set their
+ * Remove waiters whose LSN has been reached from the heap and set their
  * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
  * and set latches for all waiters.
  *
@@ -188,12 +243,18 @@ deleteLSNWaiter(void)
  * if there are more waiters, this function will loop to process them in
  * multiple chunks.
  */
-void
-WaitLSNWakeup(XLogRecPtr currentLSN)
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
 {
-	int			i;
-	ProcNumber	wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
-	int			numWakeUpProcs;
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
 
 	do
 	{
@@ -201,35 +262,42 @@ WaitLSNWakeup(XLogRecPtr currentLSN)
 		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 		/*
-		 * Iterate the pairing heap of waiting processes till we find LSN not
-		 * yet replayed.  Record the process numbers to wake up, but to avoid
-		 * holding the lock for too long, send the wakeups only after
-		 * releasing the lock.
+		 * Iterate the waiters heap until we find an LSN not yet reached.
+		 * Record the process numbers to wake up, but send the wakeups only
+		 * after releasing the lock, to avoid holding it for too long.
 		 */
-		while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+		while (!pairingheap_is_empty(heap))
 		{
-			pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
-			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
 
-			if (!XLogRecPtrIsInvalid(currentLSN) &&
-				procInfo->waitLSN > currentLSN)
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
 				break;
 
 			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
 			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
-			(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
-			procInfo->inHeap = false;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
 
 			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
 				break;
 		}
 
-		updateMinWaitedLSN();
-
+		updateMinWaitedLSN(operation);
 		LWLockRelease(WaitLSNLock);
 
 		/*
-		 * Set latches for processes, whose waited LSNs are already replayed.
+		 * Set latches for processes whose waited LSNs have already been reached.
 		 * As the time consuming operations, we do this outside of
 		 * WaitLSNLock. This is  actually fine because procLatch isn't ever
 		 * freed, so we just can potentially set the wrong process' (or no
@@ -238,25 +306,54 @@ WaitLSNWakeup(XLogRecPtr currentLSN)
 		for (i = 0; i < numWakeUpProcs; i++)
 			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
 
-		/* Need to recheck if there were more waiters than static array size. */
-	}
-	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+		/* Need to recheck if there were more waiters than the static array size. */
+	}
+	while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check; InvalidXLogRecPtr means wake up all waiters */
+	if (!XLogRecPtrIsInvalid(currentLSN) &&
+		pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check; InvalidXLogRecPtr means wake up all waiters */
+	if (!XLogRecPtrIsInvalid(currentLSN) &&
+		pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
 }
 
 /*
- * Delete our item from shmem array if any.
+ * Clean up LSN waiters for exiting process
  */
 void
 WaitLSNCleanup(void)
 {
-	/*
-	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
-	 * flag is set to true only by the process itself.  So, it's only possible
-	 * to get a false positive.  But that will be eliminated by a recheck
-	 * inside deleteLSNWaiter().
-	 */
-	if (waitLSNState->procInfos[MyProcNumber].inHeap)
-		deleteLSNWaiter();
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
 }
 
 /*
@@ -308,11 +405,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	}
 
 	/*
-	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * Add our process to the replay waiters heap.  It might happen that
 	 * target LSN gets replayed before we do.  Another check at the beginning
 	 * of the loop below prevents the race condition.
 	 */
-	addLSNWaiter(targetLSN);
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
 
 	for (;;)
 	{
@@ -326,7 +423,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 			 * Recovery was ended, but recheck if target LSN was already
 			 * replayed.  See the comment regarding deleteLSNWaiter() below.
 			 */
-			deleteLSNWaiter();
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
 			currentLSN = GetXLogReplayRecPtr(NULL);
 			if (PromoteIsTriggered() && targetLSN <= currentLSN)
 				return WAIT_LSN_RESULT_SUCCESS;
@@ -372,11 +469,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	}
 
 	/*
-	 * Delete our process from the shared memory pairing heap.  We might
-	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already be deleted by the startup process.  The 'inReplayHeap' flag
+	 * prevents a double deletion.
 	 */
-	deleteLSNWaiter();
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
 
 	/*
 	 * If we didn't reach the target LSN, we must be exited by timeout.
@@ -386,3 +483,69 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 
 	return WAIT_LSN_RESULT_SUCCESS;
 }
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap. We might
+	 * already be deleted by the waker process.  The 'inFlushHeap' flag
+	 * prevents a double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 59822f22b8d..9955e829190 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1022,10 +1022,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index eb77924c4be..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
 WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index df8202528b9..f9c303a8c7f 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -30,49 +30,67 @@ typedef enum
 										 * wait */
 } WaitLSNResult;
 
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
 /*
  * WaitLSNProcInfo - the shared memory structure representing information
- * about the single process, which may wait for LSN replay.  An item of
- * waitLSN->procInfos array.
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
  */
 typedef struct WaitLSNProcInfo
 {
 	/* LSN, which this process is waiting for */
 	XLogRecPtr	waitLSN;
 
-	/* Process to wake up once the waitLSN is replayed */
+	/* Process to wake up once the waitLSN is reached */
 	ProcNumber	procno;
 
-	/*
-	 * A flag indicating that this item is present in
-	 * waitReplayLSNState->waitersHeap
-	 */
-	bool		inHeap;
+	/* Per-heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
 
-	/*
-	 * A pairing heap node for participation in
-	 * waitReplayLSNState->waitersHeap
-	 */
-	pairingheap_node phNode;
+	/* Separate pairing heap nodes, one per waiters heap */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
 } WaitLSNProcInfo;
 
 /*
- * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
  */
 typedef struct WaitLSNState
 {
 	/*
-	 * The minimum LSN value some process is waiting for.  Used for the
+	 * The minimum replay LSN value some process is waiting for.  Used for the
 	 * fast-path checking if we need to wake up any waiters after replaying a
 	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
 	 */
-	pg_atomic_uint64 minWaitedLSN;
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
 
 	/*
-	 * A pairing heap of waiting processes order by LSN values (least LSN is
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
 	 * on top).  Protected by WaitLSNLock.
 	 */
-	pairingheap waitersHeap;
+	pairingheap flushWaitersHeap;
 
 	/*
 	 * An array with per-process information, indexed by the process number.
@@ -86,8 +104,10 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
 
 extern Size WaitLSNShmemSize(void);
 extern void WaitLSNShmemInit(void);
-extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
 extern void WaitLSNCleanup(void);
 extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
 
 #endif							/* XLOG_WAIT_H */
-- 
2.51.0

#6Michael Paquier
michael@paquier.xyz
In reply to: Xuneng Zhou (#5)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

On Thu, Oct 02, 2025 at 11:06:14PM +0800, Xuneng Zhou wrote:

v5-0002 separates the waitlsn_cmp() comparator function into two distinct
functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay
and flush heaps, respectively.

The primary goal that you want to achieve here is a replacement of the
wait/sleep logic of read_local_xlog_page_guts() with a condition
variable, and the design of a new facility to make the callback more
responsive on polls. That's a fine idea in itself. However I would
suggest to implement something that does not depend entirely on WAIT
FOR, which is how your patch is presented. Instead of having your
patch depend on an entirely different feature, it seems to me that you
should try to extract from this other feature the basics that you are
looking for, and make them shared between the WAIT FOR patch and what
you are trying to achieve here. You should not need something as
complex as what the other feature needs for a page read callback in
the backend.

At the end, I suspect that you will reuse a slight portion of it (or
perhaps nothing at all, actually, but I did not look at the full scope
of it). You should try to present your patch so that it is in a
reviewable state, so that others are able to read it and understand
it. WAIT FOR is much more complex than what you want to do here
because it covers larger areas of the code base and needs to worry
about more cases. So, you should implement things so that the basic
pieces you want to build on top of are simpler, not more complicated.
Simpler means easier to review and easier to catch problems, designed
in a way that depends on how you want to fix your problem, not
designed in a way that depends on how a completely different feature
deals with its own problems.
--
Michael

#7Xuneng Zhou
xunengzhou@gmail.com
In reply to: Michael Paquier (#6)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi Michael,

Thanks for your review.

On Fri, Oct 3, 2025 at 2:24 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Oct 02, 2025 at 11:06:14PM +0800, Xuneng Zhou wrote:

v5-0002 separates the waitlsn_cmp() comparator function into two distinct
functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay
and flush heaps, respectively.

The primary goal that you want to achieve here is a replacement of the
wait/sleep logic of read_local_xlog_page_guts() with a condition
variable, and design a new facility to make the callback more
responsive on polls. That's a fine idea in itself. However I would
suggest to implement something that does not depend entirely on WAIT
FOR, which is how your patch is presented. Instead of having your
patch depend on an entirely different feature, it seems to me that you
should try to extract from this other feature the basics that you are
looking for, and make them shared between the WAIT FOR patch and what
you are trying to achieve here. You should not need something as
complex as what the other feature needs for a page read callback in
the backend.

At the end, I suspect that you will reuse a slight portion of it (or
perhaps nothing at all, actually, but I did not look at the full scope
of it). You should try to present your patch so as it is in a
reviewable state, so as others would be able to read it and understand
it. WAIT FOR is much more complex than what you want to do here
because it covers larger areas of the code base and needs to worry
about more cases. So, you should implement things so as the basic
pieces you want to build on top of are simpler, not more complicated.
Simpler means easier to review and easier to catch problems, designed
in a way that depends on how you want to fix your problem, not
designed in a way that depends on how a completely different feature
deals with its own problems.

The core infrastructure shared by both this patch and the WAIT FOR
command patch is primarily in xlogwait.c, which provides the mechanism
for waiting until a given LSN is reached. Other parts of the code in
the WAIT FOR patch—covering the SQL command implementation,
documentation, and tests—is not relevant for the current patch. What
we need is only the infrastructure in xlogwait.c, on which we can
implement the optimization for read_local_xlog_page_guts.

Regarding complexity: the initial optimization idea was to introduce
condition-variable based waiting, as Heikki suggested in his comment:

/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/

I reviewed the relevant code in WalSndWakeup and WalSndWait. While
these mechanisms reduce polling overhead, they don’t prevent false
wakeups. Addressing that would likely require a request queue that
maps waiters to their target LSNs and issues targeted wakeups—a much
more complex design. Given that read_local_xlog_page_guts is not as
performance-sensitive as its equivalents, this added complexity may
not be justified. So I implemented the initial version of the
optimization like WalSndWakeup and WalSndWait.

After this, I came across the WAIT FOR patch in the mailing list and
noticed that the infrastructure in xlogwait.c aligns well with our
needs. Based on that, I built the current patch using this shared
infra.

If we prioritise simplicity and can tolerate occasional false wakeups,
then waiting in read_local_xlog_page_guts can be implemented in a much
simpler way than the current version. At the same time, the WAIT FOR
command seems to be a valuable feature in its own right, and both
patches can naturally share the same infrastructure. Alternatively, we could
extract that infrastructure on its own, build the current patch on it, and
let WAIT FOR use it later. Personally, I don’t have a strong preference
between the two approaches.

Best,
Xuneng

#8Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#7)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

On Fri, Oct 3, 2025 at 9:50 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Another potential use for this infra could be static XLogRecPtr
WalSndWaitForWal(XLogRecPtr loc); I'm planning to hack a version as
well.

Best,
Xuneng

#9Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#8)
1 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

On Sat, Oct 4, 2025 at 10:25 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

v6 refactors and extends the infrastructure from the WAIT FOR command
patch, applying it to read_local_xlog_page_guts. I'm also thinking of
creating a standalone patch/commit for the extended
infra in xlogwait, so it can be reused in different threads.

Best,
Xuneng

Attachments:

v6-0001-Replace-inefficient-polling-loops-in-read_local_x.patch
From a6b3ce216180b8d9f2570d46dc9a767b626d5cda Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sat, 4 Oct 2025 14:52:11 +0800
Subject: [PATCH v6] Replace polling in read_local_xlog_page_guts with
 latch-based waiting

Replace the inefficient polling loop in read_local_xlog_page_guts() with
latch-based waiting when WAL data is not yet available.  This eliminates
CPU-intensive busy waiting and improves responsiveness by waking processes
immediately when their target LSN becomes available.

The queue of waiters is stored in shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on top.  During WAL
replay or flush, waiters whose LSNs have already been replayed or flushed
are removed from the pairing heap and woken up by setting their latches.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Xuneng Zhou <xunengzhou@gmail.com>
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |  25 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/access/transam/xlogutils.c        |  48 +-
 src/backend/access/transam/xlogwait.c         | 551 ++++++++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/replication/walsender.c           |   4 -
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 113 ++++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 15 files changed, 781 insertions(+), 15 deletions(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..cff53106f76 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -2912,6 +2913,15 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3094,6 +3104,15 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6225,6 +6244,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..1859d2084e8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 38176d9688e..0ea02a45c6b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for xlog to become available if necessary.
 	 */
 	while (1)
 	{
@@ -927,7 +923,6 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
-
 			if (loc <= read_upto)
 				break;
 
@@ -947,7 +942,44 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for LSN using appropriate method based on server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSNFlush(loc);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSNReplay(loc, 0);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting. This is the tricky case.
+						 * We're now a primary, so loop back and use flush
+						 * logic instead of replay logic.
+						 */
+						break;
+
+					default:
+						/* Shouldn't happen without timeout */
+						elog(ERROR, "unexpected wait result");
+				}
+			}
+
+			/*
+			 * Loop back to recheck everything.
+			 * Timeline might have changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..b4d5e9354ef
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,551 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *	  Used by internal WAL reading operations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a per-backend array
+ *		procInfos holding the LSN each backend process is waiting for.  The
+ *		elements of that array are organized into pairing heaps (one for
+ *		replay waiters, one for flush waiters), which allow very fast lookup
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least awaited LSN of each heap is cached as
+ *		minWaitedReplayLSN/minWaitedFlushLSN.  A waiter publishes its info
+ *		to the shared memory and waits on its latch until it is woken up by
+ *		the appropriate process, the timeout is reached, the standby is
+ *		promoted, or the postmaster dies; then it removes its info again.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check minWaitedReplayLSN > replayLSN.  If
+ *		this check fails, it scans the replay waiters heap and wakes up the
+ *		backends whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static, stack-allocated array of procs to wake up in
+ * wakeupWaiters().  A single iteration should be enough in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate over the waiters heap until we find an LSN not yet reached.
+		 * Record process numbers to wake up; send the wakeups after releasing the lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for the processes whose waited LSNs have been reached.
+		 * As this is potentially time-consuming, we do it outside of
+		 * WaitLSNLock.  This is actually fine because procLatch is never
+		 * freed, so at worst we might set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself, so it's only
+		 * possible to get a false positive, and that will be eliminated by
+		 * the recheck inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.  Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * replayed.  Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on a primary.
+		 * However, it's possible that the standby was promoted concurrently
+		 * with the call, after the target LSN had been replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the replay waiters heap.  The target LSN might get
+	 * replayed before we finish doing so.  The check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * reached.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might have
+	 * already been deleted by the startup process.  The 'inReplayHeap' flag
+	 * prevents double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	/*
+	 * If we didn't reach the target LSN, we must have exited due to timeout.
+	 */
+	if (targetLSN > currentLSN)
+		return WAIT_LSN_RESULT_TIMEOUT;
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap.  We might have
+	 * already been deleted by the waker process.  The 'inFlushHeap' flag
+	 * prevents double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 59822f22b8d..9955e829190 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1022,10 +1022,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..f9c303a8c7f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_TIMEOUT,	/* Timeout occurred */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0

#10Michael Paquier
michael@paquier.xyz
In reply to: Xuneng Zhou (#9)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

On Sat, Oct 04, 2025 at 03:21:07PM +0800, Xuneng Zhou wrote:

v6 refactors and extends the infrastructure from the WAIT FOR command
patch, applying it to read_local_xlog_page_guts. I'm also thinking of
creating a standalone patch/commit for the extended
infra in xlogwait, so it can be reused in different threads.

Yes, I think that you should split your patch where you think that it
can make review easier, because your change touches a very sensitive
area of the code base:
- First patch to introduce what you consider to be the basic
infrastructure required for your patch, that can be shared between
multiple pieces. I doubt that you really need to have everything
that's in waitlsn.c to achieve what you want here.
- Second patch to introduce your actual feature, to make the callback
more responsive.
- Then, potentially have a third patch, that adds pieces of
infrastructure to waitlsn.c that you did not need in the first patch,
still are required for the waitlsn.c thread. It would be optionally
possible to rebase the waitlsn patch to use patches 1 and 3, then.

I'd even try to consider the problem from the angle of looking for
independent pieces that could be extracted from the first patch and
split as other patches, to ease even again the review. There is a
limit to this idea because you need a push/pull/reporting facility for
a flush LSN and a replay LSN depending on if you are on a primary, on
a standby, and even another case where you are dealing with a promoted
standby where you decide to loop back *inside* the callback (which I
suspect may not be always the right thing to do depending on the new
TLI selected), so there is a limit in what could be treated as an
independent piece. At least the bits about pairingheap_initialize()
may be worth considering.

+                       if (waitLSNState &&
+                           (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+                            pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+                           WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);

This code pattern looks like a copy-paste of what's done in
synchronous replication. Has some consolidation between syncrep.c and
this kind of facility ever been considered? In terms of queues, waits
and wakeups, the requirements are pretty similar, still your patch has
zero changes related to syncrep.c or syncrep.h.

As far as I can see based on your patch, you are repeating some of the
mistakes of the wait LSN patch, where I've complained about
WaitForLSNReplay() and the duplication it had. One thing you have
decided to pull for example is duplicated calls to
GetXLogReplayRecPtr().
--
Michael

#11Xuneng Zhou
xunengzhou@gmail.com
In reply to: Michael Paquier (#10)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi Michael,

Thanks again for your insightful review.

On Mon, Oct 6, 2025 at 10:43 AM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Oct 04, 2025 at 03:21:07PM +0800, Xuneng Zhou wrote:

v6 refactors and extends the infrastructure from the WAIT FOR command
patch, applying it to read_local_xlog_page_guts. I'm also thinking of
creating a standalone patch/commit for the extended
infra in xlogwait, so it can be reused in different threads.

Yes, I think that you should split your patch where you think that it
can make review easier, because your change touches a very sensitive
area of the code base:
- First patch to introduce what you consider to be the basic
infrastructure required for your patch, that can be shared between
multiple pieces. I doubt that you really need to have everything
that's in waitlsn.c to achieve what you want here.
- Second patch to introduce your actual feature, to make the callback
more responsive.
- Then, potentially have a third patch, that adds pieces of
infrastructure to waitlsn.c that you did not need in the first patch,
still are required for the waitlsn.c thread. It would be optionally
possible to rebase the waitlsn patch to use patches 1 and 3, then.

I'd even try to consider the problem from the angle of looking for
independent pieces that could be extracted from the first patch and
split as other patches, to ease even again the review. There is a
limit to this idea because you need a push/pull/reporting facility for
a flush LSN and a replay LSN depending on if you are on a primary, on
a standby, and even another case where you are dealing with a promoted
standby where you decide to loop back *inside* the callback (which I
suspect may not be always the right thing to do depending on the new
TLI selected), so there is a limit in what could be treated as an
independent piece. At least the bits about pairingheap_initialize()
may be worth considering.

+1 for further splitting to smooth the review process. The timeout in the
WAIT FOR patch is not needed for the polling problem in the current thread.
I'll remove other unused mechanisms as well.

Yeah, just looping back *inside* the callback could be problematic if
the waited-for LSNs don't exist on the current timeline. I'll add a
check for that.

The new patch set will look like this per your suggestion:

Patch 0: pairingheap infrastructure (independent)
src/backend/lib/pairingheap.c | +14 -4
src/include/lib/pairingheap.h | +3
Adds pairingheap_initialize() for shared memory usage.

Patch 1: Minimal LSN waiting infrastructure
src/backend/access/transam/xlogwait.c | (simplified, no timeout…)
src/include/access/xlogwait.h | +80
src/backend/storage/ipc/ipci.c | +3
src/include/storage/lwlocklist.h | +1
src/backend/utils/activity/wait_event... | +3
src/backend/access/transam/xact.c | +6
src/backend/storage/lmgr/proc.c | +6

Provides WaitForLSNReplay() and WaitForLSNFlush() for internal WAL consumers.

Patch 2: Replace polling in read_local_xlog_page_guts
src/backend/access/transam/xlogutils.c | +40 -5
src/backend/access/transam/xlog.c | +10
src/backend/access/transam/xlogrecovery.c | +6
src/backend/replication/walsender.c | -4

Uses Patch 1 infrastructure to eliminate busy-waiting.

Patch 3: Extend the LSN waiting infrastructure with the pieces that WAIT FOR needs

Patch 4: The WAIT FOR command, built on Patches 1 and 3
SQL interface, full error handling.
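To make the intent of Patch 2 concrete, here is a hedged, self-contained simulation of the before/after behavior; every function and variable below is a stub invented for the demo, not the real PostgreSQL API:

```c
#include <assert.h>
#include <stdint.h>

/* Hedged simulation of the Patch 2 change; all names are stand-ins. */
static uint64_t flushed_lsn = 0;
static int sleeps, latch_waits;

/* Stand-in for GetFlushRecPtr(): WAL advances by 25 per check. */
static uint64_t stub_get_flush_ptr(void)
{
	flushed_lsn += 25;
	return flushed_lsn;
}

/* Before: the check/sleep/repeat loop, one wakeup per interval. */
static void wait_by_polling(uint64_t target)
{
	while (stub_get_flush_ptr() < target)
		sleeps++;				/* pg_usleep() in the real loop */
}

/* After: fast-path check, then a single latch wait until the flusher
 * wakes us at (or past) the target LSN. */
static void stub_wait_for_lsn_flush(uint64_t target)
{
	if (stub_get_flush_ptr() >= target)
		return;					/* already flushed */
	latch_waits++;				/* WaitLatch(); the wakeup sets our latch */
	flushed_lsn = target;
}
```

The point of the simulation is only the wakeup count: the polling loop pays one wakeup per interval below the target, while the latch-based wait pays one.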

+                       if (waitLSNState &&
+                           (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+                            pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+                           WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);

This code pattern looks like a copy-paste of what's done in
synchronous replication. Has some consolidation between syncrep.c and
this kind of facility ever been considered? In terms of queues, waits
and wakeups, the requirements are pretty similar, still your patch has
zero changes related to syncrep.c or syncrep.h.

I wasn't aware of this before; they do share some basic requirements.
I'll explore the possibility of consolidating them.
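For illustration, the fast-path/slow-path shape of the quoted check can be sketched in a self-contained way (names here are stand-ins, not the patch's symbols; UINT64_MAX plays the role of "no waiter registered"):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Cached minimum of all waited LSNs; readable without a lock. */
static _Atomic uint64_t min_waited_lsn = UINT64_MAX;
static int slow_paths;

/* Called by the flusher/startup process after WAL advances. */
static void wakeup_if_needed(uint64_t current_lsn)
{
	/* Fast path: the smallest waited LSN is still ahead of us. */
	if (atomic_load(&min_waited_lsn) > current_lsn)
		return;
	/* Slow path: take WaitLSNLock, pop ready waiters, set latches. */
	slow_paths++;
}
```

The common case (no waiter, or no waiter ready) costs one atomic read on the hot replay/flush path; the lock is only touched when a wakeup is actually due.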

As far as I can see based on your patch, you are repeating some of the
mistakes of the wait LSN patch, where I've complained about
WaitForLSNReplay() and the duplication it had. One thing you have
decided to pull in, for example, is duplicated calls to
GetXLogReplayRecPtr().
--

Will refactor this.

Best,
Xuneng

#12Xuneng Zhou
xunengzhou@gmail.com
In reply to: Michael Paquier (#6)
3 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch — specifically, it must wait for WAL
flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.

Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns seem generally monotonic with commit LSNs, though not
strictly ordered due to timing variations and contention. When most
insertions remain ordered, a dlist can be efficient. However, as the
number of elements grows and out-of-order insertions become more
frequent, the insertion cost can degrade to O(n) more often.
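As a rough illustration of that insertion-cost asymmetry, here is a minimal toy sorted dlist (not syncrep.c's code) whose insert scans from the tail, as commit-ordered LSN queues typically do; it returns how many nodes were examined before the insert position was found:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node
{
	unsigned long lsn;
	struct Node *prev;
	struct Node *next;
} Node;

typedef struct
{
	Node	   *head;
	Node	   *tail;
} DList;

static Node pool[64];
static int	pool_used;

/* Insert keeping ascending LSN order; scan starts at the tail. */
static int
dlist_insert_sorted(DList *l, unsigned long lsn)
{
	Node	   *n = &pool[pool_used++];
	Node	   *cur = l->tail;
	int			scanned = 0;

	n->lsn = lsn;
	n->prev = n->next = NULL;
	while (cur && cur->lsn > lsn)
	{
		scanned++;
		cur = cur->prev;
	}
	if (cur == NULL)			/* becomes the new head */
	{
		n->next = l->head;
		if (l->head)
			l->head->prev = n;
		l->head = n;
		if (l->tail == NULL)
			l->tail = n;
	}
	else						/* insert after cur */
	{
		n->prev = cur;
		n->next = cur->next;
		if (cur->next)
			cur->next->prev = n;
		else
			l->tail = n;
		cur->next = n;
	}
	return scanned;
}
```

Monotonic inserts never move past the tail, while one out-of-order insert walks the whole list.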

By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.
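By way of comparison, a toy pairing heap (one-pass merging, hypothetical names; the real implementation lives in src/backend/lib/pairingheap.c) shows the O(1) insert and cheap find-min that v6 relies on:

```c
#include <assert.h>
#include <stddef.h>

typedef struct PHNode
{
	unsigned long key;			/* the waited LSN */
	struct PHNode *child;
	struct PHNode *sibling;
} PHNode;

/* Merge two heaps: the larger root becomes a child of the smaller. */
static PHNode *
ph_merge(PHNode *a, PHNode *b)
{
	if (a == NULL)
		return b;
	if (b == NULL)
		return a;
	if (b->key < a->key)
	{
		PHNode	   *t = a;

		a = b;
		b = t;
	}
	b->sibling = a->child;
	a->child = b;
	return a;
}

static PHNode *
ph_insert(PHNode *root, PHNode *n)
{
	n->child = n->sibling = NULL;
	return ph_merge(root, n);	/* O(1): one comparison, no scan */
}

/* Re-merge the children of the removed root; amortized O(log n). */
static PHNode *
ph_delete_min(PHNode *root)
{
	PHNode	   *merged = NULL;
	PHNode	   *c = root->child;

	while (c)
	{
		PHNode	   *a = c;
		PHNode	   *b = c->sibling;

		c = b ? b->sibling : NULL;
		a->sibling = NULL;
		if (b)
			b->sibling = NULL;
		merged = ph_merge(merged, ph_merge(a, b));
	}
	return merged;
}
```

Insertion cost here does not depend on arrival order at all, which is what makes the structure attractive for non-sequential target LSNs.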

At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented, mainly because the dlist and the pairing heap each
perform well under different workloads; neither is likely to be
universally optimal. Introducing the facility with a pairing heap
first seems reasonable, as it leaves room for future refactoring: we
could later replace the dlist with a heap, or adopt a modular design,
depending on observed workload characteristics.

Best,
Xuneng

Attachments:

v7-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 32dab7ed64eecb62adce6b1d124b1fa389515e74 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 10 Oct 2025 16:35:38 +0800
Subject: [PATCH v7 2/2] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 525 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 112 ++++
 src/include/storage/lwlocklist.h              |   1 +
 7 files changed, 647 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4faed65765c
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,525 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *	  Used by internal WAL reading operations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a procInfos
+ *		per-backend array with the information of the awaited LSN for each
+ *		of the backend processes.  The elements of that array are organized
+ *		into a pairing heap waitersHeap, which allows for very fast finding
+ *		of the least awaited LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself to the shared
+ *		memory and waits on the latch until it is woken up by the appropriate
+ *		process, the standby is promoted, or the postmaster dies.  Then, it
+ *		cleans up its own information in the shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check minWaitedReplayLSN > replayLSN.  If
+ *		this check is negative, it checks replayWaitersHeap and wakes up the
+ *		backends whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up by wakeupWaiters(), allocated
+ * on the stack.  It should be enough to need a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find LSN not yet reached.
+		 * Record process numbers to wake up, but send wakeups after releasing lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose waited LSNs are already reached.
+		 * As this can be time-consuming, we do it outside of WaitLSNLock.
+		 * That is actually fine because procLatch isn't ever freed, so the
+		 * worst that can happen is setting the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not running in recovery,
+ * or the replica got promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+		return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return WAIT_LSN_RESULT_SUCCESS;
+	}
+
+	/*
+	 * Add our process to the replay waiters heap.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already be deleted by the startup process.  The 'inReplayHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap. We might
+	 * already be deleted by the waker process. The 'inFlushHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..441bf475b4d
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0

v7-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (application/octet-stream)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v7] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0

v7-0003-Improve-read_local_xlog_page_guts-by-replacing-po.patch (application/octet-stream)
From 6b3d84a211e4e4e5d5d3682b159967bd6278cbc6 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 10 Oct 2025 16:48:45 +0800
Subject: [PATCH v7 3/3] Improve read_local_xlog_page_guts by replacing polling
 with latch-based waiting.

Replace the inefficient polling loop in read_local_xlog_page_guts with
the facilities provided by the xlogwait module when WAL data is not yet
available.  This eliminates CPU-intensive busy waiting and improves
responsiveness by waking processes as soon as their target LSN becomes
available.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Xuneng Zhou <xunengzhou@gmail.com>
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 25 ++++++++++++
 src/backend/access/transam/xlogrecovery.c | 11 ++++++
 src/backend/access/transam/xlogutils.c    | 47 +++++++++++++++++++----
 src/backend/replication/walsender.c       |  4 --
 src/backend/storage/lmgr/proc.c           |  6 +++
 6 files changed, 87 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..cff53106f76 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -2912,6 +2913,15 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3094,6 +3104,15 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6225,6 +6244,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..1859d2084e8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 38176d9688e..df8d4629b6c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for the xlog to become available if necessary.
 	 */
 	while (1)
 	{
@@ -927,7 +923,6 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
-
 			if (loc <= read_upto)
 				break;
 
@@ -947,7 +942,43 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for LSN using appropriate method based on server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSNFlush(loc);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSNReplay(loc);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting. This is the tricky case.
+						 * We're now a primary, so loop back and use flush
+						 * logic instead of replay logic.
+						 */
+						break;
+
+					default:
+						elog(ERROR, "unexpected wait result");
+				}
+			}
+
+			/*
+			 * Loop back to recheck everything.
+			 * Timeline might have changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 59822f22b8d..9955e829190 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1022,10 +1022,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
-- 
2.51.0

#13 Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#12)
3 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch — specifically, it must wait for WAL
flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.

Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns seem generally monotonic with commit LSNs, though not
strictly ordered due to timing variations and contention. When most
insertions remain ordered, a dlist can be efficient. However, as the
number of elements grows and out-of-order insertions become more
frequent, the insertion cost can degrade to O(n) more often.

By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.

At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented. This is mainly because the dlist and the pairing heap
each work well under different workloads, and neither is likely to be
universally optimal. Introducing the facility with a pairing heap
first seems reasonable, as it offers flexibility for future
refactoring: we could later replace the dlist with a heap, or adopt a
modular design, depending on observed workload characteristics.

v8-0002 removed the early fast check before addLSNWaiter in WaitForLSNReplay,
as the likelihood of a server state change is small compared to the
branching cost and added code complexity.
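The publish-wait-wake pattern with its cached-minimum fast path can be sketched as a toy Python model (illustrative only; the names here are invented, and threading.Event stands in for the process latch):

```python
import threading

class ToyWaitLSNState:
    """Toy model of the shared state: waiters publish target LSNs, and
    the flusher consults a cached minimum before taking the lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.waiters = {}                  # waiter id -> (target_lsn, event)
        self.min_waited = float("inf")     # stands in for atomic minWaitedFlushLSN

    def add_waiter(self, wid, target_lsn):
        ev = threading.Event()             # stands in for the process latch
        with self.lock:
            self.waiters[wid] = (target_lsn, ev)
            self.min_waited = min(t for t, _ in self.waiters.values())
        return ev

    def wakeup(self, current_lsn):
        # Fast path: nothing to do if nobody waits for an LSN <= current_lsn.
        if self.min_waited > current_lsn:
            return 0
        woken = []
        with self.lock:
            for wid, (t, ev) in list(self.waiters.items()):
                if t <= current_lsn:
                    woken.append(ev)
                    del self.waiters[wid]
            self.min_waited = min(
                (t for t, _ in self.waiters.values()), default=float("inf"))
        for ev in woken:                   # set "latches" outside the lock
            ev.set()
        return len(woken)

state = ToyWaitLSNState()
ev = state.add_waiter(1, 100)
assert state.wakeup(50) == 0               # fast path: min waited (100) > 50
assert state.wakeup(150) == 1 and ev.is_set()
```

In the patch the same shape appears as the atomic minWaitedFlushLSN/minWaitedReplayLSN reads guarding WaitLSNWakeupFlush()/WaitLSNWakeupReplay(), with latches set after WaitLSNLock is released.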

Best,
Xuneng

Attachments:

v8-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/x-patch)
From 32dab7ed64eecb62adce6b1d124b1fa389515e74 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 10 Oct 2025 16:35:38 +0800
Subject: [PATCH v8 2/3] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 525 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 112 ++++
 src/include/storage/lwlocklist.h              |   1 +
 7 files changed, 627 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4faed65765c
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,525 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *	  Used by internal WAL reading operations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a per-backend
+ *		procInfos array holding the awaited LSN for each backend process.
+ *		The elements of that array are organized into the pairing heaps
+ *		replayWaitersHeap and flushWaitersHeap, which allow very fast
+ *		lookup of the least awaited LSN.
+ *
+ *		In addition, the least awaited LSN is cached in minWaitedReplayLSN
+ *		and minWaitedFlushLSN.  A waiter process publishes information about
+ *		itself to shared memory and waits on its latch until it is woken up
+ *		by the appropriate process, the standby is promoted, or the
+ *		postmaster dies.  Then it removes its information from shared memory.
+ *
+ *		On standby servers: After replaying a WAL record, the startup process
+ *		first performs a fast-path check: minWaitedReplayLSN > replayLSN.  If
+ *		that check fails, it scans replayWaitersHeap and wakes up the
+ *		backends whose awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up by wakeupWaiters(), allocated
+ * on the stack.  It should be enough to take a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find an LSN not yet reached.
+		 * Record process numbers to wake up, but set latches after releasing the lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for the processes whose waited LSNs have been
+		 * reached.  As this can be time-consuming, we do it outside of
+		 * WaitLSNLock.  This is actually fine because procLatch is never
+		 * freed, so at worst we can set the wrong process' (or no
+		 * process') latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself, so it's only
+		 * possible to get a false positive.  But that will be eliminated by
+		 * a recheck inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch until the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not running in recovery, or if
+ * the replica got promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = -1;	/* wait indefinitely */
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.  See the comment regarding deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already have been deleted by the startup process.  The 'inReplayHeap'
+	 * flag prevents double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap.  We might have
+	 * already been deleted by the waker process.  The 'inFlushHeap' flag
+	 * prevents double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..441bf475b4d
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for an LSN.  An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of processes waiting for replay, ordered by LSN value
+	 * (least LSN on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of processes waiting for flush, ordered by LSN value
+	 * (least LSN on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0

v8-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (application/x-patch)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v8 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0

v8-0003-Improve-read_local_xlog_page_guts-by-replacing-po.patch (application/x-patch)
From 6b3d84a211e4e4e5d5d3682b159967bd6278cbc6 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 10 Oct 2025 16:48:45 +0800
Subject: [PATCH v8 3/3] Improve read_local_xlog_page_guts by replacing polling
 with latch-based waiting.

Replace the inefficient polling loop in read_local_xlog_page_guts with
the waiting facilities developed in the xlogwait module, used when WAL
data is not yet available.  This eliminates CPU-intensive busy waiting
and improves responsiveness by waking processes immediately when their
target LSN becomes available.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Xuneng Zhou <xunengzhou@gmail.com>
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/xact.c         |  6 +++
 src/backend/access/transam/xlog.c         | 25 ++++++++++++
 src/backend/access/transam/xlogrecovery.c | 11 ++++++
 src/backend/access/transam/xlogutils.c    | 47 +++++++++++++++++++----
 src/backend/replication/walsender.c       |  4 --
 src/backend/storage/lmgr/proc.c           |  6 +++
 6 files changed, 87 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..cff53106f76 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
@@ -2912,6 +2913,15 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3094,6 +3104,15 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >= pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN)))
+		WaitLSNWakeupFlush(LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
@@ -6225,6 +6244,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before reaching the target LSN.
+	 */
+	WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..1859d2084e8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecovery.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+				WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 38176d9688e..df8d4629b6c 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for xlog to be available if necessary.
 	 */
 	while (1)
 	{
@@ -927,7 +923,6 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 
 		if (state->currTLI == currTLI)
 		{
-
 			if (loc <= read_upto)
 				break;
 
@@ -947,7 +942,43 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for the LSN using the method appropriate to the server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSNFlush(loc);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSNReplay(loc);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting.  We're now a primary,
+						 * so loop back and use the flush-wait logic
+						 * instead of the replay logic.
+						 */
+						break;
+
+					default:
+						elog(ERROR, "unexpected wait result");
+				}
+			}
+
+			/*
+			 * Loop back and recheck everything; the timeline might have
+			 * changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 59822f22b8d..9955e829190 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1022,10 +1022,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
-- 
2.51.0

#14Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#13)
3 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

On Wed, Oct 15, 2025 at 8:31 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch — specifically, it must wait for WAL
flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.

Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns seem generally monotonic with commit LSNs, though not
strictly ordered due to timing variations and contention. When most
insertions remain ordered, a dlist can be efficient. However, as the
number of elements grows and out-of-order insertions become more
frequent, the insertion cost can degrade to O(n) more often.

By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.

At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented. This is mainly because the dlist and pairing heap
each works well under different workloads — neither is likely to be
universally optimal. Introducing the facility with a pairing heap
first seems reasonable, as it offers flexibility for future
refactoring: we could later replace dlist with a heap or adopt a
modular design depending on observed workload characteristics.

v8-0002 removed the early fast check before addLSNWaiter in WaitForLSNReplay,
as the likelihood of a server state change is small compared to the
branching cost and added code complexity.

Made minor changes to the #include of xlogwait.h in patch 2 to calm the CF bots down.

Best,
Xuneng

Attachments:

v9-0003-Improve-read_local_xlog_page_guts-by-replacing-po.patch (application/octet-stream)

v9-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (application/octet-stream)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v9 1/3] Add pairingheap_initialize() for shared memory usage

The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
 src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
 src/include/lib/pairingheap.h |  3 +++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
-- 
2.51.0
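As context for reviewers, the 0001 refactoring is the classic split of an allocator into allocate-then-initialize-in-place. A self-contained toy sketch of that pattern follows; `toyheap` and friends are hypothetical names, not PostgreSQL code:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * The palloc()-based allocator is reduced to "allocate, then initialize in
 * place", so the initializer can also be applied to a struct that already
 * lives in shared memory.  All names here are hypothetical.
 */

typedef int (*heap_comparator) (int a, int b, void *arg);

typedef struct toyheap
{
	heap_comparator compare;
	void	   *arg;
	void	   *root;			/* stand-in for pairingheap's ph_root */
} toyheap;

/* Initialize an already-allocated heap in place. */
static void
toyheap_initialize(toyheap *heap, heap_comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate from local memory, then reuse the in-place initializer. */
static toyheap *
toyheap_allocate(heap_comparator compare, void *arg)
{
	toyheap    *heap = malloc(sizeof(toyheap));

	toyheap_initialize(heap, compare, arg);
	return heap;
}

static int
int_cmp(int a, int b, void *arg)
{
	(void) arg;
	return (a > b) - (a < b);
}
```

A struct embedded in shared memory would then call the in-place initializer directly, which is exactly how WaitLSNShmemInit() in the 0002 patch uses pairingheap_initialize() on the heaps inside WaitLSNState.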

Attachment: v9-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 39857e15fac0a7b5b3105b730db4dfb271788cca Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 15:47:27 +0800
Subject: [PATCH v9] Add infrastructure for efficient LSN waiting

Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com

Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/meson.build        |   1 +
 src/backend/access/transam/xlogwait.c         | 503 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   3 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/include/access/xlogwait.h                 | 112 ++++
 src/include/storage/lwlocklist.h              |   1 +
 7 files changed, 625 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/transam/xlogwait.c
 create mode 100644 src/include/access/xlogwait.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
 	xlogreader.o \
 	xlogrecovery.o \
 	xlogstats.o \
-	xlogutils.o
+	xlogutils.o \
+	xlogwait.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
   'xlogrecovery.c',
   'xlogstats.c',
   'xlogutils.c',
+  'xlogwait.c',
 )
 
 # used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..49dae7ac1c4
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ *	  Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ *		This file implements waiting for WAL operations to reach specific LSNs
+ *		on both physical standby and primary servers. The core idea is simple:
+ *		every process that wants to wait publishes the LSN it needs to the
+ *		shared memory, and the appropriate process (startup on standby, or
+ *		WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ *		The shared memory used by this module comprises a per-backend array,
+ *		procInfos, holding the awaited LSN for each backend process.  The
+ *		elements of that array are organized into a pairing heap,
+ *		waitersHeap, which allows very fast lookup of the least awaited
+ *		LSN.
+ *
+ *		In addition, the least-awaited LSN is cached as minWaitedLSN.  The
+ *		waiter process publishes information about itself in shared memory
+ *		and waits on its latch until it is woken up by the appropriate
+ *		process, the standby is promoted, or the postmaster dies.  Then it
+ *		removes its information from shared memory.
+ *
+ *		On standby servers: after replaying a WAL record, the startup process
+ *		first performs the fast-path check minWaitedLSN > replayLSN.  If that
+ *		check fails, it scans waitersHeap and wakes up the backends whose
+ *		awaited LSNs have been reached.
+ *
+ *		On primary servers: After flushing WAL, the WAL writer or backend
+ *		process performs a similar check against the flush LSN and wakes up
+ *		waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int	waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+static int	waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+														  WaitLSNShmemSize(),
+														  &found);
+	if (!found)
+	{
+		/* Initialize replay heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+		/* Initialize flush heap and tracking */
+		pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+		/* Initialize process info array */
+		memset(&waitLSNState->procInfos, 0,
+			   (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for the replay waiters heap.  Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Comparison function for the flush waiters heap.  Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+	XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+		{
+			pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+			WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+			minWaitedLSN = procInfo->waitLSN;
+		}
+		pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+	}
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	procInfo->procno = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	if (operation == WAIT_LSN_REPLAY)
+	{
+		Assert(!procInfo->inReplayHeap);
+		pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else /* WAIT_LSN_FLUSH */
+	{
+		Assert(!procInfo->inFlushHeap);
+		pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = true;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+	{
+		pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+		procInfo->inReplayHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_REPLAY);
+	}
+	else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+	{
+		pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+		procInfo->inFlushHeap = false;
+		updateMinWaitedLSN(WAIT_LSN_FLUSH);
+	}
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the static array of procs to wake up, allocated on the stack in
+ * wakeupWaiters().  It should be enough for a single iteration in most cases.
+ */
+#define	WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock.  The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE.  That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases.  However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+	ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+	int numWakeUpProcs;
+	int i;
+	pairingheap *heap;
+
+	/* Select appropriate heap */
+	heap = (operation == WAIT_LSN_REPLAY) ?
+		   &waitLSNState->replayWaitersHeap :
+		   &waitLSNState->flushWaitersHeap;
+
+	do
+	{
+		numWakeUpProcs = 0;
+		LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+		/*
+		 * Iterate the waiters heap until we find an LSN not yet reached.
+		 * Record process numbers to wake up; set latches after releasing the lock.
+		 */
+		while (!pairingheap_is_empty(heap))
+		{
+			pairingheap_node *node = pairingheap_first(heap);
+			WaitLSNProcInfo *procInfo;
+
+			/* Get procInfo using appropriate heap node */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+			else
+				procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+			if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+				break;
+
+			Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+			wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+			(void) pairingheap_remove_first(heap);
+
+			/* Update appropriate flag */
+			if (operation == WAIT_LSN_REPLAY)
+				procInfo->inReplayHeap = false;
+			else
+				procInfo->inFlushHeap = false;
+
+			if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+				break;
+		}
+
+		updateMinWaitedLSN(operation);
+		LWLockRelease(WaitLSNLock);
+
+		/*
+		 * Set latches for processes whose awaited LSNs have been reached.
+		 * As this is a potentially time-consuming operation, we do it
+		 * outside of WaitLSNLock.  That is actually fine because procLatch
+		 * isn't ever freed, so at worst we set the wrong process's (or no
+		 * process's) latch.
+		 */
+		for (i = 0; i < numWakeUpProcs; i++)
+			SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+	} while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+	/* Fast path check */
+	if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+		return;
+
+	wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (waitLSNState)
+	{
+		/*
+		 * We do a fast-path check of the heap flags without the lock.  These
+		 * flags are set to true only by the process itself.  So, it's only possible
+		 * to get a false positive.  But that will be eliminated by a recheck
+		 * inside deleteLSNWaiter().
+		 */
+		if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+		if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+			deleteLSNWaiter(WAIT_LSN_FLUSH);
+	}
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	/*
+	 * Add our process to the replay waiters heap.  The target LSN might get
+	 * replayed before we finish adding ourselves.  The recheck at the top of
+	 * the loop below closes that race.
+	 */
+	addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+	for (;;)
+	{
+		int			rc;
+		currentLSN = GetXLogReplayRecPtr(NULL);
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery has ended, but recheck whether the target LSN was
+			 * already replayed.  See the comment for deleteLSNWaiter() below.
+			 */
+			deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+			if (PromoteIsTriggered() && targetLSN <= currentLSN)
+				return WAIT_LSN_RESULT_SUCCESS;
+			return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			if (targetLSN <= currentLSN)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN replay")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory replay heap.  We might
+	 * already have been deleted by the startup process.  The 'inReplayHeap'
+	 * flag prevents double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+	return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+	XLogRecPtr	currentLSN;
+	int			wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+	/* We can only wait for flush when we are not in recovery */
+	Assert(!RecoveryInProgress());
+
+	/* Quick exit if already flushed */
+	currentLSN = GetFlushRecPtr(NULL);
+	if (targetLSN <= currentLSN)
+		return;
+
+	/* Add to flush waiters */
+	addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+	/* Wait loop */
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been flushed */
+		currentLSN = GetFlushRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, -1,
+					   WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+		/*
+		 * Emergency bailout if postmaster has died. This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			ereport(FATAL,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("terminating connection due to unexpected postmaster exit"),
+					 errcontext("while waiting for LSN flush")));
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory flush heap.  We might
+	 * already have been deleted by the waker process.  The 'inFlushHeap'
+	 * flag prevents double deletion.
+	 */
+	deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+	return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
 #include "access/twophase.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH	"Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY	"Waiting for WAL replay to reach a target LSN on a standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -355,6 +357,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..ada2a460ca4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+	WAIT_LSN_RESULT_SUCCESS,	/* Target LSN is reached */
+	WAIT_LSN_RESULT_NOT_IN_RECOVERY,	/* Recovery ended before or during our
+										 * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+	WAIT_LSN_REPLAY,	/* Waiting for replay on standby */
+	WAIT_LSN_FLUSH		/* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations.  An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* Process to wake up once the waitLSN is reached */
+	ProcNumber	procno;
+
+	/* Type-safe heap membership flags */
+	bool		inReplayHeap;	/* In replay waiters heap */
+	bool		inFlushHeap;	/* In flush waiters heap */
+
+	/* Separate heap nodes for type safety */
+	pairingheap_node replayHeapNode;
+	pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum replay LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedReplayLSN;
+
+	/*
+	 * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap replayWaitersHeap;
+
+	/*
+	 * The minimum flush LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after flushing
+	 * WAL.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedFlushLSN;
+
+	/*
+	 * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap flushWaitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif							/* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
-- 
2.51.0
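To make the 0002 waiter/waker protocol concrete, here is a single-process simulation of its fast path. It assumes a flat slot array instead of a pairing heap and ignores locks and latches; all names are hypothetical stand-ins for the real ones:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define NO_WAITER UINT64_MAX

/* One slot per process; NO_WAITER means the slot is unused. */
static uint64_t waiter_lsn[MAX_WAITERS];
/* Cached minimum of all waited LSNs, cf. minWaitedFlushLSN. */
static uint64_t min_waited;

static void
init_waiters(void)
{
	for (int i = 0; i < MAX_WAITERS; i++)
		waiter_lsn[i] = NO_WAITER;
	min_waited = NO_WAITER;
}

static void
recompute_min(void)
{
	min_waited = NO_WAITER;
	for (int i = 0; i < MAX_WAITERS; i++)
		if (waiter_lsn[i] < min_waited)
			min_waited = waiter_lsn[i];
}

/* A waiter publishes its target LSN, cf. addLSNWaiter(). */
static void
add_waiter(int slot, uint64_t target)
{
	waiter_lsn[slot] = target;
	recompute_min();
}

/*
 * The waker's side, cf. WaitLSNWakeupFlush(): the cached minimum gives a
 * cheap early exit; otherwise every waiter at or below the flushed LSN is
 * removed and "woken".  Returns the number of waiters woken.
 */
static int
wakeup_flush(uint64_t flushed)
{
	int			woken = 0;

	if (min_waited > flushed)	/* fast path: nobody waits this low */
		return 0;
	for (int i = 0; i < MAX_WAITERS; i++)
		if (waiter_lsn[i] <= flushed)
		{
			waiter_lsn[i] = NO_WAITER;	/* stands in for SetLatch() */
			woken++;
		}
	recompute_min();
	return woken;
}
```

The real patch replaces the linear scan with the pairing heap (top element = least awaited LSN), but the fast-path comparison against the cached minimum is the same idea.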

#15Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#14)
1 attachment(s)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

On Wed, Oct 15, 2025 at 4:43 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Wed, Oct 15, 2025 at 8:31 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi,

The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch — specifically, it must wait for WAL
flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.

Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns seem generally monotonic with commit LSNs, though not
strictly ordered due to timing variations and contention. When most
insertions remain ordered, a dlist can be efficient. However, as the
number of elements grows and out-of-order insertions become more
frequent, the insertion cost can degrade to O(n) more often.

By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.

At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented. This is mainly because the dlist and pairing heap
each works well under different workloads — neither is likely to be
universally optimal. Introducing the facility with a pairing heap
first seems reasonable, as it offers flexibility for future
refactoring: we could later replace dlist with a heap or adopt a
modular design depending on observed workload characteristics.

v8-0002 removed the early fast check before addLSNWaiter in WaitForLSNReplay,
as the likelihood of a server state change is small compared to the
branching cost and added code complexity.

Made minor changes to #include of xlogwait.h in patch2 to calm CF-bots down.

Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them.
Please check.

Best,
Xuneng

Attachments:

Attachment: v10-0001-Improve-read_local_xlog_page_guts-by-replacing-p.patch (application/x-patch)
From ec9625408c81ac78e4d7ecf6a459e258d56482fb Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 7 Nov 2025 21:33:05 +0800
Subject: [PATCH v10] Improve read_local_xlog_page_guts by replacing polling
 with latch-based waiting.

Replace the inefficient polling loop in read_local_xlog_page_guts with the latch-based
facilities of the xlogwait module when WAL data is not yet available.  This eliminates
CPU-intensive busy waiting and improves responsiveness by waking processes as soon as
their target LSN becomes available.
---
 src/backend/access/transam/xlog.c      | 20 +++++++++++
 src/backend/access/transam/xlogutils.c | 46 ++++++++++++++++++++++----
 src/backend/replication/walsender.c    |  4 ---
 3 files changed, 59 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 101b616b028..f86f8b9d8cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2915,6 +2915,16 @@ XLogFlush(XLogRecPtr record)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -3097,6 +3107,16 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests(true, !RecoveryInProgress());
 
+	/*
+	 * If we flushed an LSN that someone was waiting for then walk
+	 * over the shared memory array and set latches to notify the
+	 * waiters.
+	 */
+	if (waitLSNState &&
+		(LogwrtResult.Flush >=
+		 pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH])))
+		WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush);
+
 	/*
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index ce2a3e42146..39a18744cae 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,6 +23,7 @@
 #include "access/xlogrecovery.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "storage/smgr.h"
@@ -880,12 +881,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	loc = targetPagePtr + reqLen;
 
 	/*
-	 * Loop waiting for xlog to be available if necessary
-	 *
-	 * TODO: The walsender has its own version of this function, which uses a
-	 * condition variable to wake up whenever WAL is flushed. We could use the
-	 * same infrastructure here, instead of the check/sleep/repeat style of
-	 * loop.
+	 * Wait for xlog to be available if necessary.
 	 */
 	while (1)
 	{
@@ -947,7 +943,43 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 			}
 
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+
+			/*
+			 * Wait for LSN using appropriate method based on server state.
+			 */
+			if (!RecoveryInProgress())
+			{
+				/* Primary: wait for flush */
+				WaitForLSN(WAIT_LSN_TYPE_FLUSH, loc, -1);
+			}
+			else
+			{
+				/* Standby: wait for replay */
+				WaitLSNResult result = WaitForLSN(WAIT_LSN_TYPE_REPLAY, loc, -1);
+
+				switch (result)
+				{
+					case WAIT_LSN_RESULT_SUCCESS:
+						/* LSN was replayed, loop back to recheck timeline */
+						break;
+
+					case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+						/*
+						 * Promoted while waiting. This is the tricky case.
+						 * We're now a primary, so loop back and use flush
+						 * logic instead of replay logic.
+						 */
+						break;
+
+					default:
+						elog(ERROR, "unexpected wait result");
+				}
+			}
+
+			/*
+			 * Loop back to recheck everything.
+			 * Timeline might have changed during our wait.
+			 */
 		}
 		else
 		{
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fc8f8559073..1821bf31539 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1036,10 +1036,6 @@ StartReplication(StartReplicationCmd *cmd)
 /*
  * XLogReaderRoutine->page_read callback for logical decoding contexts, as a
  * walsender process.
- *
- * Inside the walsender we can do better than read_local_xlog_page,
- * which has to do a plain sleep/busy loop, because the walsender's latch gets
- * set every time WAL is flushed.
  */
 static int
 logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
-- 
2.51.0

#16Michael Paquier
michael@paquier.xyz
In reply to: Xuneng Zhou (#15)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

On Fri, Nov 07, 2025 at 09:48:23PM +0800, Xuneng Zhou wrote:

Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them.
Please check.

That's indeed much simpler. I'll check later what you have here.
--
Michael

#17Xuneng Zhou
xunengzhou@gmail.com
In reply to: Michael Paquier (#16)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi Michael,

On Sat, Nov 8, 2025 at 7:03 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 07, 2025 at 09:48:23PM +0800, Xuneng Zhou wrote:

Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them.
Please check.

That's indeed much simpler. I'll check later what you have here.
--
Michael

Thanks again for your earlier suggestion on splitting the patches to
make the review process smoother.

Although this version is simpler in terms of code size, the review
effort still feels non-trivial. During self-review, a few points stood
out as meriting careful consideration:

1) Reliance on the new wait-for-LSN infrastructure

The stability and correctness of this patch now depend heavily on the
newly added wait-for-LSN infrastructure, which has not yet been
battle-tested. This puts the patch in a bit of a dilemma: we want the
infrastructure to be as reliable as possible, but it could be hard to
fully validate its robustness without using it in real scenarios, even
after careful review.

2) Wake-up behavior

Are the waiting processes waking up at the correct points and under
the right conditions? Ensuring proper wake-ups is essential for both
correctness and performance.

3) Edge cases

Are edge cases—such as a promotion occurring while a process is
waiting in standby—handled correctly and without introducing races or
inconsistent states?

--
Best,
Xuneng

#18Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#17)
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting

Hi,

On Wed, Nov 19, 2025 at 11:44 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi Michael,

On Sat, Nov 8, 2025 at 7:03 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Nov 07, 2025 at 09:48:23PM +0800, Xuneng Zhou wrote:

Now that the LSN-waiting infrastructure (3b4e53a) and WAL replay
wake-up calls (447aae1) are in place, this patch has been updated to
make use of them.
Please check.

That's indeed much simpler. I'll check later what you have here.
--
Michael

Thanks again for your earlier suggestion on splitting the patches to
make the review process smoother.

Although this version is simpler in terms of the amount of code, the
review effort still feels non-trivial. During my own self-review, a
few points stood out as areas that merit careful consideration:

1) Reliance on the new wait-for-LSN infrastructure

The stability and correctness of this patch now depend heavily on the
newly added wait-for-LSN infrastructure, which has not yet been
battle-tested. This puts the patch in a bit of a dilemma: we want the
infrastructure to be as reliable as possible, but it could be hard to
fully validate its robustness without using it in real scenarios, even
after careful review.

Here is my (admittedly incomplete) reading of the behaviors:

2) Wake-up behavior

Are the waiting processes waking up at the correct points and under
the right conditions? Ensuring proper wake-ups is essential for both
correctness and performance.

Primary (Flush Wait):

The patch adds WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH, LogwrtResult.Flush)
in XLogFlush() and XLogBackgroundFlush(), right after the existing
walsender notification:

    /* wake up walsenders now that we've released heavily contended locks */
    WalSndWakeupProcessRequests(true, !RecoveryInProgress());

Standby (Replay Wait):
-- The "End of Recovery" Wake-up
Location: xlog.c (inside StartupXLOG, around line 6266)

    /*
     * Wake up all waiters for replay LSN. They need to report an error that
     * recovery was ended before reaching the target LSN.
     */
    WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);

This call happens immediately after the server transitions from
"Recovery" to "Production" mode (RECOVERY_STATE_DONE).

-- The "Continuous Replay" Wake-up
Location: xlogrecovery.c (inside the main redo loop, around line 1850)
    /*
     * If we replayed an LSN that someone was waiting for then walk
     * over the shared memory array and set latches to notify the
     * waiters.
     */
    if (waitLSNState &&
        (XLogRecoveryCtl->lastReplayedEndRecPtr >=
         pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
        WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);

It handles the continuous stream of updates during normal standby operation.

3) Edge cases

Are edge cases—such as a promotion occurring while a process is
waiting in standby—handled correctly and without introducing races or
inconsistent states?

Consider the following sequence which I traced through the logic:

1. Pre-Promotion: A backend (e.g., a logical decoding session) calls
read_local_xlog_page_guts for a future LSN. RecoveryInProgress()
returns true, so it enters WaitForLSN(WAIT_LSN_TYPE_REPLAY, ...).

2. The Event: pg_promote() is issued. The Startup process finishes
recovery and broadcasts a wake-up to all waiters.

3. Detection: WaitForLSN returns WAIT_LSN_RESULT_NOT_IN_RECOVERY. The
code explicitly handles this case:

case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
/* Promoted while waiting... loop back */
break;

4. The Transition: The loop restarts.
-- RecoveryInProgress() is checked again and now returns false.
-- The logic automatically switches branches to
WaitForLSN(WAIT_LSN_TYPE_FLUSH, ...).

5. This transition relaxes the constraint from "wait for replay"
(required for consistency on standby) to "wait for flush" (required
for durability on primary).

6. Timeline Divergence:
XLogReadDetermineTimeline is called at the top of the loop.

-- Scenario A: Waiting for Historical Data (Pre-Promotion)
If we were waiting for LSN 0/5000 and promotion happened at 0/6000
(creating TLI 2), XLogReadDetermineTimeline will see that 0/5000
belongs to TLI 1 (now historical).
Result: state->currTLI (1) != currTLI (2).
Action: The loop breaks immediately (via the else block), skipping
the wait. Since the data is historical, it is immutable and assumed to
be on disk.

-- Scenario B: Waiting for Future Data (Post-Promotion)
If we were waiting for LSN 0/7000 and promotion happened at 0/6000
(creating TLI 2), XLogReadDetermineTimeline will identify that 0/7000
belongs to the new TLI 2.
Result: state->currTLI (2) == currTLI (2).
Action: The loop continues, and we enter
WaitForLSN(WAIT_LSN_TYPE_FLUSH, ...) to wait for the new primary to
generate this data.

-- Scenario C: Waiting exactly at the Switch Point
Suppose we were waiting for the exact LSN at which the timeline switched.
Action: XLogReadDetermineTimeline handles the boundary calculation
(tliSwitchPoint), ensuring we read from the correct segment file
(e.g., switching from 00000001... to 00000002...).

--
Best,
Xuneng