Implement waiting for wal lsn replay: reloaded
Hi!
Introduction
A simple way to wait for a given LSN to be replayed on a standby
is useful because it provides a way to achieve read-your-writes
consistency while working with both the replication leader and a standby.
It's both handier and cheaper to have built-in functionality for
that instead of polling pg_last_wal_replay_lsn().
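
For comparison, here is roughly what the polling approach requires on the client side: the text returned by pg_last_wal_replay_lsn() has to be parsed and compared numerically, since comparing LSNs as strings gives wrong answers (for example, '10/0' sorts before '9/0' as text). This is only a minimal sketch, assuming the usual 'hi/lo' hexadecimal text form of an LSN; parse_lsn() and lsn_replayed() are hypothetical helper names and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN in its textual "hi/lo" hexadecimal form (e.g. "0/3000060")
 * into a 64-bit position.  Returns false on malformed input.
 * Hypothetical helper for illustration.  sscanf() is lenient (it accepts
 * a sign or a 0x prefix), so this is not a strict validator.
 */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	int			consumed = 0;

	assert(text != NULL && lsn != NULL);

	/* At most 8 hex digits per half; %n records where parsing stopped. */
	if (sscanf(text, "%8X/%8X%n", &hi, &lo, &consumed) != 2)
		return false;
	if (text[consumed] != '\0')
		return false;			/* trailing garbage */

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/*
 * Compare a target LSN with the last replayed LSN, both given as text.
 * Returns 1 if the target has been replayed, 0 if not yet, -1 on bad input.
 */
static int
lsn_replayed(const char *target_text, const char *replay_text)
{
	uint64_t	target;
	uint64_t	replayed;

	if (!parse_lsn(target_text, &target) ||
		!parse_lsn(replay_text, &replayed))
		return -1;

	return (replayed >= target) ? 1 : 0;
}
```

A polling client would call something like lsn_replayed() in a loop with a sleep in between, which costs a round trip per iteration. That is the overhead the built-in wait is meant to avoid.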
Key problem
While this feature generally looks trivial, there is a surprisingly
hard problem. While waiting for an LSN to replay, you must not hold any
snapshots. If you hold a snapshot on standby, that snapshot could
prevent the replay of WAL records. In turn, that could prevent the
wait from finishing, causing a kind of deadlock. Therefore, waiting for
an LSN to replay can't be implemented as a function. My last attempt
implemented this functionality as a stored procedure [1]. This
approach generally works but has a couple of serious limitations.
1) Given that a CALL statement has to look up the catalog for the stored
procedure, we can't work inside a transaction of REPEATABLE READ or a
higher isolation level (even if nothing has been done before in that
transaction). It is especially unpleasant that this limitation covers
the case of the implicit transaction when
default_transaction_isolation = 'repeatable read' [2]. I had a
workaround for that [3], but it looks a bit awkward.
2) Using output parameters for a stored procedure causes an extra
snapshot to be held. And that snapshot is difficult (unsafe?) to
release [3].
Present solution
The present patch implements a new utility command WAIT FOR LSN
'target_lsn' [, TIMEOUT 'timeout'][, THROW 'throw']. Unlike previous
attempts to implement custom syntax, it uses only one extra unreserved
keyword. The parameters are implemented as generic_option_list.
Custom syntax eliminates the problem of running within an empty
transaction of REPEATABLE READ level or higher. We don't need to
look up a system catalog. Thus, we don't have to set a transaction snapshot.
Also, revising PlannedStmtRequiresSnapshot() allows us to avoid
holding a snapshot to return a value. Therefore, the WAIT command in
the attached patch returns its result status.
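
To illustrate the returned status, the sketch below maps a wait outcome to the one-row text result, or to NULL when an error should be raised instead. The three status strings are the ones used in the attached patch; the enum and function names are hypothetical stand-ins for the patch's WaitLSNResult handling, not the actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the patch's WaitLSNResult. */
typedef enum
{
	SKETCH_WAIT_SUCCESS,
	SKETCH_WAIT_TIMEOUT,
	SKETCH_WAIT_NOT_IN_RECOVERY
} sketch_wait_result;

/*
 * Return the status text reported to the client, or NULL when the command
 * should raise an error instead (a failure with THROW enabled).
 */
static const char *
sketch_wait_status(sketch_wait_result result, bool throw_on_failure)
{
	switch (result)
	{
		case SKETCH_WAIT_SUCCESS:
			return "success";
		case SKETCH_WAIT_TIMEOUT:
			return throw_on_failure ? NULL : "timeout";
		case SKETCH_WAIT_NOT_IN_RECOVERY:
			return throw_on_failure ? NULL : "not in recovery";
	}

	/* Not reachable for valid enum values. */
	assert(false);
	return NULL;
}
```

With THROW disabled, a client can branch on the returned text instead of catching an error, which may be more convenient in scripts.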
Also, the attached patch explicitly checks if the standby has been
promoted to throw the most relevant form of an error. The issue of
inaccurate error messages has been previously spotted in [5].
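
The promotion check can be pictured as the following decision, made before any waiting starts. This is a simplified sketch under the assumption that the recovery state, the promotion flag, and the replay position are passed in as plain values; in the patch they come from RecoveryInProgress(), PromoteIsTriggered(), and GetXLogReplayRecPtr(). The type and function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of the check done before entering the wait loop. */
typedef enum
{
	SKETCH_PRECHECK_SUCCESS,
	SKETCH_PRECHECK_NOT_IN_RECOVERY,
	SKETCH_PRECHECK_MUST_WAIT
} sketch_precheck;

static sketch_precheck
sketch_precheck_wait(bool in_recovery, bool promote_triggered,
					 uint64_t target_lsn, uint64_t replay_lsn)
{
	/* The caller is expected to have rejected the invalid LSN (zero). */
	assert(target_lsn != 0);

	if (!in_recovery)
	{
		/*
		 * The standby may have been promoted concurrently with the call
		 * while the target was already replayed; that still counts as
		 * success.
		 */
		if (promote_triggered && target_lsn <= replay_lsn)
			return SKETCH_PRECHECK_SUCCESS;
		return SKETCH_PRECHECK_NOT_IN_RECOVERY;
	}

	if (target_lsn <= replay_lsn)
		return SKETCH_PRECHECK_SUCCESS;

	return SKETCH_PRECHECK_MUST_WAIT;
}
```

When the outcome is "not in recovery", the promotion flag also decides which error text is shown: a promoted standby gets a detail saying recovery ended before the target LSN was replayed, while a server that was never a standby gets a hint that the wait only works during recovery.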
Any comments?
Links.
1. /messages/by-id/E1sZwuz-002NPQ-Lc@gemulon.postgresql.org
2. /messages/by-id/14de8671-e328-4c3e-b136-664f6f13a39f@iki.fi
3. /messages/by-id/CAPpHfdvRmTzGJw5rQdSMkTxUPZkjwtbQ=LJE2u9Jqh9gFXHpmg@mail.gmail.com
4. /messages/by-id/4953563546cb8c8851f84c7debf723ef@postgrespro.ru
5. /messages/by-id/ab0eddce-06d4-4db2-87ce-46fa2427806c@iki.fi
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v1-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 496808d1e9af1ae20bab59761be9d27c0cbaca2a Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 19 Nov 2024 07:16:41 +0200
Subject: [PATCH v1] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 336 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 185 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 14 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 89 +++++
src/include/commands/wait.h | 21 ++
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/043_wait_for_lsn.pl | 217 +++++++++++
src/tools/pgindent/typedefs.list | 4 +
28 files changed, 966 insertions(+), 11 deletions(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/043_wait_for_lsn.pl
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..8b585cba751 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY wait SYSTEM "wait.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..bd14ec00d2d 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &wait;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557cd..91d258f9df1 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b7ebcc2a557..004f7e10e55 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bcab..f14d3933aec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6173,6 +6174,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 05c738d6614..869cb524082 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for, then walk
+ * over the shared memory pairing heap and set latches to notify
+ * the waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..313c8cc35df
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,336 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap. Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber *wakeUpProcs;
+ int numWakeUpProcs = 0;
+
+ wakeUpProcs = palloc(sizeof(ProcNumber) * MaxBackends);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not yet
+ * replayed. Record the process numbers to wake up, but to avoid holding
+ * the lock for too long, send the wakeups only after releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed. As
+ * this is a time-consuming operation, we do it outside of WaitLSNLock.
+ * This is actually fine because procLatch isn't ever freed, so we just
+ * can potentially set the wrong process' (or no process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ {
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+ }
+ pfree(wakeUpProcs);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..d8f6965d8c6 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..3f06dc53410 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..3cc5b2e832f
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ char *result;
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ char *name = str_tolower(defel->defname, strlen(defel->defname),
+ DEFAULT_COLLATION_OID);
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ timeout = pg_strtoint64(strVal(defel->arg));
+ }
+ else if (strcmp(name, "throw") == 0)
+ {
+ throw = DatumGetBool(DirectFunctionCall1(boolin,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong wait argument: %s",
+ defel->defname)));
+ }
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is
+ * a utility command, not a function or a procedure.
+ *
+ * At first, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * At second, invalidate a catalog snapshot if any. And we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (GetOldestSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As the result we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index fe1deba13ec..7858e5e076b 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 67eb96396af..7b692954f20 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -299,7 +299,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -778,7 +778,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1106,6 +1106,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16266,6 +16267,14 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR generic_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
/*
* Aggregate decoration clauses
@@ -17922,6 +17931,7 @@ unreserved_keyword:
| VIEW
| VIEWS
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7783ba854fc..d68aa29d93e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 720ef99ee83..1f4c93520ff 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -891,6 +892,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 0c45fcf318f..116642b81b6 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1168,10 +1168,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1760,7 +1761,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f28bf371059..1507f784ac0 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72d..8efb4044d6f 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -87,6 +87,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -345,6 +346,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..41234f6b961
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay. An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 7eade81535a..9e1c26033a1 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 0f9462493e3..1502be41688 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4258,4 +4258,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 899d64ad55f..87c58d2063b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -491,6 +491,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..88dc79b2bd6 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd93..02a6d576f08 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..32040d43550 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
't/040_standby_failover_slots_sync.pl',
't/041_checkpoint_at_promote.pl',
't/042_low_level_backup.pl',
+ 't/043_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/043_wait_for_lsn.pl b/src/test/recovery/t/043_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/043_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance. So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, the log_count() function logs whether as many rows are visible
+# as there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b54428b38cd..cac2424a99b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3129,7 +3129,11 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5 (Apple Git-154)
On Wed, 27 Nov 2024 at 09:09, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi!
Introduction
The simple way to wait for a given lsn to replay on standby appears to
be useful because it provides a way to achieve read-your-writes
consistency while working with both replication leader and standby.
And it's both handy and cheaper to have built-in functionality for
that instead of polling pg_last_wal_replay_lsn().
Key problem
While this feature generally looks trivial, there is a surprisingly
hard problem. While waiting for an LSN to replay, you shouldn't hold any
snapshots. If you hold a snapshot on standby, that snapshot could
prevent the replay of WAL records. In turn, that could prevent the
wait from finishing, causing a kind of deadlock. Therefore, waiting for
LSN to replay couldn't be implemented as a function. My last attempt
implements this functionality as a stored procedure [1]. This
approach generally works but has a couple of serious limitations.
1) Given that a CALL statement has to look up a catalog for the stored
procedure, we can't work inside a transaction of REPEATABLE READ or a
higher isolation level (even if nothing has been done before in that
transaction). It is especially unpleasant that this limitation covers
the case of the implicit transaction when
default_transaction_isolation = 'repeatable read' [2]. I had a
workaround for that [3], but it looks a bit awkward.
2) Using output parameters for a stored procedure causes an extra
snapshot to be held. And that snapshot is difficult (unsafe?) to
release [3].
Present solution
The present patch implements a new utility command WAIT FOR LSN
'target_lsn' [, TIMEOUT 'timeout'][, THROW 'throw']. Unlike previous
attempts to implement custom syntax, it uses only one extra unreserved
keyword. The parameters are implemented as generic_option_list.
Custom syntax eliminates the problem of running within an empty
transaction of REPEATABLE READ level or higher. We don't need to
look up a system catalog. Thus, we don't have to set a transaction snapshot.
Also, revising PlannedStmtRequiresSnapshot() allows us to avoid
holding a snapshot to return a value. Therefore, the WAIT command in
the attached patch returns its result status.
Also, the attached patch explicitly checks if the standby has been
promoted to throw the most relevant form of an error. The issue of
inaccurate error messages has been previously spotted in [5].
Any comments?
Links.
1. /messages/by-id/E1sZwuz-002NPQ-Lc@gemulon.postgresql.org
2. /messages/by-id/14de8671-e328-4c3e-b136-664f6f13a39f@iki.fi
3. /messages/by-id/CAPpHfdvRmTzGJw5rQdSMkTxUPZkjwtbQ=LJE2u9Jqh9gFXHpmg@mail.gmail.com
4. /messages/by-id/4953563546cb8c8851f84c7debf723ef@postgrespro.ru
5. /messages/by-id/ab0eddce-06d4-4db2-87ce-46fa2427806c@iki.fi
------
Regards,
Alexander Korotkov
Supabase
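For readers following along, here is a rough sketch of how the result status and the THROW option seem to interact, judging only by the strings the TAP test in the patch checks for. wait_outcome() and wait_failed() are hypothetical helpers, not functions from the patch, and the real logic in commands/wait.c may well differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the WaitLSNResult enum declared in xlogwait.h in the patch. */
typedef enum
{
    WAIT_LSN_RESULT_SUCCESS,
    WAIT_LSN_RESULT_TIMEOUT,
    WAIT_LSN_RESULT_NOT_IN_RECOVERY
} WaitLSNResult;

/*
 * Hypothetical helper: returns the one-word status the statement would
 * report, or NULL if it would raise an error instead.  In the error case
 * *error is set to the message fragment the TAP test greps for.
 */
static const char *
wait_outcome(WaitLSNResult result, bool throw_on_failure, const char **error)
{
    assert(error != NULL);
    *error = NULL;

    switch (result)
    {
        case WAIT_LSN_RESULT_SUCCESS:
            return "success";
        case WAIT_LSN_RESULT_TIMEOUT:
            if (!throw_on_failure)
                return "timeout";
            *error = "timed out while waiting for target LSN";
            return NULL;
        case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
            if (!throw_on_failure)
                return "not in recovery";
            *error = "recovery is not in progress";
            return NULL;
    }
    return NULL;
}

/*
 * Hypothetical client-side check: anything other than "success" means the
 * standby cannot be trusted to show the caller's own writes yet.
 */
static bool
wait_failed(const char *status)
{
    return status == NULL || strcmp(status, "success") != 0;
}
```

With THROW 'false' a client can branch on the returned word (for example, fall back to reading from the primary when wait_failed() is true) instead of having to catch an error.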
Hi!
What's the current status of
https://commitfest.postgresql.org/50/5167/ ? Should we close it or
reattach to this thread?
--
Best regards,
Kirill Reshke
On 12/4/24 18:12, Kirill Reshke wrote:
On Wed, 27 Nov 2024 at 09:09, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Any comments?
What's the current status of
https://commitfest.postgresql.org/50/5167/ ? Should we close it or
reattach to this thread?
To push this feature further I rebased the patch onto current master.
Also, let's add a commitfest entry:
https://commitfest.postgresql.org/52/5550/
--
regards, Andrei Lepikhov
Attachments:
v2-0001-Implement-WAIT-FOR-command.patch (text/x-patch; charset=UTF-8)
From ea224b84d343ea726f47af30a7a974e0736d79cc Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" <lepihov@gmail.com>
Date: Thu, 6 Feb 2025 14:13:09 +0700
Subject: [PATCH v2] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
a specific WAL location to be replayed. This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this aspect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
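A minimal, self-contained sketch of the waiter queue described in the commit message above, for illustration only. It is not the patch's code: a fixed-size binary min-heap stands in for the pairing heap, WaitLSNLock and the latches are left out, and WaitQueue, add_waiter(), wakeup() and after_replay() are made-up names. LSN 0 plays the role of InvalidXLogRecPtr ("wake everyone").

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define NO_WAITERS  UINT64_MAX  /* stands in for PG_UINT64_MAX */

typedef struct
{
    uint64_t lsn;     /* LSN this waiter is waiting for */
    int      procno;  /* process to wake up */
} Waiter;

/* Toy shared state: binary min-heap by LSN plus a cached minimum. */
typedef struct
{
    uint64_t min_waited_lsn;  /* NO_WAITERS when the heap is empty */
    int      n;
    Waiter   heap[MAX_WAITERS];
} WaitQueue;

static void
swap_waiters(Waiter *a, Waiter *b)
{
    Waiter tmp = *a;

    *a = *b;
    *b = tmp;
}

static void
update_min(WaitQueue *q)
{
    q->min_waited_lsn = (q->n > 0) ? q->heap[0].lsn : NO_WAITERS;
}

/* Insert a waiter and sift it up so the smallest LSN stays on top. */
static void
add_waiter(WaitQueue *q, uint64_t lsn, int procno)
{
    int i;

    assert(q->n < MAX_WAITERS);
    i = q->n++;
    q->heap[i].lsn = lsn;
    q->heap[i].procno = procno;
    while (i > 0 && q->heap[(i - 1) / 2].lsn > q->heap[i].lsn)
    {
        swap_waiters(&q->heap[(i - 1) / 2], &q->heap[i]);
        i = (i - 1) / 2;
    }
    update_min(q);
}

/* Remove and return the waiter with the smallest LSN. */
static Waiter
pop_min(WaitQueue *q)
{
    Waiter top = q->heap[0];
    int    i = 0;

    q->heap[0] = q->heap[--q->n];
    for (;;)
    {
        int l = 2 * i + 1, r = l + 1, m = i;

        if (l < q->n && q->heap[l].lsn < q->heap[m].lsn)
            m = l;
        if (r < q->n && q->heap[r].lsn < q->heap[m].lsn)
            m = r;
        if (m == i)
            break;
        swap_waiters(&q->heap[i], &q->heap[m]);
        i = m;
    }
    return top;
}

/*
 * Collect the procnos of all waiters whose LSN is <= current_lsn
 * (all waiters if current_lsn is 0).  Returns how many were collected.
 */
static int
wakeup(WaitQueue *q, uint64_t current_lsn, int *woken)
{
    int n = 0;

    while (q->n > 0)
    {
        if (current_lsn != 0 && q->heap[0].lsn > current_lsn)
            break;
        woken[n++] = pop_min(q).procno;
    }
    update_min(q);
    return n;
}

/* Replay-loop hook: one cheap comparison first, heap walk only if needed. */
static int
after_replay(WaitQueue *q, uint64_t replayed_lsn, int *woken)
{
    if (replayed_lsn < q->min_waited_lsn)
        return 0;
    return wakeup(q, replayed_lsn, woken);
}
```

The cached minimum is what keeps the per-record cost in the replay loop down to a single comparison when nobody is waiting or when the nearest target is still ahead; in the patch that value is an atomic that is read without taking the lock.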
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 336 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 185 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 15 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 89 +++++
src/include/commands/wait.h | 21 ++
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/044_wait_for_lsn.pl | 217 +++++++++++
src/tools/pgindent/typedefs.list | 4 +
28 files changed, 967 insertions(+), 11 deletions(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/044_wait_for_lsn.pl
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867..8b585cba75 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY wait SYSTEM "wait.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83f..bd14ec00d2 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &wait;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..a32f473e0a 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8..74a62ab3ea 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d331ab90d7..8336bb0cd1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9c270e7d46..62c37f31ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6194,6 +6195,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 473de6710d..5364576ca5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1829,6 +1830,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 0000000000..313c8cc35d
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,336 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap. Waiting processes
+ * are ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber *wakeUpProcs;
+ int numWakeUpProcs = 0;
+
+ wakeUpProcs = palloc(sizeof(ProcNumber) * MaxBackends);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not yet
+ * replayed. Record the process numbers to wake up, but to avoid holding
+ * the lock for too long, send the wakeups only after releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed. As
+ * this is a time-consuming operation, we do it outside of WaitLSNLock.
+ * This is actually fine because procLatch isn't ever freed, so at worst
+ * we set the wrong process' (or no process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ {
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+ }
+ pfree(wakeUpProcs);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ef0d407a38..f5db28bbd2 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..8351733500
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ char *result;
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ char *name = str_tolower(defel->defname, strlen(defel->defname),
+ DEFAULT_COLLATION_OID);
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ timeout = pg_strtoint64(strVal(defel->arg));
+ }
+ else if (strcmp(name, "throw") == 0)
+ {
+ throw = DatumGetBool(DirectFunctionCall1(boolin,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong wait argument: %s",
+ defel->defname)));
+ }
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is
+ * implemented as a utility command, not as a function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate a catalog snapshot if any. And we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As the result we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1..fa8431f794 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d7f9c00c40..67aa9554e2 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16334,6 +16335,14 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR generic_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
/*
* Aggregate decoration clauses
@@ -17991,6 +18000,7 @@ unreserved_keyword:
| VIEW
| VIEWS
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18646,6 +18656,7 @@ bare_label_keyword:
| VIEW
| VIEWS
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed7036..27b447b7a7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a2..dbb613663f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 6f22496305..661296107c 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1162,10 +1162,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1751,7 +1752,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d5801..d23ac3b0f0 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f07162..3b282043ec 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -346,6 +347,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 0000000000..41234f6b96
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process, which may wait for LSN replay. An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..a7fa00ed41
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1..567586f2ec 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ffe155ee20..3dc1c1a56f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4305,4 +4305,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index cf2917ad07..0d0d8f4ab4 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -492,6 +492,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf56545238..a3f6607128 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d5..c4606d6504 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 0428704dbf..c1328b1e16 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -52,6 +52,7 @@ tests += {
't/041_checkpoint_at_promote.pl',
't/042_low_level_backup.pl',
't/043_no_contrecord_switch.pl',
+ 't/044_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/044_wait_for_lsn.pl b/src/test/recovery/t/044_wait_for_lsn.pl
new file mode 100644
index 0000000000..79c2c49b9c
--- /dev/null
+++ b/src/test/recovery/t/044_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the lsn replay on standby using
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance. So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, the log_count() function logs whether at least the expected
+# number of rows is visible.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9a3bee93de..1e0be9f4f6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3149,7 +3149,11 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5
27.11.2024 07:08, Alexander Korotkov wrote:
Present solution
The present patch implements a new utility command WAIT FOR LSN
'target_lsn' [, TIMEOUT 'timeout'][, THROW 'throw']. Unlike previous
attempts to implement custom syntax, it uses only one extra unreserved
keyword. The parameters are implemented as generic_option_list.
Custom syntax eliminates the problem of running within an empty
transaction of REPEATABLE READ level or higher. We don't need to
lookup a system catalog. Thus, we have to set a transaction snapshot.
Also, revising PlannedStmtRequiresSnapshot() allows us to avoid
holding a snapshot to return a value. Therefore, the WAIT command in
the attached patch returns its result status.
Also, the attached patch explicitly checks if the standby has been
promoted to throw the most relevant form of an error. The issue of
inaccurate error messages has been previously spotted in [5].
Any comments?
Good day, Alexander.
I briefly looked into the patch and have a couple of minor remarks:
1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches of
16 wakeups. There will rarely be more than 16 waiting clients,
and even then it won't be much heavier than fetching all at once.
2. I'd move the `inHeap` field between `procno` and `phNode` to fill the gap
between fields on 64-bit platforms.
Well, I believe it would be better to tweak `pairingheap_node` to make it
clear whether it is in a heap or not. But such a change would be unrelated to
the current patch's purpose. So let's stick with `inHeap`, but move it a bit.
Non-code question: do you imagine the `WAIT` command being reused for other cases?
Is the syntax rule in gram.y convenient enough for such reuse? I believe `LSN`
is not part of the syntax in order not to introduce a new keyword. But is that the correct way?
I have no answer or strong opinion.
Otherwise, the patch looks quite strong to me.
-------
regards
Yura Sokolov
Hi, Yura!
On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
I briefly looked into the patch and have a couple of minor remarks:
1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches of
16 wakeups. There will rarely be more than 16 waiting clients,
and even then it won't be much heavier than fetching all at once.
OK, I've refactored this to use a static array of 16 elements. palloc() is
used only if the waiters don't fit into the static array.
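To make the idea concrete outside of the backend, here is a minimal standalone sketch of "fixed stack array first, heap allocation only on overflow". It is not the patch code: IdList, idlist_add() and the other names are made up for illustration, and plain malloc() stands in for palloc(). One detail worth noting in any such fallback is that entries already collected in the stack array have to be carried over when switching buffers, which the sketch does with memcpy().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define STATIC_SLOTS 16			/* same batch size as discussed above */

/*
 * Hypothetical stand-in for the list of procs to wake up.  Items go into a
 * caller-provided stack buffer of STATIC_SLOTS entries first; a heap array
 * of max_items entries is allocated only when that buffer overflows.
 */
typedef struct IdList
{
	int		   *items;			/* points at stack_buf or a malloc'd array */
	int			count;
	int		   *stack_buf;
	int			max_items;		/* upper bound, like MaxBackends */
} IdList;

static void
idlist_init(IdList *list, int *stack_buf, int max_items)
{
	list->items = stack_buf;
	list->count = 0;
	list->stack_buf = stack_buf;
	list->max_items = max_items;
}

/* Returns 0 on success, -1 if the overflow allocation fails. */
static int
idlist_add(IdList *list, int id)
{
	assert(list->count < list->max_items);

	if (list->items == list->stack_buf && list->count >= STATIC_SLOTS)
	{
		int		   *heap = malloc(sizeof(int) * (size_t) list->max_items);

		if (heap == NULL)
			return -1;
		/* carry over what was already collected in the stack buffer */
		memcpy(heap, list->stack_buf, sizeof(int) * (size_t) list->count);
		list->items = heap;
	}

	list->items[list->count++] = id;
	return 0;
}

static int
idlist_on_heap(const IdList *list)
{
	return list->items != list->stack_buf;
}

/* Frees the overflow array, if any, and resets the list. */
static void
idlist_free(IdList *list)
{
	if (list->items != list->stack_buf)
		free(list->items);
	list->items = list->stack_buf;
	list->count = 0;
}
```

With up to 16 ids the list never touches the allocator; the 17th id triggers a single allocation sized for the worst case, so no further growth is needed.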
2. I'd move the `inHeap` field between `procno` and `phNode` to fill the gap
between fields on 64-bit platforms.
Well, I believe it would be better to tweak `pairingheap_node` to make it
clear whether it is in a heap or not. But such a change would be unrelated to
the current patch's purpose. So let's stick with `inHeap`, but move it a bit.
Ok, `inHeap` is moved.
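For readers who want to see the effect of the field move, the following standalone sketch compares two layouts. The struct and field names are hypothetical stand-ins (Node3 imitates a node made of three pointers); they are not the definitions from xlogwait.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap node consisting of three pointers. */
typedef struct Node3
{
	struct Node3 *first_child;
	struct Node3 *next_sibling;
	struct Node3 *prev_or_parent;
} Node3;

/* Flag placed last: padding after procno and again after in_heap. */
typedef struct WaiterLoose
{
	uint64_t	wait_lsn;
	int			procno;
	Node3		ph_node;
	bool		in_heap;
} WaiterLoose;

/* Flag placed right after procno: it reuses the gap before ph_node. */
typedef struct WaiterCompact
{
	uint64_t	wait_lsn;
	int			procno;
	bool		in_heap;
	Node3		ph_node;
} WaiterCompact;

/* Number of bytes saved per entry by the compact ordering. */
static size_t
padding_saved(void)
{
	/* the flag sits immediately after procno in the compact layout */
	assert(offsetof(WaiterCompact, in_heap) ==
		   offsetof(WaiterCompact, procno) + sizeof(int));

	return sizeof(WaiterLoose) - sizeof(WaiterCompact);
}
```

On a typical 64-bit platform with 8-byte pointers I would expect the loose layout to take 48 bytes and the compact one 40, but the exact numbers depend on the ABI, so the sketch only relies on the compact layout never being larger.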
Non-code question: do you imagine the `WAIT` command being reused for other cases?
Is the syntax rule in gram.y convenient enough for such reuse? I believe `LSN`
is not part of the syntax in order not to introduce a new keyword. But is that the correct way?
I have no answer or strong opinion.
This is a conscious decision. New rules and new keywords add extra
states to the parser state machine. That could raise the question of
whether the feature is valuable enough to justify slowing down the parser.
This is why I tried to make this feature as non-invasive as possible
in terms of the parser. And yes, there potentially could be other things
to wait for. For instance, instead of waiting for LSN replay we could be
waiting for the replay of a given xid to finish.
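To illustrate what keeping the option names out of the grammar implies, here is a rough standalone sketch of matching name/value pairs in C code rather than in parser rules. Everything in it is an assumption made for illustration: NameValue, WaitOptions and parse_wait_options() are hypothetical, the LSN parser accepts only the plain hi/lo hexadecimal form, and the boolean parser knows only "true" and "false", which is stricter than the backend's boolin().

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct NameValue		/* hypothetical generic option */
{
	const char *name;
	const char *value;
} NameValue;

typedef struct WaitOptions		/* hypothetical parsed result */
{
	uint64_t	lsn;
	bool		have_lsn;
	int64_t		timeout_ms;		/* 0 means no timeout */
	bool		throw_error;	/* defaults to true */
} WaitOptions;

/* Case-insensitive match against a lowercase ASCII keyword. */
static bool
name_is(const char *name, const char *expected)
{
	for (; *name && *expected; name++, expected++)
		if (tolower((unsigned char) *name) != *expected)
			return false;
	return *name == '\0' && *expected == '\0';
}

/* Parse "hi/lo" in hexadecimal, at most 8 digits per half. */
static int
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi, lo;
	int			consumed = 0;

	/* reject whitespace, signs and "0x" prefixes that sscanf would accept */
	if (strspn(text, "0123456789abcdefABCDEF/") != strlen(text))
		return -1;
	if (sscanf(text, "%8x/%8x%n", &hi, &lo, &consumed) != 2 ||
		text[consumed] != '\0')
		return -1;
	*lsn = ((uint64_t) hi << 32) | lo;
	return 0;
}

/* Returns 0 on success, -1 on an unknown name, bad value or missing lsn. */
static int
parse_wait_options(const NameValue *opts, int nopts, WaitOptions *out)
{
	out->lsn = 0;
	out->have_lsn = false;
	out->timeout_ms = 0;
	out->throw_error = true;

	for (int i = 0; i < nopts; i++)
	{
		const char *value = opts[i].value;

		if (name_is(opts[i].name, "lsn"))
		{
			if (parse_lsn(value, &out->lsn) != 0)
				return -1;
			out->have_lsn = true;
		}
		else if (name_is(opts[i].name, "timeout"))
		{
			char	   *end;
			long long	v;

			errno = 0;
			v = strtoll(value, &end, 10);
			if (errno != 0 || end == value || *end != '\0' || v < 0)
				return -1;
			out->timeout_ms = v;
		}
		else if (name_is(opts[i].name, "throw"))
		{
			if (name_is(value, "true"))
				out->throw_error = true;
			else if (name_is(value, "false"))
				out->throw_error = false;
			else
				return -1;
		}
		else
			return -1;			/* unknown option name */
	}
	return out->have_lsn ? 0 : -1;
}
```

With this shape, adding another thing to wait for (an xid, say) would mean one more branch in the C code and no new grammar rule or keyword.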
Otherwise, the patch looks quite strong to me.
Great, thank you!
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v2-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 6324f7496fac463d98857b2c8ac9cbe3f2f40abf Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 19 Nov 2024 07:16:41 +0200
Subject: [PATCH v2] Implement WAIT FOR command
WAIT FOR is to be used on standby and waits for a specific WAL location
to be replayed. This command is useful when the user makes some data
changes on primary and needs a guarantee that these changes are visible
on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this aspect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 351 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 185 +++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 14 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 89 +++++
src/include/commands/wait.h | 21 ++
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/044_wait_for_lsn.pl | 217 +++++++++++
src/tools/pgindent/typedefs.list | 4 +
28 files changed, 981 insertions(+), 11 deletions(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/044_wait_for_lsn.pl
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..8b585cba751 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY wait SYSTEM "wait.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..bd14ec00d2d 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &wait;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a50fd99d9e5..12ea4f2cb45 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6194,6 +6195,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 473de6710d7..5364576ca5a 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1829,6 +1830,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for, then walk
+ * over the shared memory pairing heap and set latches to notify
+ * the waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..5b70ba90ec1
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,351 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap. Waiting processes
+ * are ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to avoid palloc() for most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcsStatic[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ ProcNumber *wakeUpProcs = wakeUpProcsStatic;
+ int numWakeUpProcs = 0;
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not yet
+ * replayed. Record the process numbers to wake up, but to avoid holding
+ * the lock for too long, send the wakeups only after releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ /*
+ * If the waiters no longer fit into the static array of
+ * WAKEUP_PROC_STATIC_ARRAY_SIZE entries, allocate an array with an
+ * entry for every backend. That is enough for every case.
+ */
+ if (wakeUpProcs == wakeUpProcsStatic &&
+ numWakeUpProcs >= WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ wakeUpProcs = palloc(sizeof(ProcNumber) * MaxBackends);
+
+ Assert(numWakeUpProcs < MaxBackends);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed. As
+ * this is a time-consuming operation, we do it outside of WaitLSNLock.
+ * This is actually fine because procLatch isn't ever freed, so at worst
+ * we can set the wrong process' (or no process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ if (wakeUpProcs != wakeUpProcsStatic)
+ pfree(wakeUpProcs);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..d8f6965d8c6 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ef0d407a383..f5db28bbd22 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..83517335003
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ char *result;
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ char *name = str_tolower(defel->defname, strlen(defel->defname),
+ DEFAULT_COLLATION_OID);
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ timeout = pg_strtoint64(strVal(defel->arg));
+ }
+ else if (strcmp(name, "throw") == 0)
+ {
+ throw = DatumGetBool(DirectFunctionCall1(boolin,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong wait argument: %s",
+ defel->defname)));
+ }
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is
+ * a utility command, not a procedure or a function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, we should
+ * be done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d3887628d46..4f8f242b2cf 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16341,6 +16342,14 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR generic_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
/*
* Aggregate decoration clauses
@@ -17999,6 +18008,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a20..dbb613663fa 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 6f22496305a..661296107ce 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1162,10 +1162,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1751,7 +1752,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..3b282043eca 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -346,6 +347,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process, which may wait for LSN replay. An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 8dd421fa0ef..08fb233ecae 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4306,4 +4306,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 0428704dbfd..52ec036e27e 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -52,6 +52,7 @@ tests += {
't/041_checkpoint_at_promote.pl',
't/042_low_level_backup.pl',
't/043_no_contrecord_switch.pl',
+ 't/044_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/044_wait_for_lsn.pl b/src/test/recovery/t/044_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/044_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance, so that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, a stored function logs if as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b6c170ac249..6b05cd3842f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3151,7 +3151,11 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5 (Apple Git-154)
17.02.2025 00:27, Alexander Korotkov wrote:
On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
I briefly looked into the patch and have a couple of minor remarks:
1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches
of 16 wakeups. There will rarely be more than 16 waiting clients,
and even then it won't be much heavier than fetching all at once.
OK, I've refactored this to use a static array of size 16. palloc() is
used only if we don't fit in the static array.
I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made a loop without a call to palloc in WaitLSNWakeup. It uses "goto" to
keep indentation; perhaps `do {} while` would be better?
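For comparison, the `do {} while` shape asked about above could look roughly like the sketch below. It is a standalone simulation, not code from the patch: a sorted array stands in for the shared pairing heap, the batch size is shrunk from 16 to 4 so that several passes actually happen, and all names (`add_waiter`, `wake_up_to`, `woken`, `batches_run`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WAKEUP_BATCH_SIZE 4   /* the patch uses 16; 4 forces several passes here */
#define MAX_WAITERS 32

/* Hypothetical stand-in for the shared heap: waited LSNs in ascending order. */
static uint64_t waiters[MAX_WAITERS];
static int      woken[MAX_WAITERS];
static int      nwaiters = 0;
static int      head = 0;           /* index of the first waiter still queued */
static int      batches_run = 0;

/* Queue a waiter; callers must add LSNs in ascending order. */
static void
add_waiter(uint64_t lsn)
{
    assert(nwaiters < MAX_WAITERS);
    assert(nwaiters == 0 || waiters[nwaiters - 1] <= lsn);
    waiters[nwaiters++] = lsn;
}

/* Wake every queued waiter whose LSN is <= current_lsn, one batch at a time. */
static void
wake_up_to(uint64_t current_lsn)
{
    int batch[WAKEUP_BATCH_SIZE];
    int nbatch;

    do
    {
        nbatch = 0;

        /* The patch would hold WaitLSNLock while collecting this batch. */
        while (head < nwaiters && waiters[head] <= current_lsn)
        {
            batch[nbatch++] = head++;
            if (nbatch == WAKEUP_BATCH_SIZE)
                break;
        }

        /* Lock released: deliver the wakeups (SetLatch() in the patch). */
        for (int i = 0; i < nbatch; i++)
            woken[batch[i]] = 1;

        batches_run++;

        /* A full batch means more satisfied waiters may still be queued. */
    } while (nbatch == WAKEUP_BATCH_SIZE);
}
```

As with the goto version, a waiter count that is an exact multiple of the batch size costs one extra, empty pass.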
-------
regards
Yura Sokolov aka funny-falcon
Attachments:
v3-0001-Implement-WAIT-FOR-command.patch (text/x-patch; charset=UTF-8)
From fa107e15eab3ec2493f0663f03b563d49979e0b5 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Fri, 28 Feb 2025 15:40:18 +0300
Subject: [PATCH v3] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
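As a rough illustration of why this stays cheap during replay: the patch keeps the smallest waited LSN in an atomic (minWaitedLSN) and compares the replayed position against it before taking WaitLSNLock. The standalone sketch below mimics that fast path with C11 atomics and a plain array in place of pg_atomic_uint64 and the pairing heap. Every name in it is hypothetical, and it is single-threaded, so it shows the control flow only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_WAITERS 8

static uint64_t waiting[MAX_WAITERS];               /* 0 means "free slot" */
static _Atomic uint64_t min_waited_lsn = UINT64_MAX; /* UINT64_MAX: no waiters */
static int slow_path_calls = 0;

/* Recompute the published minimum (done under the lock in the patch). */
static void
recompute_min(void)
{
    uint64_t min = UINT64_MAX;

    for (int i = 0; i < MAX_WAITERS; i++)
        if (waiting[i] != 0 && waiting[i] < min)
            min = waiting[i];
    atomic_store(&min_waited_lsn, min);
}

/* Register a waiter in the given slot. */
static void
add_waiter(int slot, uint64_t lsn)
{
    assert(slot >= 0 && slot < MAX_WAITERS && lsn != 0);
    waiting[slot] = lsn;
    recompute_min();
}

/* Called after each replayed record; returns the number of waiters woken. */
static int
after_replay(uint64_t replayed_lsn)
{
    int woken = 0;

    /* Fast path: a lock-free read tells us nobody can be satisfied yet. */
    if (replayed_lsn < atomic_load(&min_waited_lsn))
        return 0;

    /* Slow path: remove satisfied waiters, then publish the new minimum. */
    slow_path_calls++;
    for (int i = 0; i < MAX_WAITERS; i++)
    {
        if (waiting[i] != 0 && waiting[i] <= replayed_lsn)
        {
            waiting[i] = 0;
            woken++;
        }
    }
    recompute_min();
    return woken;
}
```

With no waiters registered the minimum stays at UINT64_MAX, so replay never enters the slow path at all.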
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this aspect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 347 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 185 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 14 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 89 +++++
src/include/commands/wait.h | 21 ++
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/045_wait_for_lsn.pl | 217 +++++++++++
src/tools/pgindent/typedefs.list | 4 +
28 files changed, 977 insertions(+), 11 deletions(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/045_wait_for_lsn.pl
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..8b585cba751 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY wait SYSTEM "wait.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..bd14ec00d2d 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &wait;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..b9abb696a5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6219,6 +6220,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52f53fa12e0..b03a39b510d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1831,6 +1832,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..a0f0e480a48
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap. Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the array of procs to wake up in WaitLSNWakeup(), allocated
+ * on the stack. It should be enough to need a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+resume:
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not yet
+ * replayed. Record the process numbers to wake up, but to avoid holding
+ * the lock for too long, send the wakeups only after releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < MaxBackends);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed. As
+ * these are time-consuming operations, we do this outside of WaitLSNLock.
+ * This is actually fine because procLatch isn't ever freed, so we just
+ * can potentially set the wrong process' (or no process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ goto resume;
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited due to timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 85cfea6fd71..12459111f7c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -63,6 +63,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ce8d1ab8bac..b1c60f60ea7 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -52,4 +52,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..a5f44de1303
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ char *name = str_tolower(defel->defname, strlen(defel->defname),
+ DEFAULT_COLLATION_OID);
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ timeout = pg_strtoint64(strVal(defel->arg));
+ }
+ else if (strcmp(name, "throw") == 0)
+ {
+ throw = DatumGetBool(DirectFunctionCall1(boolin,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong wait argument: %s",
+ defel->defname)));
+ }
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is
+ * a utility command, not a function or a stored procedure.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, we should
+ * be done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As the result we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7d99c9355c6..11265ae3383 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16341,6 +16342,14 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR generic_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
/*
* Aggregate decoration clauses
@@ -17999,6 +18008,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a20..dbb613663fa 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index dea24453a6c..61cf02c9527 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1194,10 +1194,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1792,7 +1793,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..3b282043eca 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -346,6 +347,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 0b208f51bdd..1c3baac08a9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4317,4 +4317,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde1434..8a8f1f6c427 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
't/042_low_level_backup.pl',
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
+ 't/045_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/045_wait_for_lsn.pl b/src/test/recovery/t/045_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/045_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance. So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions, each waiting for a corresponding insertion. When waiting is
+# finished, the log_count() function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fcb968e1ffe..7b6c30c8d4f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3169,7 +3169,11 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.43.0
28.02.2025 16:03, Yura Sokolov wrote:
17.02.2025 00:27, Alexander Korotkov wrote:
On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
I briefly looked into the patch and have a couple of minor remarks:
1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches
of 16 wakeups. It's doubtful there will often be more than 16 waiting clients,
and even then it won't be much heavier than fetching all at once.

OK, I've refactored this to use a static array of size 16. palloc() is
used only if we don't fit into the static array.

I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made a loop without a call to palloc in WaitLSNWakeup. It uses "goto" to
keep indentation; perhaps `do {} while` would be better?

And fixed:
'WAIT' is marked as BARE_LABEL in kwlist.h, but it is missing from
gram.y's bare_label_keyword rule
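
For illustration only, the `do {} while` shape of the batched wakeup loop could look roughly like the standalone sketch below. This is not the patch code: WaiterQueue, pop_replayed_waiters(), wake_up_waiters() and WAKEUP_BATCH_SIZE are hypothetical names, a sorted array stands in for the shared-memory pairing heap, and a flag array stands in for SetLatch(). Locking is only indicated in comments.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batch size; the patch calls it WAKEUP_PROC_STATIC_ARRAY_SIZE. */
#define WAKEUP_BATCH_SIZE 16

/* Toy stand-in for the waiter queue: target LSNs sorted in ascending order. */
typedef struct WaiterQueue
{
	const uint64_t *lsns;		/* target LSN of each waiter, ascending */
	unsigned char *woken;		/* set to 1 when the waiter is "woken" */
	size_t		count;			/* total number of waiters */
	size_t		next;			/* index of the first waiter still queued */
} WaiterQueue;

/*
 * Remove up to 'max' waiters whose target LSN is <= current_lsn and store
 * their indexes in 'out'. Returns the number of waiters removed. In real
 * code this part would run while holding the lock.
 */
size_t
pop_replayed_waiters(WaiterQueue *q, uint64_t current_lsn,
					 size_t *out, size_t max)
{
	size_t		n = 0;

	while (n < max && q->next < q->count && q->lsns[q->next] <= current_lsn)
		out[n++] = q->next++;

	return n;
}

/*
 * Wake up all waiters whose LSN has been replayed, in fixed-size batches,
 * without any dynamic allocation. Returns the total number woken.
 */
size_t
wake_up_waiters(WaiterQueue *q, uint64_t current_lsn)
{
	size_t		batch[WAKEUP_BATCH_SIZE];
	size_t		n;
	size_t		total = 0;

	do
	{
		/* Real code: take the lock, collect a batch, release the lock. */
		n = pop_replayed_waiters(q, current_lsn, batch, WAKEUP_BATCH_SIZE);

		/* Real code: set latches here, outside of the lock. */
		for (size_t i = 0; i < n; i++)
			q->woken[batch[i]] = 1;

		total += n;

		/* A full batch means more waiters may remain, so go around again. */
	} while (n == WAKEUP_BATCH_SIZE);

	return total;
}
```

Compared with the "goto resume" form, the loop condition replaces the trailing recheck; the cost is one extra indentation level for the function body.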
-------
regards
Yura Sokolov aka funny-falcon
Attachments:
v4-0001-Implement-WAIT-FOR-command.patch (text/x-patch; charset=UTF-8)
From d9c44427a4cbecd6dd27edae48ea42d933756ff9 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Fri, 28 Feb 2025 15:40:18 +0300
Subject: [PATCH v4] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 347 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 185 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 15 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 89 +++++
src/include/commands/wait.h | 21 ++
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/045_wait_for_lsn.pl | 217 +++++++++++
src/tools/pgindent/typedefs.list | 4 +
28 files changed, 978 insertions(+), 11 deletions(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/045_wait_for_lsn.pl
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..8b585cba751 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY wait SYSTEM "wait.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..bd14ec00d2d 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &wait;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..b9abb696a5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6219,6 +6220,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52f53fa12e0..b03a39b510d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1831,6 +1832,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..a0f0e480a48
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap. Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+resume:
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not yet
+ * replayed. Record the process numbers to wake up, but to avoid holding
+ * the lock for too long, send the wakeups only after releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < MaxBackends);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed. As
+ * this is a time-consuming operation, we do it outside of WaitLSNLock.
+ * This is actually fine because procLatch isn't ever freed, so at worst
+ * we can set the wrong process' (or no process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ goto resume;
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 85cfea6fd71..12459111f7c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -63,6 +63,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ce8d1ab8bac..b1c60f60ea7 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -52,4 +52,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..a5f44de1303
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ char *name = str_tolower(defel->defname, strlen(defel->defname),
+ DEFAULT_COLLATION_OID);
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ timeout = pg_strtoint64(strVal(defel->arg));
+ }
+ else if (strcmp(name, "throw") == 0)
+ {
+ throw = DatumGetBool(DirectFunctionCall1(boolin,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong wait argument: %s",
+ defel->defname)));
+ }
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR
+ * is a utility command, not a function.
+ *
+ * At first, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * At second, invalidate a catalog snapshot if any. And we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As the result we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7d99c9355c6..3034573648f 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16341,6 +16342,14 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR generic_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
/*
* Aggregate decoration clauses
@@ -17999,6 +18008,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18655,6 +18665,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 49204f91a20..dbb613663fa 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index dea24453a6c..61cf02c9527 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1194,10 +1194,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1792,7 +1793,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..3b282043eca 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -346,6 +347,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay. An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 0b208f51bdd..1c3baac08a9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4317,4 +4317,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde1434..8a8f1f6c427 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
't/042_low_level_backup.pl',
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
+ 't/045_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/045_wait_for_lsn.pl b/src/test/recovery/t/045_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/045_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance. So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions, each waiting for a corresponding insertion. When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fcb968e1ffe..7b6c30c8d4f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3169,7 +3169,11 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.43.0
On Fri, Feb 28, 2025 at 3:55 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
28.02.2025 16:03, Yura Sokolov wrote:
17.02.2025 00:27, Alexander Korotkov wrote:
On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
I briefly looked into the patch and have a couple of minor remarks:
1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array,
say of 16 elements, and a loop around the remaining function body acting in
batches of 16 wakeups. There will rarely be more than 16 waiting clients,
and even then it won't be much heavier than fetching all at once.
OK, I've refactored this to use a static array of size 16. palloc() is
used only if we don't fit into the static array.
I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made a loop without a call to palloc in WaitLSNWakeup. It uses "goto" to
keep indentation; perhaps `do {} while` would be better?
And fixed:
'WAIT' is marked as BARE_LABEL in kwlist.h, but it is missing from
gram.y's bare_label_keyword rule
Thank you, Yura. I've further revised the patch. Mostly added the
documentation including SQL command reference and few paragraphs in
the high availability chapter explaining the read-your-writes
consistency concept.
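For readers following the review remark about avoiding palloc() in WaitLSNWakeup(), here is a minimal standalone sketch of the batching idea, written in the `do {} while` form raised in the quoted message. It is not the code from the patch. The names Waiter, WaiterQueue, queue_add, queue_pop_ready, wake_waiters and WAKEUP_BATCH_SIZE are hypothetical. A sorted array stands in for the shared-memory pairing heap, setting a `woken` flag stands in for setting a latch, and locking is only indicated in comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAKEUP_BATCH_SIZE 16	/* batch size suggested in the review */
#define QUEUE_CAPACITY 64		/* arbitrary limit for this toy queue */

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

typedef struct Waiter
{
	Lsn			wait_lsn;		/* LSN this waiter is waiting for */
	bool		in_queue;		/* still registered in the queue? */
	bool		woken;			/* stand-in for "latch was set" */
} Waiter;

/* Toy replacement for the pairing heap: kept sorted by ascending LSN. */
typedef struct WaiterQueue
{
	Waiter	   *items[QUEUE_CAPACITY];
	size_t		count;
} WaiterQueue;

/* Register a waiter, keeping the smallest LSN at index 0. */
static void
queue_add(WaiterQueue *q, Waiter *w)
{
	size_t		i = q->count;

	assert(q->count < QUEUE_CAPACITY);
	while (i > 0 && q->items[i - 1]->wait_lsn > w->wait_lsn)
	{
		q->items[i] = q->items[i - 1];
		i--;
	}
	q->items[i] = w;
	q->count++;
	w->in_queue = true;
	w->woken = false;
}

/* Remove at most 'max' waiters whose LSN is already reached. */
static size_t
queue_pop_ready(WaiterQueue *q, Lsn current_lsn, Waiter **out, size_t max)
{
	size_t		n = 0;

	while (n < q->count && n < max && q->items[n]->wait_lsn <= current_lsn)
	{
		out[n] = q->items[n];
		out[n]->in_queue = false;
		n++;
	}
	for (size_t i = n; i < q->count; i++)
		q->items[i - n] = q->items[i];
	q->count -= n;
	return n;
}

/* Wake all waiters up to current_lsn using a fixed-size local batch. */
static size_t
wake_waiters(WaiterQueue *q, Lsn current_lsn)
{
	Waiter	   *batch[WAKEUP_BATCH_SIZE];
	size_t		total = 0;
	size_t		n;

	do
	{
		/* A real implementation would hold the lock only for this call. */
		n = queue_pop_ready(q, current_lsn, batch, WAKEUP_BATCH_SIZE);

		/* With the lock released, notify the collected waiters. */
		for (size_t i = 0; i < n; i++)
			batch[i]->woken = true;
		total += n;
	} while (n == WAKEUP_BATCH_SIZE);	/* a full batch means more may remain */

	return total;
}
```

With 40 registered waiters, 35 of which are at or below the replayed LSN, wake_waiters() makes three passes (16, 16, 3) and leaves the remaining 5 waiters in the queue, without any dynamic allocation.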
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v5-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 8431a654aa5b872acef2bca7e66dfaff7dd5254d Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 10 Mar 2025 12:59:38 +0200
Subject: [PATCH v5] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 216 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 347 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 185 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 15 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 89 +++++
src/include/commands/wait.h | 21 ++
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/045_wait_for_lsn.pl | 217 +++++++++++
src/tools/pgindent/typedefs.list | 4 +
30 files changed, 1248 insertions(+), 11 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/045_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index acf3ac0601d..ae316b5a0c9 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..9d6d3175f02
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,216 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR <replaceable class="parameter">parameter</replaceable> '<replaceable class="parameter">value</replaceable>' [, ... ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ as long as <parameter>throw</parameter> is not specified or set to true.
+ If <parameter>throw</parameter> is set to false, then the command
+ doesn't throw errors.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ The target log sequence number to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ When specified and greater than zero, the command waits until
+ <parameter>lsn</parameter> is reached or the specified
+ <parameter>timeout</parameter> has elapsed. Must be a non-negative
+ integer; the default is zero.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">throw</replaceable></term>
+ <listitem>
+ <para>
+ Specify whether to throw an error in the case of timeout or
+ running on the primary. The valid values are <literal>true</literal>
+ and <literal>false</literal>. The default is <literal>true</literal>.
+ When set to <literal>false</literal> the status can be obtained from the
+ return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Return values</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">success</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">not in recovery</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on standby.
+ That is, after this command completes, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency, while using an async replica for reads and
+ primary for writes. In that case, the <acronym>lsn</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on standby.
+ If a user runs <command>WAIT FOR</command> on primary, it
+ will error out as long as <parameter>throw</parameter> is true.
+ However, if <command>WAIT FOR</command> is
+ called on primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+ </programlisting>
+ </para>
+
+ <para>
+ It may also happen that the target <parameter>lsn</parameter> is not reached
+ within the timeout. In that case, an error is thrown.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20', TIMEOUT '100';
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+ </programlisting>
+ </para>
+
+ <para>
+ The same example uses <command>WAIT FOR</command> with
+ <parameter>throw</parameter> set to false.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20', TIMEOUT '100', THROW 'false';
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+ </programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..b9abb696a5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6219,6 +6220,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a829a055a97..1beb3999769 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1831,6 +1832,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..a0f0e480a48
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap. Waiting processes
+ * are ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack. It should be enough to handle most cases in one iteration.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+resume:
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes until we find an LSN not
+ * yet replayed. Record the process numbers to wake up, but to avoid holding
+ * the lock for too long, send the wakeups only after releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs have already been replayed.
+ * As this is a time-consuming operation, we do it outside of WaitLSNLock.
+ * This is actually fine because procLatch isn't ever freed, so at worst we
+ * can set the wrong process' (or no process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ goto resume;
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery has ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 85cfea6fd71..12459111f7c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -63,6 +63,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ce8d1ab8bac..b1c60f60ea7 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -52,4 +52,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..a5f44de1303
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,185 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ char *name = str_tolower(defel->defname, strlen(defel->defname),
+ DEFAULT_COLLATION_OID);
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ timeout = pg_strtoint64(strVal(defel->arg));
+ }
+ else if (strcmp(name, "throw") == 0)
+ {
+ throw = DatumGetBool(DirectFunctionCall1(boolin,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong wait argument: %s",
+ defel->defname)));
+ }
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is
+ * a utility command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, we should
+ * be done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 271ae26cbaf..e4916148d02 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16369,6 +16370,14 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR generic_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
/*
* Aggregate decoration clauses
@@ -18027,6 +18036,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18683,6 +18693,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 749a79d48ef..1a99e98f55b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index dea24453a6c..61cf02c9527 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1194,10 +1194,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1792,7 +1793,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 3c594415bfd..5849967882e 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -346,6 +347,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process, which may wait for LSN replay. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 23c9e3c5abf..dffa714e2c8 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4319,4 +4319,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde1434..8a8f1f6c427 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
't/042_low_level_backup.pl',
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
+ 't/045_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/045_wait_for_lsn.pl b/src/test/recovery/t/045_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/045_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "new data is visible on standby after WAIT FOR");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance. So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions, each waiting for a corresponding insertion. When the wait
+# finishes, the log_count() function logs a message if as many rows are
+# visible as expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9840060997f..5ce3d36ae6d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3175,7 +3175,11 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5 (Apple Git-154)
10.03.2025 14:30, Alexander Korotkov wrote:
On Fri, Feb 28, 2025 at 3:55 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
28.02.2025 16:03, Yura Sokolov wrote:
17.02.2025 00:27, Alexander Korotkov wrote:
On Thu, Feb 6, 2025 at 10:31 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
I briefly looked into the patch and have a couple of minor remarks:
1. I don't like `palloc` in `WaitLSNWakeup`. I believe it won't cause
problems, but I still don't like it. I'd prefer to see a local fixed array, say
of 16 elements, and a loop around the remaining function body acting in batches
of 16 wakeups. There will rarely be more than 16 waiting clients,
and even then it won't be much heavier than fetching all at once.
OK, I've refactored this to use a static array of size 16. palloc() is
used only if we don't fit in the static array.
I've rebased the patch and:
- fixed a compiler warning in wait.c ("maybe uninitialized 'result'").
- made a loop without a call to palloc in WaitLSNWakeup. It uses "goto" to
keep indentation; perhaps `do {} while` would be better?
And fixed:
'WAIT' is marked as BARE_LABEL in kwlist.h, but it is missing from
gram.y's bare_label_keyword rule.
Thank you, Yura. I've further revised the patch. Mostly added the
documentation, including the SQL command reference and a few paragraphs in
the high availability chapter explaining the read-your-writes
consistency concept.
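To make the read-your-writes check concrete: the condition is that the standby's replay LSN is at or past the LSN captured on the primary. Below is a minimal standalone C sketch of that comparison on the textual LSN form (two hexadecimal halves separated by '/'). The helper names parse_lsn() and replay_reached() are hypothetical and are not part of the patch; inside the server the comparison is done on XLogRecPtr values directly.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN given as "hi/lo" (two hexadecimal fields of up to 8 digits
 * each) into a 64-bit value. Hypothetical helper, for illustration only.
 */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;
	char		extra;

	/* Expect exactly two hex fields separated by '/', with nothing after. */
	if (sscanf(text, "%8X/%8X%c", &hi, &lo, &extra) != 2)
		return false;

	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/*
 * Return true when the replay position has reached the target LSN, i.e.
 * the condition WAIT FOR LSN waits for. Unparsable input yields false.
 */
static bool
replay_reached(const char *replay_text, const char *target_text)
{
	uint64_t	replay;
	uint64_t	target;

	if (!parse_lsn(replay_text, &replay) || !parse_lsn(target_text, &target))
		return false;

	return target <= replay;
}
```

With the values from the documentation example, replay_reached("0/306EA60", "0/306EE20") is false (the standby is still behind), while replay_reached("0/306EE20", "0/306EE20") is true.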
Good day, Alexander.
Looking at the patch "for the last time", I found that the
`pg_wal_replay_wait` function is still mentioned in the documentation and in
one comment. So I fixed the documentation and removed the sentence from the
comment. Otherwise v6 is just a rebased v5.
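Regarding the "goto" versus `do {} while` question above, here is a rough standalone sketch of how the batched wakeup could look with `do {} while`. It is not the patch's code: the WaiterQueue struct is a hypothetical stand-in for the shared pairing heap (just an ascending array of LSNs), locking and SetLatch() are only indicated in comments, and a current_lsn of 0 plays the role of InvalidXLogRecPtr.

```c
#include <stddef.h>
#include <stdint.h>

#define WAKEUP_BATCH_SIZE 16

/* Hypothetical stand-in for the waiter heap: LSNs sorted ascending. */
typedef struct WaiterQueue
{
	const uint64_t *lsns;		/* waited LSNs, smallest first */
	size_t		count;			/* total number of waiters */
	size_t		head;			/* index of the next waiter to remove */
} WaiterQueue;

/*
 * Remove every waiter whose LSN is <= current_lsn (or all remaining waiters
 * when current_lsn is 0) in batches of at most WAKEUP_BATCH_SIZE. Indexes
 * of removed waiters are stored in woken[], which must have room for all
 * waiters. Returns the number of waiters removed; *batches receives the
 * number of loop iterations.
 */
static size_t
wake_waiters(WaiterQueue *queue, uint64_t current_lsn,
			 size_t *woken, size_t *batches)
{
	size_t		batch[WAKEUP_BATCH_SIZE];
	size_t		num_in_batch;
	size_t		total = 0;

	*batches = 0;
	do
	{
		num_in_batch = 0;

		/* The real code would hold WaitLSNLock around this inner loop. */
		while (queue->head < queue->count &&
			   num_in_batch < WAKEUP_BATCH_SIZE)
		{
			if (current_lsn != 0 && queue->lsns[queue->head] > current_lsn)
				break;
			batch[num_in_batch++] = queue->head++;
		}

		/* The real code would set latches here, after releasing the lock. */
		for (size_t i = 0; i < num_in_batch; i++)
			woken[total + i] = batch[i];
		total += num_in_batch;
		(*batches)++;

		/* A full batch means there may be more waiters to process. */
	} while (num_in_batch == WAKEUP_BATCH_SIZE);

	return total;
}
```

The loop condition is the same "need to recheck" test the patch does before `goto resume`, so the behaviour should be equivalent; the cost is one extra level of indentation.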
-------
regards
Yura Sokolov aka funny-falcon
Attachments:
v6-0001-Implement-WAIT-FOR-command.patch (text/x-patch; charset=UTF-8)
From 80b4cb8c0ac75168ab1fce55feccc4f08f32ce34 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 10 Mar 2025 12:59:38 +0200
Subject: [PATCH v6] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>
---
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 216 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 347 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 184 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 15 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 89 +++++
src/include/commands/wait.h | 21 ++
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/045_wait_for_lsn.pl | 217 +++++++++++
src/tools/pgindent/typedefs.list | 4 +
30 files changed, 1247 insertions(+), 11 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/045_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index acf3ac0601d..ae316b5a0c9 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..2352ae9493f
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,216 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR <replaceable class="parameter">parameter</replaceable> '<replaceable class="parameter">value</replaceable>' [, ... ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ as long as <parameter>throw</parameter> is not specified or is set to true.
+ If <parameter>throw</parameter> is set to false, then the command
+ reports the outcome in its return value instead of throwing an error.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ The target log sequence number to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ When specified and greater than zero, the command waits until
+ <parameter>lsn</parameter> is reached or the specified
+ <parameter>timeout</parameter> has elapsed. Must be a non-negative
+ integer, the default is zero.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">throw</replaceable></term>
+ <listitem>
+ <para>
+ Specify whether to throw an error in the case of timeout or
+ running on the primary. The valid values are <literal>true</literal>
+ and <literal>false</literal>. The default is <literal>true</literal>.
+ When set to <literal>false</literal>, the status can be obtained from the
+ return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Return values</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">success</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">not in recovery</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on standby.
+ That is, after this command completes, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency, while using async replica for reads and
+ primary for writes. In that case, the <acronym>lsn</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on standby.
+ If a user runs <command>WAIT FOR</command> on primary, it
+ will error out as long as <parameter>throw</parameter> is true.
+ However, if <command>WAIT FOR</command> is
+ called on primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>lsn</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+ </programlisting>
+ </para>
+
+ <para>
+ It may also happen that target <parameter>lsn</parameter> is not reached
+ within the timeout. In that case, an error is thrown.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20', TIMEOUT '100';
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+ </programlisting>
+ </para>
+
+ <para>
+ The same example uses <command>WAIT FOR</command> with
+ <parameter>throw</parameter> set to false.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20', TIMEOUT '100', THROW 'false';
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+ </programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1b4f21a88d3..e617ae8ead5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2826,6 +2827,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 799fc739e18..b9abb696a5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6219,6 +6220,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a829a055a97..1beb3999769 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1831,6 +1832,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..a0f0e480a48
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap. Waiting processes are
+ * ordered by LSN, so that the waiter with the smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take single iteration for most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+resume:
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not yet
+ * replayed. Record the process numbers to wake up, but to avoid holding
+ * the lock for too long, send the wakeups only after releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed. As
+ * this is a time-consuming operation, we do it outside of WaitLSNLock.
+ * This is actually fine because procLatch isn't ever freed, so at worst
+ * we can set the wrong process' (or no process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ goto resume;
+}
+
+/*
+ * Delete our item from the shared memory pairing heap, if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this command was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently with
+ * the command call, while the target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 85cfea6fd71..12459111f7c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -63,6 +63,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index ce8d1ab8bac..b1c60f60ea7 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -52,4 +52,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..d95782ddaf8
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,184 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ char *name = str_tolower(defel->defname, strlen(defel->defname),
+ DEFAULT_COLLATION_OID);
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ timeout = pg_strtoint64(strVal(defel->arg));
+ }
+ else if (strcmp(name, "throw") == 0)
+ {
+ throw = DatumGetBool(DirectFunctionCall1(boolin,
+ CStringGetDatum(strVal(defel->arg))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong wait argument: %s",
+ defel->defname)));
+ }
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. And we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As the result we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 271ae26cbaf..e4916148d02 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -786,7 +786,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1114,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16369,6 +16370,14 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR generic_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
/*
* Aggregate decoration clauses
@@ -18027,6 +18036,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18683,6 +18693,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..27b447b7a7a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -340,6 +342,7 @@ CreateOrAttachShmemStructs(void)
StatsShmemInit();
WaitEventCustomShmemInit();
InjectionPointShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 749a79d48ef..1a99e98f55b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -896,6 +897,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index dea24453a6c..61cf02c9527 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1194,10 +1194,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1792,7 +1793,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 3c594415bfd..5849967882e 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -88,6 +88,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -346,6 +347,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..0acc61eba5f
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay. An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..a7fa00ed41e
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 23c9e3c5abf..dffa714e2c8 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4319,4 +4319,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 40cf090ce61..6d834f25d2d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -493,6 +493,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..a3f66071288 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 057bcde1434..8a8f1f6c427 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -53,6 +53,7 @@ tests += {
't/042_low_level_backup.pl',
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
+ 't/045_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/045_wait_for_lsn.pl b/src/test/recovery/t/045_wait_for_lsn.pl
new file mode 100644
index 00000000000..79c2c49b9ce
--- /dev/null
+++ b/src/test/recovery/t/045_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}', TIMEOUT '1000000';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance. So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}', TIMEOUT '10';");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}', TIMEOUT '1000';",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, the log_count() function logs whether as many rows are visible
+# as there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}', TIMEOUT '10', THROW 'false';]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dfe2690bdd3..5377d6208e1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3177,7 +3177,11 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.43.0
Hi,
I did a quick look at this patch. I haven't found any correctness
issues, but I have some general review comments and questions about the
grammar / syntax.
1) The sgml docs don't really show the syntax very nicely, it only shows
this at the beginning of wait_for.sgml:
WAIT FOR ( <replaceable class="parameter">parameter</replaceable>
'<replaceable class="parameter">value</replaceable>' [, ... ] ) ]
I kinda understand this comes from using the generic option list (I'll
get to that shortly), but I think it'd be much better to actually show
the "full" syntax here, instead of leaving the "parameters" to later.
2) The syntax description suggests "(" and ")" are required, but that
does not seem to be the case - in fact, it's not even optional, and when
I try using that, I get a syntax error.
3) I have my doubts about using the generic_option_list for this. Yes, I
understand this allows using fewer reserved keywords, but it leads to
some weirdness and I'm not sure it's worth it. Not sure what the right
trade off is here.
Anyway, some examples of the weird stuff implied by this approach:
- it forces "," between the options, which is a clear difference from
what we do for every other command
- it forces everything to be a string, i.e. you can't say "TIMEOUT 10",
it has to be "TIMEOUT '10'"
I don't have a very strong opinion on this, but the result seems a bit
strange to me.
4) I'm not sure I understand the motivation of the "throw false" mode,
and I'm not sure I understand this description in the sgml docs:
On timeout, or if the server is promoted before
<parameter>lsn</parameter> is reached, an error is emitted,
as soon as <parameter>throw</parameter> is not specified or set to
true.
If <parameter>throw</parameter> is set to false, then the command
doesn't throw errors.
I find it a bit confusing. What is the use case for this mode?
5) One place in the docs says:
The target log sequence number to wait for.
This is literally the only place using "log sequence number" in our
code base, I'd just use "LSN" just like every other place.
6) The docs for the TIMEOUT parameter say this:
<varlistentry>
<term><replaceable class="parameter">timeout</replaceable></term>
<listitem>
<para>
When specified and greater than zero, the command waits until
<parameter>lsn</parameter> is reached or the specified
<parameter>timeout</parameter> has elapsed. Must be a non-
negative integer, the default is zero.
</para>
</listitem>
</varlistentry>
That doesn't say what unit the option uses. Is it seconds,
milliseconds or what?
In fact, it'd be nice to let users specify that in the value, similar
to other options (e.g. SET statement_timeout = '10s').
7) One place in the docs says this:
That is, after this function execution, the value returned by
<function>pg_last_wal_replay_lsn</function> should be greater ...
I think the reference to "function execution" is obsolete?
8) I find this confusing:
However, if <command>WAIT FOR</command> is
called on primary promoted from standby and <literal>lsn</literal>
was already replayed, then the <command>WAIT FOR</command> command
just exits immediately.
Does this mean running the WAIT command on a primary (after it was
already promoted) will exit immediately? Why does it matter that it
was promoted from a standby? Shouldn't it exit immediately even for
a standalone instance?
9) xlogwait.c
I think this should start with a basic "design" description of how the
wait is implemented, in a comment at the top of the file. That is, what
we keep in the shared memory, what happens during a wait, how it uses
the pairing heap, etc. After reading this comment I should understand
how it all fits together.
10) WaitForLSNReplay / WaitLSNWakeup
I think the function comment should document the important stuff (e.g.
return values for various situations, how it groups waiters into chunks
of 16 elements during wakeup, ...).
11) WaitLSNProcInfo / WaitLSNState
Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?
regards
--
Tomas Vondra
On Wed, 12 Mar 2025 at 20:14, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
Otherwise v6 is just rebased v5.
I noticed that Tomas's comments from [1] are not yet addressed. I have
changed the commitfest status to Waiting on Author; please address
them and update it to Needs review.
[1]: /messages/by-id/09a98dc9-eeb1-471d-b990-072513c3d584@vondra.me
Regards,
Vignesh
Hi, Tomas.
Thank you so much for your review! Please find the revised patchset.
On Thu, Mar 13, 2025 at 4:15 PM Tomas Vondra <tomas@vondra.me> wrote:
I did a quick look at this patch. I haven't found any correctness
issues, but I have some general review comments and questions about the
grammar / syntax.
1) The sgml docs don't really show the syntax very nicely, it only shows
this at the beginning of wait_for.sgml:
WAIT FOR ( <replaceable class="parameter">parameter</replaceable>
'<replaceable class="parameter">value</replaceable>' [, ... ] ) ]
I kinda understand this comes from using the generic option list (I'll
get to that shortly), but I think it'd be much better to actually show
the "full" syntax here, instead of leaving the "parameters" to later.
Sounds reasonable, changed to show the full syntax in the synopsis.
2) The syntax description suggests "(" and ")" are required, but that
does not seem to be the case - in fact, it's not even optional, and when
I try using that, I get syntax error.
Good catch, fixed.
3) I have my doubts about using the generic_option_list for this. Yes, I
understand this allows using fewer reserved keywords, but it leads to
some weirdness and I'm not sure it's worth it. Not sure what the right
trade off is here.
Anyway, some examples of the weird stuff implied by this approach:
- it forces "," between the options, which is a clear difference from
what we do for every other command
- it forces everything to be a string, i.e. you can't say "TIMEOUT 10",
it has to be "TIMEOUT '10'"
I don't have a very strong opinion on this, but the result seems a bit
strange to me.
I've improved the syntax. I still tried to keep the number of new
keywords and grammar rules minimal. That leads to moving some parser
logic into wait.c. This is probably a bit awkward, but saves our
grammar from bloat. Let me know what you think about this
approach.
4) I'm not sure I understand the motivation of the "throw false" mode,
and I'm not sure I understand this description in the sgml docs:
On timeout, or if the server is promoted before
<parameter>lsn</parameter> is reached, an error is emitted,
as soon as <parameter>throw</parameter> is not specified or set to
true.
If <parameter>throw</parameter> is set to false, then the command
doesn't throw errors.
I find it a bit confusing. What is the use case for this mode?
The idea here is that an application could do some handling of these
errors without having to parse the error messages (parsing error
messages is inconvenient because of localization etc).
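As a minimal sketch of what such client-side handling could look like, the
function below maps the status text returned by WAIT FOR ... NO_THROW to an
enum, so the application branches on a value instead of on error text. The
three strings are the return values listed in wait_for.sgml; the enum and
the function name classify_wait_status are hypothetical and are not part of
the patch or of libpq.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical client-side enum; not defined by the patch or by libpq. */
typedef enum WaitOutcome
{
    WAIT_OUTCOME_SUCCESS,
    WAIT_OUTCOME_TIMEOUT,
    WAIT_OUTCOME_NOT_IN_RECOVERY,
    WAIT_OUTCOME_UNKNOWN
} WaitOutcome;

/*
 * Classify the single-row status string produced by
 * WAIT FOR ... NO_THROW.  Anything unexpected maps to UNKNOWN so the
 * caller can fall back to its generic error path.
 */
WaitOutcome
classify_wait_status(const char *status)
{
    assert(status != NULL);

    if (strcmp(status, "success") == 0)
        return WAIT_OUTCOME_SUCCESS;
    if (strcmp(status, "timeout") == 0)
        return WAIT_OUTCOME_TIMEOUT;
    if (strcmp(status, "not in recovery") == 0)
        return WAIT_OUTCOME_NOT_IN_RECOVERY;
    return WAIT_OUTCOME_UNKNOWN;
}
```

With this, a caller could for instance retry the read on the primary when
the outcome is WAIT_OUTCOME_TIMEOUT or WAIT_OUTCOME_NOT_IN_RECOVERY.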
5) One place in the docs says:
The target log sequence number to wait for.
This is literally the only place using "log sequence number" in our
code base, I'd just use "LSN" just like every other place.
OK fixed.
6) The docs for the TIMEOUT parameter say this:
<varlistentry>
<term><replaceable class="parameter">timeout</replaceable></term>
<listitem>
<para>
When specified and greater than zero, the command waits until
<parameter>lsn</parameter> is reached or the specified
<parameter>timeout</parameter> has elapsed. Must be a non-
negative integer, the default is zero.
</para>
</listitem>
</varlistentry>
That doesn't say what unit the option uses. Is it seconds,
milliseconds or what?
In fact, it'd be nice to let users specify that in the value, similar
to other options (e.g. SET statement_timeout = '10s').
The default unit of milliseconds is now specified. Also, an alternative
way to specify the timeout is now supported: the timeout may be a string
literal consisting of a number and a unit specifier.
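To illustrate the two accepted forms (a bare number of milliseconds, or a
number followed by a unit), here is a rough standalone sketch of such a
conversion. It is not the parser used in wait.c; the function name
timeout_to_ms and the small set of units it accepts (ms, s, min, h) are
assumptions made for the example only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert a timeout such as "100", "10s" or "0.1 s" to milliseconds.
 * A bare number is taken as milliseconds.  Returns -1 for malformed,
 * negative, non-finite or unreasonably large input.
 */
long long
timeout_to_ms(const char *text)
{
    char       *end;
    double      value;
    double      factor;

    assert(text != NULL);

    value = strtod(text, &end);
    /* The range test also rejects NaN, because NaN fails every comparison. */
    if (end == text || !(value >= 0.0 && value <= 1e9))
        return -1;

    /* Allow optional whitespace between the number and the unit. */
    while (isspace((unsigned char) *end))
        end++;

    if (*end == '\0' || strcmp(end, "ms") == 0)
        factor = 1.0;
    else if (strcmp(end, "s") == 0)
        factor = 1000.0;
    else if (strcmp(end, "min") == 0)
        factor = 60000.0;
    else if (strcmp(end, "h") == 0)
        factor = 3600000.0;
    else
        return -1;

    /* Round to the nearest millisecond. */
    return (long long) (value * factor + 0.5);
}
```

Under these assumptions, '0.1 s' and 100 from the documentation examples
both come out as the same 100 ms timeout.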
7) One place in the docs says this:
That is, after this function execution, the value returned by
<function>pg_last_wal_replay_lsn</function> should be greater ...
I think the reference to "function execution" is obsolete?
Actually, this is just the function that reports the current replay LSN,
not the function introduced by a previous version of this patch. We refer
to it just to express the constraint that the LSN must be replayed after
execution of the command.
8) I find this confusing:
However, if <command>WAIT FOR</command> is
called on primary promoted from standby and <literal>lsn</literal>
was already replayed, then the <command>WAIT FOR</command> command
just exits immediately.
Does this mean running the WAIT command on a primary (after it was
already promoted) will exit immediately? Why does it matter that it
was promoted from a standby? Shouldn't it exit immediately even for
a standalone instance?
I think the previous sentence should give the idea that otherwise an
error gets thrown. That also happens immediately, for sure.
9) xlogwait.c
I think this should start with a basic "design" description of how the
wait is implemented, in a comment at the top of the file. That is, what
we keep in the shared memory, what happens during a wait, how it uses
the pairing heap, etc. After reading this comment I should understand
how it all fits together.
OK, I've added the header comment.
10) WaitForLSNReplay / WaitLSNWakeup
I think the function comment should document the important stuff (e.g.
return values for various situations, how it groups waiters into chunks
of 16 elements during wakeup, ...).
Revised header comments for those functions too.
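For readers following the "chunks of 16" point, the loop structure can be
modelled in a few lines. This is a toy model, not the code from xlogwait.c:
a sorted array stands in for the pairing heap, no lock or latch is involved,
the InvalidXLogRecPtr "wake everyone" case is not modelled, and the names
ToyQueue and toy_wakeup are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAKEUP_CHUNK 16         /* mirrors WAKEUP_PROC_STATIC_ARRAY_SIZE */

/* Toy waiter queue: LSNs in ascending order; head is the next to wake. */
typedef struct ToyQueue
{
    const uint64_t *lsns;
    size_t      count;
    size_t      head;
} ToyQueue;

/*
 * Wake every waiter whose LSN is <= replayed.  Returns the number of
 * passes (one lock acquisition each in the real code) and stores the
 * number of woken waiters in *woken.
 */
int
toy_wakeup(ToyQueue *q, uint64_t replayed, size_t *woken)
{
    int         passes = 0;
    size_t      n;

    assert(q != NULL && woken != NULL);
    *woken = 0;

    do
    {
        n = 0;

        /* "Locked" part: collect at most WAKEUP_CHUNK ready waiters. */
        while (q->head < q->count && q->lsns[q->head] <= replayed)
        {
            q->head++;
            if (++n == WAKEUP_CHUNK)
                break;
        }

        /* "Unlocked" part: the real code sets the latches here. */
        *woken += n;
        passes++;

        /* A full chunk means more ready waiters may remain: recheck. */
    } while (n == WAKEUP_CHUNK);

    return passes;
}
```

In this model 35 ready waiters take three passes (16 + 16 + 3), while the
common case of a handful of waiters takes a single pass.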
11) WaitLSNProcInfo / WaitLSNState
Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?
Hmm, I don't remember why I moved them to xlogwait.h. OK, moved them
back to xlogwait.c.
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v6-0001-Implement-WAIT-FOR-command.patch (application/x-patch)
From 11f1b1db81ff323354035dba34a34f5ac55177a3 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 10 Mar 2025 12:59:38 +0200
Subject: [PATCH v6] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee to see
these changes are on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 226 +++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 435 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 235 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 29 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 +
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 41 ++
src/include/commands/wait.h | 21 +
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/046_wait_for_lsn.pl | 217 +++++++++
src/tools/pgindent/typedefs.list | 5 +
30 files changed, 1363 insertions(+), 11 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/046_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..e29141c0538 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..ff3f309bc7c
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,226 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR ( <replaceable class="parameter">option</replaceable> [, ... ] ) ]
+ALTER ROLE <replaceable class="parameter">role_specification</replaceable> [ WITH ] <replaceable class="parameter">option</replaceable> [ ... ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ LSN '<replaceable class="parameter">lsn</replaceable>'
+ | TIMEOUT <replaceable class="parameter">timeout</replaceable>
+ | NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ as soon as <literal>NO_THROW</literal> is not specified.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> might be given as integer number of
+ milliseconds. Also it might be given as string literal with
+ integer number of milliseconds or a number with unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained from
+ the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Return values</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">success</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">not in recovery</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ <command>WAIT FOR</command> command waits till
+ <parameter>lsn</parameter> to be replayed on standby.
+ That is, after this function execution, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes-consistency, while using async replica for reads and
+ primary for writes. In that case, the <acronym>lsn</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ <command>WAIT FOR</command> command should be called on standby.
+ If a user runs <command>WAIT FOR</command> on primary, it
+ will error out as soon as <parameter>NO_THROW</parameter> is not specified.
+ However, if <command>WAIT FOR</command> is
+ called on primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use <command>WAIT FOR</command> command to wait for
+ the <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+ </programlisting>
+ </para>
+
+ <para>
+ It may also happen that target <parameter>lsn</parameter> is not reached
+ within the timeout. In that case the error is thrown.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+ </programlisting>
+ </para>
+
+ <para>
+ The same example uses <command>WAIT FOR</command> with
+ <parameter>NO_THROW</parameter> option.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+ </programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..511e5531fb8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2831,6 +2832,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2d4c346473b..a0c98d9e801 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6361,6 +6362,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..2097271b2f8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1836,6 +1837,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..c2aee2d41f0
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,435 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for the replay of the given LSN on a
+ * physical standby. The core idea is very small: every backend that
+ * wants to wait publishes the LSN it needs to the shared memory, and
+ * the startup process wakes it once that LSN has been replayed.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the startup
+ * process, timeout is reached, standby is promoted, or the postmaster
+ * dies. Then, it cleans information about itself in the shared memory.
+ *
+ * After replaying a WAL record, the startup process first performs a
+ * fast path check minWaitedLSN > replayLSN. If this check is negative,
+ * it checks waitersHeap and wakes up the backends whose awaited LSNs
+ * are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay. An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes order by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap. Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take single iteration for most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes until we find an LSN
+ * not yet replayed. Record the process numbers to wake up, but to avoid
+ * holding the lock for too long, send the wakeups only after
+ * releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs have already been
+ * replayed. Since setting latches is time-consuming, we do it outside
+ * of WaitLSNLock. This is fine because procLatch is never freed, so
+ * at worst we set the latch of the wrong process (or of no
+ * process).
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ }
+ while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed. Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed. Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not in recovery,
+ * or if the replica was promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, WAIT FOR was most likely issued on a primary.
+ * However, it's possible that the standby was promoted concurrently
+ * with the call, with the target LSN already replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..784c779a252
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,235 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "nodes/print.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+ WaitStmtParamNone,
+ WaitStmtParamTimeout,
+ WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ WaitStmtParam curParam = WaitStmtParamNone;
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_ptr(Node, option, stmt->options)
+ {
+ if (IsA(option, String))
+ {
+ String *str = castNode(String, option);
+ char *name = str_tolower(str->sval, strlen(str->sval),
+ DEFAULT_COLLATION_OID);
+
+ if (curParam != WaitStmtParamNone)
+ elog(ERROR, "Unexpected param");
+
+ if (strcmp(name, "lsn") == 0)
+ curParam = WaitStmtParamLSN;
+ else if (strcmp(name, "timeout") == 0)
+ curParam = WaitStmtParamTimeout;
+ else if (strcmp(name, "no_throw") == 0)
+ throw = false;
+ else
+ elog(ERROR, "Unexpected param");
+
+ }
+ else if (IsA(option, Integer))
+ {
+ Integer *intVal = castNode(Integer, option);
+
+ if (curParam != WaitStmtParamTimeout)
+ elog(ERROR, "Unexpected integer");
+
+ timeout = intVal->ival;
+
+ curParam = WaitStmtParamNone;
+ }
+ else if (IsA(option, A_Const))
+ {
+ A_Const *constVal = castNode(A_Const, option);
+ String *str = &constVal->val.sval;
+
+ if (curParam != WaitStmtParamLSN &&
+ curParam != WaitStmtParamTimeout)
+ elog(ERROR, "Unexpected string");
+
+ if (curParam == WaitStmtParamLSN)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(str->sval)));
+ }
+ else if (curParam == WaitStmtParamTimeout)
+ {
+ const char *hintmsg;
+ double timeout_ms;
+
+ if (!parse_real(str->sval, &timeout_ms, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid value for timeout option: \"%s\"",
+ str->sval),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0));
+ }
+ timeout = (int64) timeout_ms;
+ }
+
+ curParam = WaitStmtParamNone;
+ }
+
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. Before that, we must make sure
+ * that we don't hold a snapshot and, correspondingly, that MyProc->xmin
+ * is invalid. Otherwise, our snapshot could prevent the replay of WAL
+ * records, causing a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a utility command, not a function.
+ *
+ * First, check that there is no active snapshot. WaitStmt is exempted in
+ * PlannedStmtRequiresSnapshot(), but a snapshot might still be active
+ * when we get here. We can pop it, because PortalRunUtility() tolerates
+ * this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, the
+ * preparation should be complete.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, a procedure, or a function.")));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* No error to throw on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3c4268b271a..5ff7157a12a 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -303,7 +303,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -672,6 +672,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
+%type <node> wait_option
+%type <list> wait_option_list
+
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
@@ -786,7 +789,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1114,6 +1117,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16364,6 +16368,25 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR wait_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *) n;
+ }
+ ;
+
+wait_option_list:
+ wait_option { $$ = list_make1($1); }
+ | wait_option_list wait_option { $$ = lappend($1, $2); }
+ ;
+
+wait_option: ColLabel { $$ = (Node *) makeString($1); }
+ | NumericOnly { $$ = (Node *) $1; }
+ | Sconst { $$ = (Node *) makeStringConst($1, @1); }
+
+ ;
/*
* Aggregate decoration clauses
@@ -18023,6 +18046,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18680,6 +18704,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 00c76d05356..87411aece47 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
size = add_size(size, MemoryContextReportingShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -346,6 +348,7 @@ CreateOrAttachShmemStructs(void)
InjectionPointShmemInit();
AioShmemInit();
MemoryContextReportingShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f194e6b3dcc..c966acdbff0 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -948,6 +949,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Clean up the LSN wait state, if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 8164d0fbb4f..f4d37c0bfc2 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1195,10 +1195,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1793,7 +1794,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 25fe3d58016..d23ac3b0f0b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1065,6 +1067,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2067,6 +2075,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2122,6 +2133,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3099,6 +3113,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3697,6 +3715,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 930321905f1..164a16bc5d8 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for the replay of a particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -353,6 +354,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..15bddd9dba3
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..b44c37aa4db
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4610fc61293..d06104d40ac 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4326,4 +4326,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index a9681738146..eb9de7dae00 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index cb983766c67..31b1e9bffcf 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -54,6 +54,7 @@ tests += {
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
't/045_archive_restartpoint.pl',
+ 't/046_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/046_wait_for_lsn.pl b/src/test/recovery/t/046_wait_for_lsn.pl
new file mode 100644
index 00000000000..f9446cce3f9
--- /dev/null
+++ b/src/test/recovery/t/046_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "new data is visible on standby after WAIT FOR");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be far enough ahead that WAL records issued by a
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions, each waiting for a corresponding insertion. When the wait
+# finishes, the log_count() function logs a message if at least as many rows
+# are visible as expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e5879e00dff..be191d3e2d9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3236,7 +3236,12 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
+WaitStmtParam
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5 (Apple Git-154)
On 2025-Apr-29, Alexander Korotkov wrote:
11) WaitLSNProcInfo / WaitLSNState
Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?
Hmm, I don't remember why I moved them to xlogwait.h. OK, moved them
back to xlogwait.c.
This change made the code no longer compile, because
WaitLSNState->minWaitedLSN is used in xlogrecovery.c which no longer has
access to the field definition. A rebased version with that change
reverted is attached.
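The failure mode can be reproduced in miniature. The sketch below is only an illustration with hypothetical names (DemoWaitState, demo_should_wake), not the actual PostgreSQL headers. It shows that code can read a struct field only where the full definition is visible, which is why a struct dereferenced from another file has to stay in the shared header.

```c
#include <assert.h>
#include <stdint.h>

/*
 * What a shared header must contain if other files read the field directly:
 * the full struct definition, not just a forward declaration.
 * (DemoWaitState is a hypothetical stand-in for WaitLSNState.)
 */
typedef struct DemoWaitState
{
    uint64_t    minWaitedLSN;   /* read by the "recovery" side below */
} DemoWaitState;

/*
 * Had the header only declared "struct DemoWaitState;" (an incomplete type),
 * the field access below would be rejected by the compiler, because the
 * layout of the struct would be unknown in this translation unit.
 */
int
demo_should_wake(const DemoWaitState *state, uint64_t replayed)
{
    return replayed >= state->minWaitedLSN;
}
```

An alternative that keeps the struct private would be an accessor function exported from xlogwait.c, at the cost of a function call in the replay loop; the rebased patch simply keeps the definition in the header.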
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Thou shalt study thy libraries and strive not to reinvent them without
cause, that thy code may be short and readable and thy days pleasant
and productive. (7th Commandment for C Programmers)
Attachments:
v7-0001-Implement-WAIT-FOR-command.patch (text/x-diff; charset=utf-8)
From 1f9b5c7427239a6dc43ccad31634687a9d9fcf35 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 10 Mar 2025 12:59:38 +0200
Subject: [PATCH v7] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this aspect.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
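The "nearest LSN on the top" ordering described in the commit message depends on the comparator direction. PostgreSQL's pairing heap keeps the node that compares greatest at the root, so waitlsn_cmp in this patch returns 1 for the smaller LSN. The stand-alone model below shows the same trick under stated assumptions: it uses a plain array-based binary heap as a stand-in for the pairing heap, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HEAP_CAPACITY 64

typedef int (*demo_cmp) (uint64_t a, uint64_t b);

typedef struct DemoHeap
{
    uint64_t    items[DEMO_HEAP_CAPACITY];
    size_t      count;
    demo_cmp    cmp;            /* > 0 means "a ranks above b" */
} DemoHeap;

/* Inverted comparison: the smaller LSN ranks higher, so it ends up on top. */
int
demo_lsn_cmp(uint64_t a, uint64_t b)
{
    if (a < b)
        return 1;
    else if (a > b)
        return -1;
    return 0;
}

void
demo_heap_init(DemoHeap *heap, demo_cmp cmp)
{
    heap->count = 0;
    heap->cmp = cmp;
}

/* Insert a value and sift it up while it ranks above its parent. */
void
demo_heap_add(DemoHeap *heap, uint64_t value)
{
    size_t      i;

    assert(heap->count < DEMO_HEAP_CAPACITY);
    i = heap->count++;
    heap->items[i] = value;
    while (i > 0)
    {
        size_t      parent = (i - 1) / 2;
        uint64_t    tmp;

        if (heap->cmp(heap->items[i], heap->items[parent]) <= 0)
            break;
        tmp = heap->items[i];
        heap->items[i] = heap->items[parent];
        heap->items[parent] = tmp;
        i = parent;
    }
}

/* The top of the heap is the item that ranks highest under cmp. */
uint64_t
demo_heap_first(const DemoHeap *heap)
{
    assert(heap->count > 0);
    return heap->items[0];
}
```

With the inverted comparator, the root is always the smallest awaited LSN, which is exactly the value the startup process needs to compare against the replay position.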
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 226 ++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 387 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 235 +++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 29 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 +
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 90 ++++
src/include/commands/wait.h | 21 +
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 1 +
src/test/recovery/t/049_wait_for_lsn.pl | 217 ++++++++++
src/tools/pgindent/typedefs.list | 5 +
30 files changed, 1364 insertions(+), 11 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..e29141c0538 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to
+ synchronous replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..ff3f309bc7c
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,226 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR <replaceable class="parameter">option</replaceable> [ ... ]
+
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ LSN '<replaceable class="parameter">lsn</replaceable>'
+ | TIMEOUT <replaceable class="parameter">timeout</replaceable>
+ | NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> may be given as an integer number of
+ milliseconds. It may also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained from
+ the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Return values</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">success</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">not in recovery</replaceable></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> has been replayed on the standby.
+ That is, after this command completes successfully, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency, while using an async replica for reads and
+ the primary for writes. In that case, the <acronym>lsn</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+ </programlisting>
+ </para>
+
+ <para>
+ It may also happen that the target <parameter>lsn</parameter> is not reached
+ within the timeout. In that case an error is thrown.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+ </programlisting>
+ </para>
+
+ <para>
+ The same example with the <parameter>NO_THROW</parameter> option
+ returns a status instead of raising an error.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+ </programlisting>
+ </para>
+ </refsect1>
+</refentry>
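On the client or pooler side, deciding whether a standby has caught up comes down to comparing two LSNs. A pg_lsn is printed as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position). The helper below is a minimal sketch with hypothetical names (demo_parse_lsn, demo_lsn_reached); it is not part of the patch or of libpq, and its parsing is more lenient than the server's.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse the textual form "hi/lo" (two hexadecimal numbers holding the high
 * and low 32 bits) into a 64-bit value. Returns 1 on success and 0 on a
 * malformed string. sscanf is lenient here (it tolerates leading blanks,
 * for instance), which is acceptable for a sketch.
 */
int
demo_parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;
    char        extra;

    /* Exactly two fields must match; a third means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &extra) != 2)
        return 0;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 1;
}

/*
 * Has replay reached the target? Mirrors the "targetLSN <= currentLSN"
 * comparison used by the server. Returns -1 if either string is malformed.
 */
int
demo_lsn_reached(const char *target, const char *replayed)
{
    uint64_t    t;
    uint64_t    r;

    if (!demo_parse_lsn(target, &t) || !demo_parse_lsn(replayed, &r))
        return -1;
    return t <= r;
}
```

For the timeout example in the documentation above, the target 0/306EE20 is not reached at replay position 0/306EA60, so demo_lsn_reached reports 0 for that pair.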
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9a4de1616bc..d03a9e15c99 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6361,6 +6362,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f23ec8969c2..408454bb8b9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory pairing heap and set latches to notify
+ * the waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..64049f8e870
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,387 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for the replay of the given LSN on a
+ * physical standby. The core idea is very small: every backend that
+ * wants to wait publishes the LSN it needs to the shared memory, and
+ * the startup process wakes it once that LSN has been replayed.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the startup
+ * process, the timeout is reached, the standby is promoted, or the
+ * postmaster dies. Then, it removes its information from shared memory.
+ *
+ * After replaying a WAL record, the startup process first performs a
+ * fast path check minWaitedLSN > replayLSN. If this check is negative,
+ * it checks waitersHeap and wakes up the backends whose awaited LSNs
+ * are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap. Waiting processes
+ * are ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to finish in a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not
+ * yet replayed. Record the process numbers to wake up, but to avoid
+ * holding the lock for too long, send the wakeups only after
+ * releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < MaxBackends);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed.
+ * As this is a time-consuming operation, we do it outside of
+ * WaitLSNLock. This is actually fine because procLatch isn't ever
+ * freed, so we just can potentially set the wrong process' (or no
+ * process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ }
+ while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed. Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * replayed. Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this command was mistakenly run on primary.
+ * However, it's possible that standby was promoted concurrently with
+ * the command, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
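The interplay of the cached minimum and the chunked wake-up in xlogwait.c can be modelled in a few lines. The sketch below is a simplified stand-in, not the patch's code: it replaces the pairing heap with a sorted array, omits the LWLock and latches, does not model the "wake everyone" case of InvalidXLogRecPtr, and uses hypothetical demo_* names. It only shows that the fast path skips the queue entirely and that more eligible waiters than the chunk size lead to extra passes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CHUNK 16           /* plays the role of the static array size */
#define DEMO_MAX_WAITERS 128

typedef struct DemoQueue
{
    uint64_t    lsn[DEMO_MAX_WAITERS];  /* awaited LSNs, sorted ascending */
    int         count;
    uint64_t    minWaited;      /* cached smallest LSN, UINT64_MAX if empty */
} DemoQueue;

void
demo_queue_init(DemoQueue *q)
{
    q->count = 0;
    q->minWaited = UINT64_MAX;
}

/* Insert in sorted position and refresh the cached minimum. */
void
demo_queue_add(DemoQueue *q, uint64_t lsn)
{
    int         i = q->count;

    assert(q->count < DEMO_MAX_WAITERS);
    while (i > 0 && q->lsn[i - 1] > lsn)
    {
        q->lsn[i] = q->lsn[i - 1];
        i--;
    }
    q->lsn[i] = lsn;
    q->count++;
    q->minWaited = q->lsn[0];
}

/*
 * Wake every waiter whose LSN is <= replayed. Returns the number of waiters
 * woken; *rounds receives the number of passes over the queue (zero when
 * the fast path applies).
 */
int
demo_wakeup(DemoQueue *q, uint64_t replayed, int *rounds)
{
    int         total = 0;
    int         n;

    *rounds = 0;

    /* Fast path: nobody is waiting for an LSN this low. */
    if (replayed < q->minWaited)
        return 0;

    do
    {
        n = 0;
        /* "Lock held": collect at most DEMO_CHUNK eligible waiters. */
        while (n < q->count && n < DEMO_CHUNK && q->lsn[n] <= replayed)
            n++;
        memmove(q->lsn, q->lsn + n,
                (size_t) (q->count - n) * sizeof(uint64_t));
        q->count -= n;
        q->minWaited = (q->count > 0) ? q->lsn[0] : UINT64_MAX;
        /* "Lock released": the real code sets the latches at this point. */
        total += n;
        (*rounds)++;
    } while (n == DEMO_CHUNK);  /* a full chunk means there may be more */

    return total;
}
```

Keeping the chunk small bounds the stack usage and the time the lock is held, while the loop guarantees that a burst of waiters larger than the chunk is still fully processed.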
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..784c779a252
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,235 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "nodes/print.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+ WaitStmtParamNone,
+ WaitStmtParamTimeout,
+ WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ WaitStmtParam curParam = WaitStmtParamNone;
+
+ /*
+ * Process the list of parameters.
+ */
+ foreach_ptr(Node, option, stmt->options)
+ {
+ if (IsA(option, String))
+ {
+ String *str = castNode(String, option);
+ char *name = str_tolower(str->sval, strlen(str->sval),
+ DEFAULT_COLLATION_OID);
+
+ if (curParam != WaitStmtParamNone)
+ elog(ERROR, "Unexpected param");
+
+ if (strcmp(name, "lsn") == 0)
+ curParam = WaitStmtParamLSN;
+ else if (strcmp(name, "timeout") == 0)
+ curParam = WaitStmtParamTimeout;
+ else if (strcmp(name, "no_throw") == 0)
+ throw = false;
+ else
+ elog(ERROR, "Unexpected param");
+
+ }
+ else if (IsA(option, Integer))
+ {
+ Integer *intVal = castNode(Integer, option);
+
+ if (curParam != WaitStmtParamTimeout)
+ elog(ERROR, "Unexpected integer");
+
+ timeout = intVal->ival;
+
+ curParam = WaitStmtParamNone;
+ }
+ else if (IsA(option, A_Const))
+ {
+ A_Const *constVal = castNode(A_Const, option);
+ String *str = &constVal->val.sval;
+
+ if (curParam != WaitStmtParamLSN &&
+ curParam != WaitStmtParamTimeout)
+ elog(ERROR, "Unexpected string");
+
+ if (curParam == WaitStmtParamLSN)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(str->sval)));
+ }
+ else if (curParam == WaitStmtParamTimeout)
+ {
+ const char *hintmsg;
+ double result;
+
+ if (!parse_real(str->sval, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid value for timeout option: \"%s\"",
+ str->sval),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0));
+ }
+ timeout = (int64) result;
+ }
+
+ curParam = WaitStmtParamNone;
+ }
+
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why
+ * pg_wal_replay_wait() is a procedure, not a function.
+ *
+ * First, we should check that there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, we should
+ * be done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "RESULT STATUS",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index db43034b9db..164fd23017c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
+%type <node> wait_option
+%type <list> wait_option_list
+
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16402,6 +16406,25 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR wait_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
+
+wait_option_list:
+ wait_option { $$ = list_make1($1); }
+ | wait_option_list wait_option { $$ = lappend($1, $2); }
+ ;
+
+wait_option: ColLabel { $$ = (Node *) makeString($1); }
+ | NumericOnly { $$ = (Node *) $1; }
+ | Sconst { $$ = (Node *) makeStringConst($1, @1); }
+
+ ;
/*
* Aggregate decoration clauses
@@ -18050,6 +18073,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18707,6 +18731,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..a1cb9f2473e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 4f4191b0ea6..880fa7807eb 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..58ae9d7f350 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for a replay of the particular WAL position on the physical standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -352,6 +353,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..8d10ece6e8e
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay. An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..b44c37aa4db
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..fa5fb1a8897 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 208d2e3a8ed..49060877808 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..3b66af602f0 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -57,6 +57,7 @@ tests += {
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..f9446cce3f9
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,217 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well ahead, so that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. Once the wait
+# finishes, a stored function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby will be promoted at least at the primary insert LSN
+# we have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e6f2e93b2d6..037cc85030f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3256,7 +3256,12 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
+WaitStmtParam
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5
Hi,
Thanks for working on this.
I’ve just come across this thread and haven’t had a chance to dig into
the patch yet, but I’m keen to review it soon. In the meantime, I have
a quick question: is WAIT FOR LSN intended mainly for user-defined
functions, or can internal code invoke it as well?
During a recent performance run [1] I noticed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
also hints that we could replace this check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by walsender:
/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/
Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when the wait is long. I built
a POC patch that swaps polling for CVs, but a single global CV (or
even separate “flush” and “replay” CVs) isn’t ideal:
The wake-up routines don’t know which LSN each waiter cares about, so
they’d have to broadcast on every flush/replay. Caching the minimum
outstanding LSN could reduce spuriously awakened waiters, yet wouldn’t
eliminate them—multiple backends might wait for different LSNs
simultaneously. A more precise solution would require a request queue
that maps waiters to target LSNs and issues targeted wake-ups, adding
complexity.
Walsender accepts the potential broadcast overhead by using two cvs
for different waiters, so it might be acceptable for
read_local_xlog_page_guts() as well. However, if WAIT FOR REPLY
becomes available to backend code, we might leverage it to eliminate
the polling for waiting replay in read_local_xlog_page_guts() without
introducing a bespoke dispatcher. I’d appreciate any thoughts on
whether that use case is in scope.
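As a rough, stand-alone illustration of the request-queue idea described above, the sketch below keeps waiters ordered by target LSN, caches the minimum, and wakes only the waiters whose target has been reached. It is not PostgreSQL code: the names add_waiter(), wake_waiters(), and min_waited_lsn() are hypothetical, and a fixed-size sorted array with file-scope state stands in for whatever shared-memory structure a real patch would use.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define NO_WAITERS  UINT64_MAX

/* Hypothetical stand-in for shared state: target LSNs sorted ascending. */
static uint64_t waiter_lsn[MAX_WAITERS];
static int      waiter_count = 0;

/* Smallest awaited LSN, or NO_WAITERS when nobody is waiting. */
uint64_t
min_waited_lsn(void)
{
    return waiter_count > 0 ? waiter_lsn[0] : NO_WAITERS;
}

/* Register a waiter; returns 0 on success, -1 if the queue is full. */
int
add_waiter(uint64_t target_lsn)
{
    int pos;

    if (waiter_count >= MAX_WAITERS)
        return -1;

    /* Shift larger targets right so the array stays sorted. */
    for (pos = waiter_count; pos > 0 && waiter_lsn[pos - 1] > target_lsn; pos--)
        waiter_lsn[pos] = waiter_lsn[pos - 1];
    waiter_lsn[pos] = target_lsn;
    waiter_count++;
    return 0;
}

/* Called after a flush/replay; returns the number of waiters woken. */
int
wake_waiters(uint64_t current_lsn)
{
    int woken = 0;
    int i;

    /* Fast path: no waiter has a target this low, so nothing to do. */
    if (current_lsn < min_waited_lsn())
        return 0;

    /* Count the waiters whose target is reached (a real one would be signalled here). */
    while (woken < waiter_count && waiter_lsn[woken] <= current_lsn)
        woken++;
    assert(woken <= waiter_count);

    /* Drop the woken entries from the front of the queue. */
    for (i = woken; i < waiter_count; i++)
        waiter_lsn[i - woken] = waiter_lsn[i];
    waiter_count -= woken;
    return woken;
}
```

With this shape, a flush or replay that does not reach the cached minimum costs one comparison, and no waiter is woken before its own target is reached.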
Best,
Xuneng
[1]: /messages/by-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com
Hello, Álvaro!
On Wed, Aug 6, 2025 at 6:01 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Apr-29, Alexander Korotkov wrote:
11) WaitLSNProcInfo / WaitLSNState
Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?
Hmm, I don't remember why I moved them to xlogwait.h. OK, moved them
back to xlogwait.c.
This change made the code no longer compile, because
WaitLSNState->minWaitedLSN is used in xlogrecovery.c which no longer has
access to the field definition. A rebased version with that change
reverted is attached.
Thank you! The rebased version looks correct to me.
------
Regards,
Alexander Korotkov
Supabase
Hi, Xuneng Zhou!
On Thu, Aug 7, 2025 at 6:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Thanks for working on this.
I’ve just come across this thread and haven’t had a chance to dig into
the patch yet, but I’m keen to review it soon.
Great. Thank you for your attention to this patch. I appreciate your
intention to review it.
In the meantime, I have
a quick question: is WAIT FOR REPLAY intended mainly for user-defined
functions, or can internal code invoke it as well?
Currently, WaitForLSNReplay() is assumed to only be called from
backend, as corresponding shmem is allocated only per-backend. But
there is absolutely no problem to tweak the patch to allocate shmem
for every Postgres process. This would enable calling
WaitForLSNReplay() wherever it is needed. There is also no problem to
extend this approach to support other kinds of LSNs, not just replay
LSN.
During a recent performance run [1] I noticed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
also hints that we could replace this check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by walsender:
/*
* Loop waiting for xlog to be available if necessary
*
* TODO: The walsender has its own version of this function, which uses a
* condition variable to wake up whenever WAL is flushed. We could use the
* same infrastructure here, instead of the check/sleep/repeat style of
* loop.
*/
Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when the wait is long. I built
a POC patch that swaps polling for CVs, but a single global CV (or
even separate “flush” and “replay” CVs) isn’t ideal:
The wake-up routines don’t know which LSN each waiter cares about, so
they’d have to broadcast on every flush/replay. Caching the minimum
outstanding LSN could reduce spuriously awakened waiters, yet wouldn’t
eliminate them: multiple backends might wait for different LSNs
simultaneously. A more precise solution would require a request queue
that maps waiters to target LSNs and issues targeted wake-ups, adding
complexity.
Walsender accepts the potential broadcast overhead by using two CVs
for different waiters, so it might be acceptable for
read_local_xlog_page_guts() as well. However, if WAIT FOR REPLAY
becomes available to backend code, we might leverage it to eliminate
the polling for waiting replay in read_local_xlog_page_guts() without
introducing a bespoke dispatcher. I’d appreciate any thoughts on
whether that use case is in scope.
This looks like a great new use-case for facilities developed in this
patch! I'll remove the restriction to use WaitForLSNReplay() only in
backend. I think you can write a patch with additional pairing heap
for flush LSN and include that into thread about
read_local_xlog_page_guts() optimization. Let me know if you need any
assistance.
------
Regards,
Alexander Korotkov
Supabase
Hi Alexander!
In the meantime, I have
a quick question: is WAIT FOR REPLAY intended mainly for user-defined
functions, or can internal code invoke it as well?
Currently, WaitForLSNReplay() is assumed to only be called from
backend, as corresponding shmem is allocated only per-backend. But
there is absolutely no problem to tweak the patch to allocate shmem
for every Postgres process. This would enable calling
WaitForLSNReplay() wherever it is needed. There is also no problem to
extend this approach to support other kinds of LSNs, not just replay
LSN.
Thanks for extending the functionality of the Wait For Replay patch!
This looks like a great new use-case for facilities developed in this
patch! I'll remove the restriction to use WaitForLSNReplay() only in
backend. I think you can write a patch with additional pairing heap
for flush LSN and include that into thread about
read_local_xlog_page_guts() optimization. Let me know if you need any
assistance.
This could be a more elegant approach which would solve the polling
issue well. I'll prepare a follow-up patch for it.
Best,
Xuneng
Hi,
On Thu, Aug 7, 2025 at 6:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Thanks for working on this.
I’ve just come across this thread and haven’t had a chance to dig into
the patch yet, but I’m keen to review it soon.
Great. Thank you for your attention to this patch. I appreciate your
intention to review it.
I did a quick pass over v7. There are a few thoughts to share—mostly
around documentation, build, and tests, plus some minor nits. The core
logic looks solid to me. I’ll take a deeper look as I work on a
follow‑up patch to add waiting for flush LSNs. And the patch seems to
need a rebase; it can't be applied to HEAD cleanly for now.
Build
1) Consider adding a comma in `src/test/recovery/meson.build` after
`'t/048_vacuum_horizon_floor.pl'` so the list remains valid.
Core code
2) It may be safer for `WaitLSNWakeup()` to assert against the stack array size.
Perhaps `Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);`
rather than `MaxBackends`.
For option parsing UX in `wait.c`, we might prefer:
3) Using `ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR),
errmsg(...)))` instead of `elog(ERROR, ...)` for consistency and
translatability.
4) Explicitly rejecting duplicate `LSN`/`TIMEOUT` options with a syntax error.
5) The result column label could align better with other utility
outputs if shortened to `status` (lowercase, no space).
6) After `parse_real()`, it could help to validate/clamp the timeout
to avoid overflow when converting to `int64` and when passing a `long`
to `WaitLatch()`.
7) If `nodes/print.h` in `src/backend/commands/wait.c` isn’t used, we
might drop the include.
8) A couple of comment nits: “do it this outside” → “do this outside”.
Tests
9) We might consider adding cases for:
- Negative `TIMEOUT` (to exercise the error path).
- Syntax errors (unknown option; duplicate `LSN`/`TIMEOUT`; missing `LSN`).
Documentation
`doc/src/sgml/ref/wait_for.sgml`
10) The index term could be updated to `<primary>WAIT FOR</primary>`.
11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds |
'duration-with-units'> ] [ NO_THROW ]
12) The purpose line might be smoother as “wait for a target LSN to be
replayed, optionally with a timeout”.
13) Return values might use `<literal>` for `success`, `timeout`, `not
in recovery`.
14) Consistently calling this a “command” (rather than
function/procedure) could reduce confusion.
15) The example text might read more cleanly as “If the target LSN is
not reached before the timeout …”.
`doc/src/sgml/high-availability.sgml`
16) The sentence could read “However, it is possible to address this
without switching to synchronous replication.”
`src/backend/utils/activity/wait_event_names.txt`
17) The description for `WAIT_FOR_WAL_REPLAY` might be clearer as
“Waiting for WAL replay to reach a target LSN on a standby.”
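To make item 6 above more concrete, here is a minimal stand-alone sketch of the kind of validation and clamping meant there. It is only an illustration under assumptions: clamp_timeout_ms() is a hypothetical helper, not something in the patch, and it takes the double produced by a parser such as `parse_real()` and returns a millisecond count that fits both an `int64` and a `long` (as `WaitLatch()` expects), or -1 when the value should be rejected.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a parsed timeout (milliseconds, as a double)
 * into an integer that is safe to store in an int64 and to pass as a long.
 * Returns -1 for NaN or negative input, which the caller would report as
 * an error.
 */
int64_t
clamp_timeout_ms(double value)
{
    /* The result must be representable in both types. */
    assert(sizeof(long) <= sizeof(int64_t));

    /* NaN is the only value that compares unequal to itself. */
    if (value != value || value < 0.0)
        return -1;

    /*
     * (double) LONG_MAX may round up to a value that no longer fits in a
     * long, so clamp with >= before casting.
     */
    if (value >= (double) LONG_MAX)
        return (int64_t) LONG_MAX;

    return (int64_t) value;     /* truncates any fractional milliseconds */
}
```

Whether an out-of-range timeout should be clamped like this or rejected with an error is a design choice for the patch author; the sketch only shows that the check has to happen before the cast.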
Best,
Xuneng
Hi all,
I did a rebase for the patch to v8 and incorporated a few changes:
1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.
Started a new thread [1] and attached a patch addressing the polling
issue in the function
read_local_xlog_page_guts built on the infra of patch v8.
[1]: /messages/by-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com
Feedback welcome.
Best,
Xuneng
Attachments:
v8-0001-Implement-WAIT-FOR-command.patch (application/x-patch)
From 4487999a6c393e42619ae77e5e7f14c6cac9f235 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Wed, 27 Aug 2025 09:12:38 +0800
Subject: [PATCH v8] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
---
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 219 ++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 388 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 284 +++++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 29 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 +
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 90 ++++
src/include/commands/wait.h | 21 +
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 269 ++++++++++++
src/tools/pgindent/typedefs.list | 5 +-
30 files changed, 1457 insertions(+), 15 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..ecaff5d5deb 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..433901baa82
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,219 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed within the specified timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ TIMEOUT <replaceable class="parameter">timeout</replaceable> ] [ NO_THROW ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal with an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained
+ from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Return values</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> has been replayed on the standby.
+ That is, after this command completes, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an async replica for reads and the
+ primary for writes. In that case, the <acronym>lsn</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use <command>WAIT FOR</command> command to wait for
+ the <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+ </programlisting>
+ </para>
+
+ <para>
+ It may also happen that target <parameter>lsn</parameter> is not reached
+ within the timeout. In that case an error is thrown.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+ </programlisting>
+ </para>
+
+ <para>
+ The same example uses <command>WAIT FOR</command> with
+ <parameter>NO_THROW</parameter> option.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+ </programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..f5257dfa689 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6205,6 +6206,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f23ec8969c2..408454bb8b9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..2cc9312e836
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for the replay of the given LSN on a
+ * physical standby. The core idea is very small: every backend that
+ * wants to wait publishes the LSN it needs to the shared memory, and
+ * the startup process wakes it once that LSN has been replayed.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the startup
+ * process, the timeout is reached, the standby is promoted, or the
+ * postmaster dies. Then, it removes its information from shared memory.
+ *
+ * After replaying a WAL record, the startup process first performs a
+ * fast path check minWaitedLSN > replayLSN. If this check is negative,
+ * it checks waitersHeap and wakes up the backends whose awaited LSNs
+ * are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap. Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take single iteration for most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not
+ * yet replayed. Record the process numbers to wake up, but to avoid
+ * holding the lock for too long, send the wakeups only after
+ * releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed.
+ * Since this is a time-consuming operation, we do it outside of
+ * WaitLSNLock. This is actually fine because procLatch isn't ever
+ * freed, so we just can potentially set the wrong process' (or no
+ * process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ }
+ while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed. Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * replayed. Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery has ended, but recheck whether the target LSN was
+ * already replayed. See the comment on deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already have been deleted by the startup process. The 'inHeap' flag
+ * protects us from double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
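
The chunked wake-up loop in WaitLSNWakeup() above can be illustrated outside of PostgreSQL. The following is only a minimal sketch under stated assumptions: a sorted array stands in for the pairing heap, no real lock is taken (comments mark where WaitLSNLock would be held), and "waking" a process just records its id. WaiterQueue, BATCH_SIZE and wake_up_to() are hypothetical names that are not part of the patch, and the "invalid LSN wakes everyone" case is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 4			/* stand-in for WAKEUP_PROC_STATIC_ARRAY_SIZE */
#define MAX_WAITERS 64

/* Toy waiter queue kept sorted by LSN, least first (hypothetical type). */
typedef struct
{
	uint64_t	lsn[MAX_WAITERS];
	int			id[MAX_WAITERS];
	size_t		head;			/* index of the first waiter still queued */
	size_t		count;			/* number of slots in use */
} WaiterQueue;

/*
 * Wake every waiter whose LSN is <= current_lsn, at most BATCH_SIZE per
 * pass. Woken ids are appended to 'woken', which must be large enough.
 * Returns the total number of waiters woken.
 */
static size_t
wake_up_to(WaiterQueue *q, uint64_t current_lsn, int *woken)
{
	size_t		total = 0;
	size_t		n;

	do
	{
		int			batch[BATCH_SIZE];

		n = 0;

		/* The lock would be acquired here. */
		while (q->head < q->count && q->lsn[q->head] <= current_lsn)
		{
			assert(n < BATCH_SIZE);
			batch[n++] = q->id[q->head++];
			if (n == BATCH_SIZE)
				break;
		}
		/* The lock would be released here. */

		/* Notify outside of the lock, as the patch does with SetLatch(). */
		for (size_t i = 0; i < n; i++)
			woken[total++] = batch[i];

		/* A full batch means more waiters may be eligible: go again. */
	} while (n == BATCH_SIZE);

	return total;
}
```

With six waiters at LSNs 10..60 and a replay position of 50, the first pass fills a batch of four, the second pass picks up the fifth waiter, and the loop then stops because the batch was not full. The fixed-size batch keeps the stack usage bounded no matter how many backends are waiting.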
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..cfa42ad6f6c
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "nodes/print.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+ WaitStmtParamNone,
+ WaitStmtParamTimeout,
+ WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt * stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ WaitStmtParam curParam = WaitStmtParamNone;
+
+ /*
+ * Process the list of parameters.
+ */
+ bool o_lsn = false;
+ bool o_timeout = false;
+ bool o_no_throw = false;
+
+ foreach_ptr(Node, option, stmt->options)
+ {
+ if (IsA(option, String))
+ {
+ String *str = castNode(String, option);
+ char *name = str_tolower(str->sval, strlen(str->sval),
+ DEFAULT_COLLATION_OID);
+
+ if (curParam != WaitStmtParamNone)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected parameter after \"%s\"", name)));
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ if (o_lsn)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "lsn")));
+ o_lsn = true;
+ curParam = WaitStmtParamLSN;
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ if (o_timeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "timeout")));
+ o_timeout = true;
+ curParam = WaitStmtParamTimeout;
+ }
+ else if (strcmp(name, "no_throw") == 0)
+ {
+ if (o_no_throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "no_throw")));
+ o_no_throw = true;
+ throw = false;
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized parameter \"%s\"", name)));
+
+ }
+ else if (IsA(option, Integer))
+ {
+ Integer *intVal = castNode(Integer, option);
+
+ if (curParam != WaitStmtParamTimeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected integer value")));
+
+ timeout = intVal->ival;
+
+ curParam = WaitStmtParamNone;
+ }
+ else if (IsA(option, A_Const))
+ {
+ A_Const *constVal = castNode(A_Const, option);
+ String *str = &constVal->val.sval;
+
+ if (curParam != WaitStmtParamLSN &&
+ curParam != WaitStmtParamTimeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected string value")));
+
+ if (curParam == WaitStmtParamLSN)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(str->sval)));
+ }
+ else if (curParam == WaitStmtParamTimeout)
+ {
+ const char *hintmsg;
+ double result;
+
+ if (!parse_real(str->sval, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid value for timeout option: \"%s\"",
+ str->sval),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0));
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we don't fail
+ * on just-out-of-range values that would round into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range for type bigint")));
+
+ timeout = (int64) result;
+ }
+
+ curParam = WaitStmtParamNone;
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected parameter type")));
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first make sure
+ * that we don't hold a snapshot and, correspondingly, that our
+ * MyProc->xmin is invalid. Otherwise, our snapshot could prevent the
+ * replay of WAL records, implying a kind of self-deadlock. This is the
+ * reason why WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. Then we should be
+ * done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt * stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
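
The option handling in ExecWaitStmt() above is a small state machine over a flat list: a keyword selects which value is expected next, and the following constant fills it in. The sketch below shows that shape in isolation. It is not the patch's code: Tok, WaitOpts and parse_wait_opts() are hypothetical, keywords are assumed to be lower-cased already, string timeouts are read as plain integers in milliseconds (the patch uses parse_real() with units), and errors are returned as messages rather than raised with ereport().

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef enum { TOK_WORD, TOK_INT, TOK_STR } TokKind;	/* hypothetical */

typedef struct
{
	TokKind		kind;
	const char *text;			/* TOK_WORD (lower-cased) or TOK_STR */
	long		ival;			/* TOK_INT */
} Tok;

typedef struct
{
	const char *lsn;			/* raw LSN text, NULL if not given */
	long		timeout_ms;
	bool		no_throw;
} WaitOpts;

typedef enum { EXPECT_NONE, EXPECT_LSN, EXPECT_TIMEOUT } Expect;

/* Returns NULL on success, or a static error message on failure. */
static const char *
parse_wait_opts(const Tok *toks, size_t ntoks, WaitOpts *out)
{
	Expect		expect = EXPECT_NONE;
	bool		seen_lsn = false;
	bool		seen_timeout = false;

	out->lsn = NULL;
	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < ntoks; i++)
	{
		const Tok  *t = &toks[i];

		if (t->kind == TOK_WORD)
		{
			/* A keyword may not appear where a value is expected. */
			if (expect != EXPECT_NONE)
				return "unexpected parameter";

			if (strcmp(t->text, "lsn") == 0)
			{
				if (seen_lsn)
					return "parameter specified more than once";
				seen_lsn = true;
				expect = EXPECT_LSN;
			}
			else if (strcmp(t->text, "timeout") == 0)
			{
				if (seen_timeout)
					return "parameter specified more than once";
				seen_timeout = true;
				expect = EXPECT_TIMEOUT;
			}
			else if (strcmp(t->text, "no_throw") == 0)
			{
				if (out->no_throw)
					return "parameter specified more than once";
				out->no_throw = true;
			}
			else
				return "unrecognized parameter";
		}
		else if (t->kind == TOK_INT)
		{
			if (expect != EXPECT_TIMEOUT)
				return "unexpected integer value";
			out->timeout_ms = t->ival;
			expect = EXPECT_NONE;
		}
		else
		{
			if (expect == EXPECT_LSN)
				out->lsn = t->text;
			else if (expect == EXPECT_TIMEOUT)
				out->timeout_ms = strtol(t->text, NULL, 10);
			else
				return "unexpected string value";
			expect = EXPECT_NONE;
		}
	}

	if (out->lsn == NULL)
		return "\"lsn\" must be specified";
	return NULL;
}
```

Feeding it the tokens for LSN '0/3000000' TIMEOUT 1000 NO_THROW yields a filled-in WaitOpts and a NULL result, while a repeated LSN keyword or a missing LSN yields an error message. Like the code it mirrors, the sketch does not complain about a keyword left dangling at the end of the list.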
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index db43034b9db..164fd23017c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
+%type <node> wait_option
+%type <list> wait_option_list
+
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16402,6 +16406,25 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR wait_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
+
+wait_option_list:
+ wait_option { $$ = list_make1($1); }
+ | wait_option_list wait_option { $$ = lappend($1, $2); }
+ ;
+
+wait_option: ColLabel { $$ = (Node *) makeString($1); }
+ | NumericOnly { $$ = (Node *) $1; }
+ | Sconst { $$ = (Node *) makeStringConst($1, @1); }
+
+ ;
/*
* Aggregate decoration clauses
@@ -18050,6 +18073,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18707,6 +18731,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9ef0fbfe32..a1cb9f2473e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 4f4191b0ea6..880fa7807eb 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 5427da5bc1b..ee20a48b2c5 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -352,6 +353,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..72be2f76293
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay. An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /* A pairing heap node for participation in waitLSNState->waitersHeap */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN
+ * is on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState * waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
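
The comment on minWaitedLSN above describes a lock-less fast path: after replaying a record, the replay side only needs to take WaitLSNLock if the smallest waited-for LSN has been reached. A rough illustration with C11 atomics follows. It is a sketch, not the patch's code: the patch uses pg_atomic_uint64, and the names below as well as the use of UINT64_MAX as the "nobody is waiting" value are assumptions of this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Sentinel meaning "no waiters"; an assumption of this sketch. */
#define NO_WAITERS UINT64_MAX

static _Atomic uint64_t min_waited_lsn = NO_WAITERS;

/*
 * Publish the smallest LSN any process waits for. In the design described
 * above, this would be updated while holding the lock that protects the
 * waiters heap.
 */
static void
set_min_waited_lsn(uint64_t lsn)
{
	atomic_store(&min_waited_lsn, lsn);
}

/*
 * Replay-side fast path: a single atomic read, no lock. Returns true only
 * if at least one waiter's target LSN has been reached, in which case the
 * caller would go on to the slower, locked wake-up routine.
 */
static bool
need_wakeup(uint64_t replayed_lsn)
{
	return atomic_load(&min_waited_lsn) <= replayed_lsn;
}
```

A replay loop would call need_wakeup() after every record and invoke the wake-up routine only on a true result, so the common case of no waiters costs one atomic load per record.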
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ef9e5f0c0be
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt * stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt * stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..b8d3fc009fb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..da1cfeb1c52
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,269 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance. So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT -1000;",
+ stderr => \$stderr);
+ok($stderr =~ /timeout.*must not be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' UNKNOWN_PARAM;",
+ stderr => \$stderr);
+ok($stderr =~ /unrecognized parameter/,
+ "get error for unknown parameter");
+
+# Test duplicate LSN parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' LSN '${test_lsn}';",
+ stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate LSN parameter");
+
+# Test duplicate TIMEOUT parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT 1000 TIMEOUT 2000;",
+ stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' NO_THROW NO_THROW;",
+ stderr => \$stderr);
+ok($stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;",
+ stderr => \$stderr);
+ok($stderr =~ /lsn.*must be specified/,
+ "get error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql('postgres', "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql('postgres', "WAIT FOR LSN '${test_lsn}' TIMEOUT 'invalid';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid value for timeout option/,
+ "get error for invalid timeout format");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, the log_count() function logs whether as many rows are visible
+# as there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..f303f04d007 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -615,7 +615,6 @@ DatumTupleFields
DbInfo
DbInfoArr
DbLocaleInfo
-DbOidName
DeClonePtrType
DeadLockState
DeallocateStmt
@@ -2283,7 +2282,6 @@ PlannerParamItem
Point
Pointer
PolicyInfo
-PolyNumAggState
Pool
PopulateArrayContext
PopulateArrayState
@@ -4129,6 +4127,7 @@ tar_file
td_entry
teSection
temp_tablespaces_extra
+test128
test_re_flags
test_regex_ctx
test_shm_mq_header
@@ -4198,6 +4197,7 @@ varatt_expanded
varattrib_1b
varattrib_1b_e
varattrib_4b
+vartag_external
vbits
verifier_context
walrcv_alter_slot_fn
@@ -4326,7 +4326,6 @@ xmlGenericErrorFunc
xmlNodePtr
xmlNodeSetPtr
xmlParserCtxtPtr
-xmlParserErrors
xmlParserInputPtr
xmlSaveCtxt
xmlSaveCtxtPtr
--
2.49.0
Hi, Xuneng!
On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
I did a rebase for the patch to v8 and incorporated a few changes:
1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.
Started a new thread [1] and attached a patch addressing the polling
issue in the function
read_local_xlog_page_guts built on the infra of patch v8.
[1] /messages/by-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com
Feedback welcome.
Thank you for reviewing and revising this patch.
I see you've integrated most of your points expressed in [1]. I went
through them and I've integrated the rest of them, except this one.
11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]
I didn't find examples of how we do similar things in other places
of the docs. This is why I decided to leave this place as it currently
is.
Also, I found some mess in typedefs.list. I've reverted the
changes to typedefs.list and re-indented the sources.
I'd like to ask your opinion of the way this feature is implemented in
terms of grammar: generic parsing implemented in gram.y and the rest
is done in wait.c. I think this approach should minimize additional
keywords and states for parsing code. This comes at the price of more
complex code in wait.c, but I think this is a fair price.
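To make that trade-off concrete, here is a minimal standalone sketch of the
kind of post-parse validation this approach moves out of the grammar. It is
not the code from wait.c: WaitOption, WaitParams, WaitOptStatus and
validate_wait_options() are hypothetical names used only for illustration,
and the real patch presumably works on the parser's option nodes and reports
problems with ereport() rather than status codes. The sketch also assumes
option names arrive lower-cased and handles only a plain integer number of
milliseconds for the timeout, not values with units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one name/value pair from a generic option list. */
typedef struct WaitOption
{
	const char *name;	/* lower-cased keyword: "lsn", "timeout", "no_throw" */
	const char *value;	/* NULL for flag-style options such as no_throw */
} WaitOption;

/* Hypothetical status codes; a real backend would raise errors instead. */
typedef enum WaitOptStatus
{
	WAIT_OPT_OK,
	WAIT_OPT_UNRECOGNIZED,
	WAIT_OPT_DUPLICATE,
	WAIT_OPT_MISSING_VALUE,
	WAIT_OPT_MISSING_LSN,
	WAIT_OPT_NEGATIVE_TIMEOUT
} WaitOptStatus;

/* Validated parameters of a single WAIT FOR invocation. */
typedef struct WaitParams
{
	const char *lsn;		/* textual LSN, not parsed further here */
	long		timeout_ms; /* 0 means wait indefinitely */
	bool		no_throw;
} WaitParams;

/*
 * Walk the option list once, rejecting unknown names, duplicates, missing
 * values and negative timeouts, then require that an LSN was given.
 */
WaitOptStatus
validate_wait_options(const WaitOption *opts, size_t nopts, WaitParams *out)
{
	bool		have_lsn = false;
	bool		have_timeout = false;

	assert(opts != NULL || nopts == 0);
	assert(out != NULL);

	out->lsn = NULL;
	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < nopts; i++)
	{
		const WaitOption *opt = &opts[i];

		if (strcmp(opt->name, "lsn") == 0)
		{
			if (have_lsn)
				return WAIT_OPT_DUPLICATE;
			if (opt->value == NULL)
				return WAIT_OPT_MISSING_VALUE;
			out->lsn = opt->value;
			have_lsn = true;
		}
		else if (strcmp(opt->name, "timeout") == 0)
		{
			if (have_timeout)
				return WAIT_OPT_DUPLICATE;
			if (opt->value == NULL)
				return WAIT_OPT_MISSING_VALUE;
			out->timeout_ms = strtol(opt->value, NULL, 10);
			if (out->timeout_ms < 0)
				return WAIT_OPT_NEGATIVE_TIMEOUT;
			have_timeout = true;
		}
		else if (strcmp(opt->name, "no_throw") == 0)
		{
			if (out->no_throw)
				return WAIT_OPT_DUPLICATE;
			out->no_throw = true;
		}
		else
			return WAIT_OPT_UNRECOGNIZED;
	}

	return have_lsn ? WAIT_OPT_OK : WAIT_OPT_MISSING_LSN;
}
```

The point of the sketch is that the grammar only has to recognize a list of
name/value pairs, while every rule about which names are allowed, how often,
and with which values lives in one ordinary C function. These are the same
error cases the parameter-validation tests in 049_wait_for_lsn.pl exercise.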
Links.
1. /messages/by-id/CABPTF7VsoGDMBq34MpLrMSZyxNZvVbgH6-zxtJOg5AwOoYURbw@mail.gmail.com
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v9-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 70fff63c02e85a197b727da1657bd24595fc8132 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Sun, 24 Aug 2025 20:10:37 +0800
Subject: [PATCH v9] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
---
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 218 ++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 388 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 284 +++++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 29 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 +
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 93 +++++
src/include/commands/wait.h | 21 +
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 281 +++++++++++++
src/tools/pgindent/typedefs.list | 5 +
30 files changed, 1474 insertions(+), 12 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..328ce7fe8ed
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,218 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ TIMEOUT <replaceable class="parameter">timeout</replaceable> ] [ NO_THROW ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ reports these conditions through its return value instead.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specifies not to throw an error in the case of a timeout or when
+ running on the primary. In this case the result status can be obtained
+ from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Return values</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on the standby.
+ That is, after this command execution, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency, while using an async replica for reads and
+ the primary for writes. In that case, the <acronym>LSN</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+ </programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+ </programlisting>
+ </para>
+
+ <para>
+ The same example with the <parameter>NO_THROW</parameter> option
+ returns a status instead of throwing an error.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+ </programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0baf0ac6160..7a078730e28 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6205,6 +6206,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 346319338a0..e709b7392cf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory heap and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for the replay of the given LSN on a
+ * physical standby. The core idea is very small: every backend that
+ * wants to wait publishes the LSN it needs to the shared memory, and
+ * the startup process wakes it once that LSN has been replayed.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the startup
+ * process, the timeout is reached, the standby is promoted, or the
+ * postmaster dies. Then, it cleans up its information in shared memory.
+ *
+ * After replaying a WAL record, the startup process first performs a
+ * fast path check minWaitedLSN > replayLSN. If this check is negative,
+ * it checks waitersHeap and wakes up the backends whose awaited LSNs
+ * are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap. Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not
+ * yet replayed. Record the process numbers to wake up, but to avoid
+ * holding the lock for too long, send the wakeups only after
+ * releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose awaited LSNs are already replayed.
+ * Since setting latches is time-consuming, we do this outside of
+ * WaitLSNLock. This is actually fine because procLatch isn't ever
+ * freed, so we just can potentially set the wrong process' (or no
+ * process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ }
+ while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed. Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * replayed. Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..1d59ddd81aa
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+ WaitStmtParamNone,
+ WaitStmtParamTimeout,
+ WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ WaitStmtParam curParam = WaitStmtParamNone;
+
+ /*
+ * Process the list of parameters.
+ */
+ bool haveLsn = false;
+ bool haveTimeout = false;
+ bool haveNoThrow = false;
+
+ foreach_ptr(Node, option, stmt->options)
+ {
+ if (IsA(option, String))
+ {
+ String *str = castNode(String, option);
+ char *name = str_tolower(str->sval, strlen(str->sval),
+ DEFAULT_COLLATION_OID);
+
+ if (curParam != WaitStmtParamNone)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected parameter after \"%s\"", name)));
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ if (haveLsn)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "lsn")));
+ haveLsn = true;
+ curParam = WaitStmtParamLSN;
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ if (haveTimeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "timeout")));
+ haveTimeout = true;
+ curParam = WaitStmtParamTimeout;
+ }
+ else if (strcmp(name, "no_throw") == 0)
+ {
+ if (haveNoThrow)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "no_throw")));
+ haveNoThrow = true;
+ throw = false;
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized parameter \"%s\"", name)));
+
+ }
+ else if (IsA(option, Integer))
+ {
+ Integer *intVal = castNode(Integer, option);
+
+ if (curParam != WaitStmtParamTimeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected integer value")));
+
+ timeout = intVal->ival;
+
+ curParam = WaitStmtParamNone;
+ }
+ else if (IsA(option, A_Const))
+ {
+ A_Const *constVal = castNode(A_Const, option);
+ String *str = &constVal->val.sval;
+
+ if (curParam != WaitStmtParamLSN &&
+ curParam != WaitStmtParamTimeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected string value")));
+
+ if (curParam == WaitStmtParamLSN)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(str->sval)));
+ }
+ else if (curParam == WaitStmtParamTimeout)
+ {
+ const char *hintmsg;
+ double result;
+
+ if (!parse_real(str->sval, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid value for timeout option: \"%s\"",
+ str->sval),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0));
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range for type bigint")));
+
+ timeout = (int64) result;
+ }
+
+ curParam = WaitStmtParamNone;
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected parameter type")));
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate a catalog snapshot if any. And we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, procedure, or a function.")));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..8675dfd2e99 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
+%type <node> wait_option
+%type <list> wait_option_list
+
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16403,6 +16407,25 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR wait_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
+
+wait_option_list:
+ wait_option { $$ = list_make1($1); }
+ | wait_option_list wait_option { $$ = lappend($1, $2); }
+ ;
+
+wait_option: ColLabel { $$ = (Node *) makeString($1); }
+ | NumericOnly { $$ = (Node *) $1; }
+ | Sconst { $$ = (Node *) makeStringConst($1, @1); }
+
+ ;
/*
* Aggregate decoration clauses
@@ -18051,6 +18074,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18708,6 +18732,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Clean up any pending wait for an LSN.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 5f442bc3bd4..398f4d2b363 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2060,6 +2068,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2115,6 +2126,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3092,6 +3106,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3690,6 +3708,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +356,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /*
+ * A pairing heap node for participation in
+ * waitLSNState->waitersHeap
+ */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..b44c37aa4db
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..fa5fb1a8897 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..9d06b5c060f
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,281 @@
+# Checks waiting for the lsn replay on standby using
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby sees the new rows after WAIT FOR");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance, so that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' TIMEOUT -1000;",
+ stderr => \$stderr);
+ok($stderr =~ /timeout.*must not be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' UNKNOWN_PARAM;",
+ stderr => \$stderr);
+ok($stderr =~ /unrecognized parameter/, "get error for unknown parameter");
+
+# Test duplicate LSN parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' LSN '${test_lsn}';",
+ stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate LSN parameter");
+
+# Test duplicate TIMEOUT parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' TIMEOUT 1000 TIMEOUT 2000;",
+ stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' NO_THROW NO_THROW;",
+ stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /lsn.*must be specified/, "get error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' TIMEOUT 'invalid';",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid value for timeout option/,
+ "get error for invalid timeout format");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, the log_count function logs whether at least as many rows are
+# visible as expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..49dab055752 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3255,7 +3255,12 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
+WaitStmtParam
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5 (Apple Git-154)
Hi Alexander,
On Sun, Sep 14, 2025 at 3:31 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Xuneng!
On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
I did a rebase for the patch to v8 and incorporated a few changes:
1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.
Started a new thread [1] and attached a patch addressing the polling
issue in the function
read_local_xlog_page_guts built on the infra of patch v8.
[1] /messages/by-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com
Feedback welcome.
Thank you for reviewing and revising this patch.
I see you've integrated most of your points expressed in [1]. I went
through them and I've integrated the rest of them. Except this one.
11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]
I didn't find examples of how we do similar things in other places
in the docs. This is why I decided to leave this place as it currently
is.
+1. I re-checked other commands with similar parameter patterns, and
they follow the approach in v9.
Also, I found some mess up with typedefs.list. I've returned the
changes to typedefs.list back and re-indented the sources.
Thanks for catching and fixing that.
I'd like to ask your opinion of the way this feature is implemented in
terms of grammar: generic parsing implemented in gram.y and the rest
is done in wait.c. I think this approach should minimize additional
keywords and states for parsing code. This comes at the price of more
complex code in wait.c, but I think this is a fair price.
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
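To make the trade-off concrete, here is a minimal, self-contained sketch of interpreting a generic option list outside the grammar. It is not the code from wait.c: the WaitOption and WaitParams types, the interpret_wait_options() function and the message strings are all hypothetical, chosen only to resemble the error cases exercised by the TAP test. It handles only integer millisecond timeouts, not values with units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed option: a name plus an optional value. */
typedef struct WaitOption
{
	const char *name;
	const char *value;			/* NULL for flag-style options */
} WaitOption;

/* Hypothetical result of interpreting the option list. */
typedef struct WaitParams
{
	const char *lsn;
	int64_t		timeout_ms;
	bool		no_throw;
} WaitParams;

/*
 * Interpret a generic option list.  Returns NULL on success, or an
 * illustrative message describing the first problem found.
 */
static const char *
interpret_wait_options(const WaitOption *opts, int nopts, WaitParams *out)
{
	bool		have_lsn = false;
	bool		have_timeout = false;

	assert(out != NULL);
	out->lsn = NULL;
	out->timeout_ms = 0;
	out->no_throw = false;

	for (int i = 0; i < nopts; i++)
	{
		const WaitOption *o = &opts[i];

		if (strcmp(o->name, "lsn") == 0)
		{
			if (have_lsn)
				return "parameter \"lsn\" specified more than once";
			if (o->value == NULL)
				return "lsn must be specified";
			out->lsn = o->value;
			have_lsn = true;
		}
		else if (strcmp(o->name, "timeout") == 0)
		{
			char	   *end;
			long long	v;

			if (have_timeout)
				return "parameter \"timeout\" specified more than once";
			if (o->value == NULL)
				return "invalid value for timeout option";
			v = strtoll(o->value, &end, 10);
			if (end == o->value || *end != '\0')
				return "invalid value for timeout option";
			if (v < 0)
				return "timeout must not be negative";
			out->timeout_ms = v;
			have_timeout = true;
		}
		else if (strcmp(o->name, "no_throw") == 0)
		{
			if (out->no_throw)
				return "parameter \"no_throw\" specified more than once";
			out->no_throw = true;
		}
		else
			return "unrecognized parameter";
	}

	/* The target LSN is the only mandatory option. */
	if (!have_lsn)
		return "lsn must be specified";
	return NULL;
}
```

In this sketch the grammar would only need to produce name/value pairs. Duplicate detection, validation and defaults all live in one function, which is roughly where the extra complexity ends up.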
Best,
Xuneng
Hi, Xuneng!
On Sun, Sep 14, 2025 at 4:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Sun, Sep 14, 2025 at 3:31 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Aug 27, 2025 at 6:54 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
I did a rebase for the patch to v8 and incorporated a few changes:
1) Updated documentation, added new tests, and applied minor code
adjustments based on prior review comments.
2) Tweaked the initialization of waitReplayLSNState so that
non-backend processes can call wait for replay.
Started a new thread [1] and attached a patch addressing the polling
issue in the function
read_local_xlog_page_guts built on the infra of patch v8.
[1] /messages/by-id/CABPTF7Vr99gZ5GM_ZYbYnd9MMnoVW3pukBEviVoHKRvJW-dE3g@mail.gmail.com
Feedback welcome.
Thank you for reviewing and revising this patch.
I see you've integrated most of your points expressed in [1]. I went
through them and I've integrated the rest of them. Except this one.
11) The synopsis might read more clearly as:
- WAIT FOR LSN '<lsn>' [ TIMEOUT <milliseconds | 'duration-with-units'> ] [ NO_THROW ]
I didn't find examples of how we do similar things in other places
in the docs. This is why I decided to leave this place as it currently
is.
+1. I re-checked other commands with similar parameter patterns, and
they follow the approach in v9.
Also, I found some mess up with typedefs.list. I've returned the
changes to typedefs.list back and re-indented the sources.
Thanks for catching and fixing that.
I'd like to ask your opinion of the way this feature is implemented in
terms of grammar: generic parsing implemented in gram.y and the rest
is done in wait.c. I think this approach should minimize additional
keywords and states for parsing code. This comes at the price of more
complex code in wait.c, but I think this is a fair price.
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
The attached revision of the patch contains a fix for the typo in the
comment that you reported off-list.
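
For readers following the thread, the wake-up scheme described in the patch's commit message (an LSN-ordered heap of waiters plus a cached minimum awaited LSN) can be sketched as below. This is only an illustration under stated assumptions: it uses a plain array-based binary min-heap rather than the pairing heap from the patch, omits locking and latches, and every name in it (WaitQueue, wq_add, wq_wakeup, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAITERS 64
#define NO_WAITERS	UINT64_MAX

/* Hypothetical waiter entry: the awaited LSN and the waiting process. */
typedef struct Waiter
{
	uint64_t	lsn;
	int			procno;
} Waiter;

typedef struct WaitQueue
{
	Waiter		heap[MAX_WAITERS];	/* binary min-heap ordered by lsn */
	int			n;
	uint64_t	min_waited;		/* cached smallest awaited LSN */
} WaitQueue;

static void
wq_init(WaitQueue *q)
{
	q->n = 0;
	q->min_waited = NO_WAITERS;
}

static void
wq_swap(Waiter *a, Waiter *b)
{
	Waiter		t = *a;

	*a = *b;
	*b = t;
}

/* Register a waiter and refresh the cached minimum. */
static void
wq_add(WaitQueue *q, uint64_t lsn, int procno)
{
	int			i = q->n;

	assert(i < MAX_WAITERS);
	q->n++;
	q->heap[i].lsn = lsn;
	q->heap[i].procno = procno;
	/* Sift the new entry up until its parent is not larger. */
	while (i > 0 && q->heap[(i - 1) / 2].lsn > q->heap[i].lsn)
	{
		wq_swap(&q->heap[(i - 1) / 2], &q->heap[i]);
		i = (i - 1) / 2;
	}
	q->min_waited = q->heap[0].lsn;
}

/* Remove and return the waiter with the smallest awaited LSN. */
static Waiter
wq_pop(WaitQueue *q)
{
	Waiter		top = q->heap[0];
	int			i = 0;

	q->heap[0] = q->heap[--q->n];
	for (;;)
	{
		int			l = 2 * i + 1;
		int			r = l + 1;
		int			m = i;

		if (l < q->n && q->heap[l].lsn < q->heap[m].lsn)
			m = l;
		if (r < q->n && q->heap[r].lsn < q->heap[m].lsn)
			m = r;
		if (m == i)
			break;
		wq_swap(&q->heap[i], &q->heap[m]);
		i = m;
	}
	q->min_waited = (q->n > 0) ? q->heap[0].lsn : NO_WAITERS;
	return top;
}

/*
 * Collect up to 'cap' waiters whose awaited LSN is <= 'replayed'.
 * Returns the number of process numbers written to 'woken'.
 */
static int
wq_wakeup(WaitQueue *q, uint64_t replayed, int *woken, int cap)
{
	int			count = 0;

	/* Fast path: nobody is waiting for an LSN this small. */
	if (q->min_waited > replayed)
		return 0;
	while (q->n > 0 && q->heap[0].lsn <= replayed && count < cap)
		woken[count++] = wq_pop(q).procno;
	return count;
}
```

The cached minimum is what keeps the per-record cost in the replay loop low: in the common case where no waiter is due, the check is a single comparison and the heap is never touched.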
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v10-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 63c1d54b6a2933167271277dc6ed3c3af70dd703 Mon Sep 17 00:00:00 2001
From: alterego665 <824662526@qq.com>
Date: Sun, 24 Aug 2025 20:10:37 +0800
Subject: [PATCH v10] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
---
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 218 ++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 388 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 284 +++++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 29 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 +
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 93 +++++
src/include/commands/wait.h | 21 +
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 7 +
src/include/parser/kwlist.h | 1 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 281 +++++++++++++
src/tools/pgindent/typedefs.list | 5 +
30 files changed, 1474 insertions(+), 12 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..328ce7fe8ed
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,218 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ TIMEOUT <replaceable class="parameter">timeout</replaceable> ] [ NO_THROW ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified.
+ If <literal>NO_THROW</literal> is specified, then the command
+ doesn't throw errors in these cases.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>LSN</literal> '<replaceable class="parameter">lsn</replaceable>'</term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> <replaceable class="parameter">timeout</replaceable></term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained from
+ the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Return values</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on standby.
+ That is, after this command execution, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency, while using async replica for reads and
+ primary for writes. In that case, the <acronym>lsn</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ <command>WAIT FOR</command> command should be called on standby.
+ If a user runs <command>WAIT FOR</command> on primary, it
+ will error out unless <literal>NO_THROW</literal> is specified.
+ However, if <command>WAIT FOR</command> is
+ called on primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use <command>WAIT FOR</command> command to wait for
+ the <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+ </programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT '0.1 s';
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+ </programlisting>
+ </para>
+
+ <para>
+ The same example uses <command>WAIT FOR</command> with
+ <parameter>NO_THROW</parameter> option.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' TIMEOUT 100 NO_THROW;
+ RESULT STATUS
+---------------
+ timeout
+(1 row)
+ </programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..7eb4625c5e9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0baf0ac6160..7a078730e28 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6205,6 +6206,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 346319338a0..e709b7392cf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1837,6 +1838,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for, then walk
+ * over the shared memory heap of waiters and set latches to
+ * notify them.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for the replay of the given LSN on a
+ * physical standby. The core idea is very small: every backend that
+ * wants to wait publishes the LSN it needs to the shared memory, and
+ * the startup process wakes it once that LSN has been replayed.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the startup
+ * process, the timeout is reached, the standby is promoted, or the
+ * postmaster dies. Then, it cleans up its information in shared memory.
+ *
+ * After replaying a WAL record, the startup process first performs a
+ * fast path check minWaitedLSN > replayLSN. If this check is negative,
+ * it checks waitersHeap and wakes up the backends whose awaited LSNs
+ * are reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap. Waiting processes are
+ * ordered by lsn, so that the waiter with the smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take single iteration for most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not
+ * yet replayed. Record the process numbers to wake up, but to avoid
+ * holding the lock for too long, send the wakeups only after
+ * releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose awaited LSNs have already been
+ * replayed. Since setting latches can be time-consuming, we do this
+ * outside of WaitLSNLock. This is actually fine because procLatch
+ * isn't ever freed, so at worst we set the wrong process' (or no
+ * process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ }
+ while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Remove our entry from the shared-memory pairing heap, if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was replayed. Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * was replayed. Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in
+ * recovery, or if the replica got promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this command was most likely called on the primary
+ * by mistake. However, it's possible that the standby was promoted
+ * concurrently with the call, after the target LSN had been replayed.
+ * So, we still check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag protects
+ * us from double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited due to the timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..ffcc0bbf457
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "catalog/pg_collation_d.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/fmgrprotos.h"
+#include "utils/formatting.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+typedef enum
+{
+ WaitStmtParamNone,
+ WaitStmtParamTimeout,
+ WaitStmtParamLSN
+} WaitStmtParam;
+
+void
+ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn = InvalidXLogRecPtr;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ WaitStmtParam curParam = WaitStmtParamNone;
+
+ /*
+ * Process the list of parameters.
+ */
+ bool haveLsn = false;
+ bool haveTimeout = false;
+ bool haveNoThrow = false;
+
+ foreach_ptr(Node, option, stmt->options)
+ {
+ if (IsA(option, String))
+ {
+ String *str = castNode(String, option);
+ char *name = str_tolower(str->sval, strlen(str->sval),
+ DEFAULT_COLLATION_OID);
+
+ if (curParam != WaitStmtParamNone)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected parameter after \"%s\"", name)));
+
+ if (strcmp(name, "lsn") == 0)
+ {
+ if (haveLsn)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "lsn")));
+ haveLsn = true;
+ curParam = WaitStmtParamLSN;
+ }
+ else if (strcmp(name, "timeout") == 0)
+ {
+ if (haveTimeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "timeout")));
+ haveTimeout = true;
+ curParam = WaitStmtParamTimeout;
+ }
+ else if (strcmp(name, "no_throw") == 0)
+ {
+ if (haveNoThrow)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("parameter \"%s\" specified more than once", "no_throw")));
+ haveNoThrow = true;
+ throw = false;
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unrecognized parameter \"%s\"", name)));
+
+ }
+ else if (IsA(option, Integer))
+ {
+ Integer *intVal = castNode(Integer, option);
+
+ if (curParam != WaitStmtParamTimeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected integer value")));
+
+ timeout = intVal->ival;
+
+ curParam = WaitStmtParamNone;
+ }
+ else if (IsA(option, A_Const))
+ {
+ A_Const *constVal = castNode(A_Const, option);
+ String *str = &constVal->val.sval;
+
+ if (curParam != WaitStmtParamLSN &&
+ curParam != WaitStmtParamTimeout)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected string value")));
+
+ if (curParam == WaitStmtParamLSN)
+ {
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(str->sval)));
+ }
+ else if (curParam == WaitStmtParamTimeout)
+ {
+ const char *hintmsg;
+ double result;
+
+ if (!parse_real(str->sval, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid value for timeout option: \"%s\"",
+ str->sval),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0));
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range for type bigint")));
+
+ timeout = (int64) result;
+ }
+
+ curParam = WaitStmtParamNone;
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("unexpected parameter type")));
+ }
+
+ if (XLogRecPtrIsInvalid(lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_PARAMETER),
+ errmsg("\"lsn\" must be specified")));
+
+ if (timeout < 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("\"timeout\" must not be negative")));
+
+ /*
+ * We are going to wait for the LSN replay. We should first make sure we
+ * don't hold a snapshot and, correspondingly, our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records,
+ * resulting in a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, we should
+ * be done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("Make sure WAIT FOR isn't called within a transaction with an isolation level higher than READ COMMITTED, within a procedure, or within a function.")));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to throw on success; just report the status */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL)))));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery.")));
+ }
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..8675dfd2e99 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -671,6 +671,9 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
+%type <node> wait_option
+%type <list> wait_option_list
+
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
@@ -785,7 +788,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1116,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16403,6 +16407,25 @@ xml_passing_mech:
| BY VALUE_P
;
+WaitStmt:
+ WAIT FOR wait_option_list
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->options = $3;
+ $$ = (Node *)n;
+ }
+ ;
+
+wait_option_list:
+ wait_option { $$ = list_make1($1); }
+ | wait_option_list wait_option { $$ = lappend($1, $2); }
+ ;
+
+wait_option: ColLabel { $$ = (Node *) makeString($1); }
+ | NumericOnly { $$ = (Node *) $1; }
+ | Sconst { $$ = (Node *) makeStringConst($1, @1); }
+
+ ;
/*
* Aggregate decoration clauses
@@ -18051,6 +18074,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18708,6 +18732,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Clean up waiting for LSN, if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 5f442bc3bd4..398f4d2b363 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt((WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2060,6 +2068,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2115,6 +2126,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3092,6 +3106,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3690,6 +3708,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +356,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay. An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /*
+ * A pairing heap node for participation in
+ * waitLSNState->waitersHeap
+ */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path check of whether any waiters need to be woken up after
+ * replaying a WAL record. Read lock-free; updates need WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..b44c37aa4db
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..fa5fb1a8897 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4364,4 +4364,11 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ List *options;
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..dec7ec6e5ec 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -494,6 +494,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..9d06b5c060f
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,281 @@
+# Checks waiting for LSN replay on a standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' TIMEOUT '1d';
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance, so that WAL records issued by
+# a concurrent autovacuum cannot make it reachable.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' TIMEOUT 10;");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' TIMEOUT 1000;",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' TIMEOUT '0.1 s' NO_THROW;]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' TIMEOUT 10 NO_THROW;]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' TIMEOUT -1000;",
+ stderr => \$stderr);
+ok($stderr =~ /timeout.*must not be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' UNKNOWN_PARAM;",
+ stderr => \$stderr);
+ok($stderr =~ /unrecognized parameter/, "get error for unknown parameter");
+
+# Test duplicate LSN parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' LSN '${test_lsn}';",
+ stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate LSN parameter");
+
+# Test duplicate TIMEOUT parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' TIMEOUT 1000 TIMEOUT 2000;",
+ stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' NO_THROW NO_THROW;",
+ stderr => \$stderr);
+ok( $stderr =~ /parameter.*specified more than once/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /lsn.*must be specified/, "get error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' TIMEOUT 'invalid';",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid value for timeout option/,
+ "get error for invalid timeout format");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR lsn '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' TIMEOUT '10ms' NO_THROW;]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..49dab055752 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3255,7 +3255,12 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
+WaitStmtParam
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5 (Apple Git-154)
On 2025-Sep-15, Alexander Korotkov wrote:
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.
Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like
WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Hi Álvaro,
Thanks for your review.
On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Sep-15, Alexander Korotkov wrote:
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.
I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.
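The loop-over-options pattern discussed here can be sketched roughly as follows. This is standalone C so that it compiles outside the PostgreSQL tree, not code from the patch: the Option struct is a hypothetical stand-in for DefElem, parse_wait_options() is an illustrative name, and the error strings only echo the messages the TAP test matches. Real code would report them through ereport().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for DefElem: an option name plus an optional value. */
typedef struct Option
{
    const char *name;
    const char *value;          /* NULL when the option takes no argument */
} Option;

/* Parsed result; the field names are illustrative only. */
typedef struct WaitOptions
{
    bool        timeout_given;
    const char *timeout;        /* raw text; unit parsing is left out */
    bool        no_throw;
} WaitOptions;

/*
 * Walk the option list once, rejecting duplicates and unknown names.
 * Returns NULL on success or a static error string on failure.
 */
static const char *
parse_wait_options(const Option *opts, size_t nopts, WaitOptions *out)
{
    out->timeout_given = false;
    out->timeout = NULL;
    out->no_throw = false;

    for (size_t i = 0; i < nopts; i++)
    {
        const Option *opt = &opts[i];

        if (strcmp(opt->name, "timeout") == 0)
        {
            if (out->timeout_given)
                return "parameter specified more than once";
            if (opt->value == NULL)
                return "timeout requires a value";
            out->timeout_given = true;
            out->timeout = opt->value;
        }
        else if (strcmp(opt->name, "no_throw") == 0)
        {
            if (out->no_throw)
                return "parameter specified more than once";
            out->no_throw = true;
        }
        else
            return "unrecognized parameter";
    }

    return NULL;
}
```

With a single loop like this, adding another option later means adding one more branch rather than a new grammar rule.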
Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like
WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
Shall make LSN an unreserved keyword as well.
Best,
Xuneng
Hi,
On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Álvaro,
Thanks for your review.
On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Sep-15, Alexander Korotkov wrote:
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.
I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.
Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like
WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
Shall make LSN an unreserved keyword as well.
Here's the updated v11. Many thanks Jian for off-list discussions and review.
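For readers skimming the patch, the wake-up scheme it describes (waiters ordered by LSN, a cached minimum for a fast-path check, wake-ups collected in fixed-size chunks) can be sketched as below. This is a simplified standalone model, not the patch's code. It assumes a sorted array instead of a pairing heap, uses no locking or latches, and omits the "wake everyone" case for InvalidXLogRecPtr. The names WaitQueue, wq_add and wq_wakeup are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS   64
#define WAKEUP_CHUNK  4         /* the patch uses 16; smaller here for brevity */

typedef uint64_t Lsn;

typedef struct WaitQueue
{
    Lsn         lsn[MAX_WAITERS];   /* kept sorted in ascending order */
    int         id[MAX_WAITERS];    /* waiter identifiers, parallel to lsn[] */
    size_t      n;
    Lsn         min_waited;         /* cached minimum; UINT64_MAX when empty */
} WaitQueue;

static void
wq_init(WaitQueue *q)
{
    q->n = 0;
    q->min_waited = UINT64_MAX;
}

/* Insert a waiter, keeping the array sorted (stable for equal LSNs). */
static bool
wq_add(WaitQueue *q, int id, Lsn lsn)
{
    size_t      pos = q->n;

    if (q->n == MAX_WAITERS)
        return false;

    while (pos > 0 && q->lsn[pos - 1] > lsn)
    {
        q->lsn[pos] = q->lsn[pos - 1];
        q->id[pos] = q->id[pos - 1];
        pos--;
    }
    q->lsn[pos] = lsn;
    q->id[pos] = id;
    q->n++;
    q->min_waited = q->lsn[0];
    return true;
}

/*
 * Remove every waiter whose LSN is <= replayed and store its id in woken[].
 * Waiters are taken in chunks of at most WAKEUP_CHUNK, mirroring the loop
 * in the patch. Returns the number of waiters removed (never more than cap).
 */
static size_t
wq_wakeup(WaitQueue *q, Lsn replayed, int *woken, size_t cap)
{
    size_t      total = 0;
    size_t      chunk;

    /* Fast path: nobody is waiting for an LSN this small. */
    if (replayed < q->min_waited)
        return 0;

    do
    {
        chunk = 0;
        while (chunk < q->n && chunk < WAKEUP_CHUNK &&
               total + chunk < cap && q->lsn[chunk] <= replayed)
            chunk++;

        for (size_t i = 0; i < chunk; i++)
            woken[total++] = q->id[i];

        /* Shift the remaining waiters to the front. */
        for (size_t i = chunk; i < q->n; i++)
        {
            q->lsn[i - chunk] = q->lsn[i];
            q->id[i - chunk] = q->id[i];
        }
        q->n -= chunk;
        q->min_waited = (q->n > 0) ? q->lsn[0] : UINT64_MAX;
    } while (chunk == WAKEUP_CHUNK);

    return total;
}
```

In the sketch the cached minimum keeps the common case (no waiter is due yet) down to a single comparison, which is the role minWaitedLSN plays for the startup process.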
Best,
Xuneng
Attachments:
v11-0001-Implement-WAIT-FOR-command.patch
From 0ee9a9275cd811f70a49560e0715556820fb81be Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sat, 27 Sep 2025 23:26:22 +0800
Subject: [PATCH v11] Implement WAIT FOR command
WAIT FOR is to be used on standby and waits for a specific WAL
location to be replayed. This command is useful when the user makes
some data changes on primary and needs a guarantee that these
changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
---
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 388 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 33 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 +
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 93 +++++
src/include/commands/wait.h | 22 +
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 293 +++++++++++++
src/tools/pgindent/typedefs.list | 5 +
30 files changed, 1435 insertions(+), 14 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <literal>NO_THROW</literal> is specified, the command reports
+ the outcome through its return value instead of raising an error.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing
+ an integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained
+ from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on standby.
+ That is, after this command completes, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an async replica for reads and
+ the primary for writes. In that case, the <acronym>LSN</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ <command>WAIT FOR</command> command should be called on standby.
+ If a user runs <command>WAIT FOR</command> on primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example, using <command>WAIT FOR</command> with the
+ <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 109713315c0..36b8ac6b855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6222,6 +6223,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..824b0942b34 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for, then walk
+ * over the shared memory pairing heap and set latches to notify
+ * the waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for the replay of the given LSN on a
+ * physical standby. The core idea is very small: every backend that
+ * wants to wait publishes the LSN it needs to the shared memory, and
+ * the startup process wakes it once that LSN has been replayed.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the startup
+ * process, the timeout is reached, the standby is promoted, or the
+ * postmaster dies. Then, it cleans up its information in shared memory.
+ *
+ * After replaying a WAL record, the startup process first performs a
+ * fast path check minWaitedLSN > replayLSN. If this check is negative,
+ * it checks waitersHeap and wakes up the backends whose awaited LSNs
+ * have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap. Waiting processes are
+ * ordered by lsn, so that the waiter with the smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not
+ * yet replayed. Record the process numbers to wake up, but to avoid
+ * holding the lock for too long, send the wakeups only after
+ * releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed.
+ * Since setting latches is time-consuming, we do this outside of
+ * WaitLSNLock. This is actually fine because procLatch isn't ever
+ * freed, so at worst we set the wrong process' (or no process')
+ * latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ }
+ while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed. Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * replayed. Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this function was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently with
+ * the call, while the target LSN is already replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery has ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. Then we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As the result we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful for storing the
+ * pairing heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..fd95f24fa74 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -319,6 +319,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -671,7 +672,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -741,7 +741,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -785,7 +785,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1113,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16403,6 +16404,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ opt_with '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17882,6 +17903,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18051,6 +18073,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18497,6 +18520,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18708,6 +18732,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +356,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process, which may wait for LSN replay. An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /*
+ * A pairing heap node for participation in
+ * waitLSNState->waitersHeap
+ */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f1706df58fd..997c72ab858 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4363,4 +4363,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..69a81e21fbb 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -269,6 +269,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -494,6 +495,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for the LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, a function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5a80b4359f..ac0252936be 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3257,7 +3257,12 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
+WaitStmt
+WaitStmtParam
WalCloseMethod
WalCompression
WalInsertClass
--
2.51.0
Hi,
On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Álvaro,
Thanks for your review.
On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Sep-15, Alexander Korotkov wrote:
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.

Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.

I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.

Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like

WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]

This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.

Shall make LSN an unreserved keyword as well.
Here's the updated v11. Many thanks Jian for off-list discussions and review.
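For illustration only, here is a minimal stand-alone sketch of the
"loop over a list of name/value options" approach discussed above. It is
not code from the patch: WaitOption, WaitParams and parse_wait_options are
hypothetical names, a plain array stands in for the List of DefElems, and
the timeout is parsed as integer milliseconds only (no units). The error
strings loosely mirror the ones the TAP test looks for.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: a name optionally followed by a value. */
typedef struct WaitOption
{
    const char *name;
    const char *value;          /* NULL for a bare flag such as no_throw */
} WaitOption;

/* Hypothetical result of interpreting the option list. */
typedef struct WaitParams
{
    long        timeout_ms;     /* 0 means wait indefinitely */
    bool        no_throw;
} WaitParams;

/*
 * Walk the option list once, rejecting unknown or repeated options.
 * Returns NULL on success, or a static error message on failure.
 */
static const char *
parse_wait_options(const WaitOption *opts, int nopts, WaitParams *out)
{
    bool        timeout_seen = false;
    bool        no_throw_seen = false;

    out->timeout_ms = 0;
    out->no_throw = false;

    for (int i = 0; i < nopts; i++)
    {
        if (strcmp(opts[i].name, "timeout") == 0)
        {
            char       *end;

            if (timeout_seen)
                return "conflicting or redundant options";
            timeout_seen = true;

            if (opts[i].value == NULL)
                return "invalid timeout value";
            out->timeout_ms = strtol(opts[i].value, &end, 10);
            /* Reject empty input and trailing garbage. */
            if (end == opts[i].value || *end != '\0')
                return "invalid timeout value";
            if (out->timeout_ms < 0)
                return "timeout cannot be negative";
        }
        else if (strcmp(opts[i].name, "no_throw") == 0)
        {
            if (no_throw_seen)
                return "conflicting or redundant options";
            no_throw_seen = true;
            out->no_throw = true;
        }
        else
            return "option not recognized";
    }

    return NULL;
}
```

With this shape, WITH (timeout '100', no_throw) would arrive as a
two-element array, and adding a new option later only means adding one
more branch to the loop rather than touching the grammar.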
v12 removed unused
+WaitStmt
+WaitStmtParam in pgindent/typedefs.list.
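For anyone skimming the attached patch, the wake-up logic it describes
(waiters kept in an LSN-ordered heap, woken in fixed-size chunks once
their LSN is replayed) can be sketched stand-alone as below. This is an
assumption-laden toy, not the patch code: it swaps the pairing heap for a
simple array-based binary min-heap, omits locking and latches entirely,
and WaitQueue, queue_add, queue_pop and queue_wakeup are hypothetical
names. An LSN of 0 plays the role of InvalidXLogRecPtr, meaning "wake
everyone".

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS   64
#define WAKEUP_CHUNK  4         /* toy counterpart of the static array size */
#define INVALID_LSN   0         /* stands in for InvalidXLogRecPtr */

typedef struct Waiter
{
    uint64_t    lsn;            /* LSN this waiter is waiting for */
    int         id;             /* stands in for the process number */
} Waiter;

typedef struct WaitQueue
{
    Waiter      heap[MAX_WAITERS];  /* binary min-heap ordered by lsn */
    int         n;
} WaitQueue;

static void
waiter_swap(Waiter *a, Waiter *b)
{
    Waiter      tmp = *a;

    *a = *b;
    *b = tmp;
}

/* Insert a waiter; returns 0 on success, -1 if the queue is full. */
static int
queue_add(WaitQueue *q, uint64_t lsn, int id)
{
    int         i;

    if (q->n == MAX_WAITERS)
        return -1;
    i = q->n++;
    q->heap[i] = (Waiter) {lsn, id};
    /* Sift up until the parent has a smaller or equal LSN. */
    while (i > 0 && q->heap[(i - 1) / 2].lsn > q->heap[i].lsn)
    {
        waiter_swap(&q->heap[(i - 1) / 2], &q->heap[i]);
        i = (i - 1) / 2;
    }
    return 0;
}

/* Remove and return the waiter with the smallest LSN (queue must be non-empty). */
static Waiter
queue_pop(WaitQueue *q)
{
    Waiter      top = q->heap[0];
    int         i = 0;

    q->heap[0] = q->heap[--q->n];
    for (;;)
    {
        int         l = 2 * i + 1;
        int         r = 2 * i + 2;
        int         m = i;

        if (l < q->n && q->heap[l].lsn < q->heap[m].lsn)
            m = l;
        if (r < q->n && q->heap[r].lsn < q->heap[m].lsn)
            m = r;
        if (m == i)
            break;
        waiter_swap(&q->heap[i], &q->heap[m]);
        i = m;
    }
    return top;
}

/*
 * Remove every waiter whose LSN is <= current_lsn (or all waiters when
 * current_lsn is INVALID_LSN) and record their ids in woken[], which must
 * have room for MAX_WAITERS entries.  Returns the number of ids recorded.
 */
static int
queue_wakeup(WaitQueue *q, uint64_t current_lsn, int *woken)
{
    int         total = 0;
    int         chunk_len;

    do
    {
        int         chunk[WAKEUP_CHUNK];

        chunk_len = 0;

        /* In the real code this part would run with the lock held. */
        while (q->n > 0 && chunk_len < WAKEUP_CHUNK)
        {
            if (current_lsn != INVALID_LSN && q->heap[0].lsn > current_lsn)
                break;
            chunk[chunk_len++] = queue_pop(q).id;
        }

        /* ... and this part (setting latches) after releasing it. */
        for (int i = 0; i < chunk_len; i++)
            woken[total++] = chunk[i];

        /* A full chunk means there may be more waiters to process. */
    } while (chunk_len == WAKEUP_CHUNK);

    return total;
}
```

The point of the chunking is that the notification step never runs under
the lock, at the price of re-entering the locked section when more
waiters are ready than fit in one chunk.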
Best,
Xuneng
Attachments:
v12-0001-Implement-WAIT-FOR-command.patch (application/octet-stream)
From d6fbbb3b0ad81c18657e6fafa50852bc9bf239e2 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sat, 27 Sep 2025 23:26:22 +0800
Subject: [PATCH v12] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
---
doc/src/sgml/high-availability.sgml | 54 +++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 388 ++++++++++++++++++
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 ++++++++++
src/backend/lib/pairingheap.c | 18 +-
src/backend/parser/gram.y | 33 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 +
.../utils/activity/wait_event_names.txt | 2 +
src/include/access/xlogwait.h | 93 +++++
src/include/commands/wait.h | 22 +
src/include/lib/pairingheap.h | 3 +
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/storage/lwlocklist.h | 1 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 293 +++++++++++++
src/tools/pgindent/typedefs.list | 3 +
30 files changed, 1433 insertions(+), 14 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/access/xlogwait.h
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <literal>NO_THROW</literal> is specified, the command reports
+ these conditions through its return value instead of throwing an error.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained from
+ the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> has been replayed on standby.
+ That is, after this command execution, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency, while using async replica for reads and
+ primary for writes. In that case, the <acronym>LSN</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on standby.
+ If a user runs <command>WAIT FOR</command> on primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> right after
+ the changes are made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>LSN</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example using <command>WAIT FOR</command> with the
+ <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Clean up waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 109713315c0..36b8ac6b855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6222,6 +6223,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..824b0942b34 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for, then walk
+ * over the shared memory heap of waiters and set latches to notify
+ * them.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+ WaitLSNWakeup(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4d831fbfa74
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,388 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for the given replay LSN, which is used in
+ * WAIT FOR lsn '...'
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for the replay of the given LSN on a
+ * physical standby. The core idea is very small: every backend that
+ * wants to wait publishes the LSN it needs to the shared memory, and
+ * the startup process wakes it once that LSN has been replayed.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on its latch until it is woken up by the startup
+ * process, the timeout is reached, the standby is promoted, or the
+ * postmaster dies. Then, it cleans up its information in the shared memory.
+ *
+ * After replaying a WAL record, the startup process first performs a
+ * fast path check minWaitedLSN > replayLSN. If this check is negative,
+ * it checks waitersHeap and wakes up the backends whose awaited LSNs
+ * have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for waitLSNState->waitersHeap heap. Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update waitLSNState->minWaitedLSN according to the current state of
+ * waitLSNState->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+ minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+ }
+
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ Assert(!procInfo->inHeap);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = true;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (!procInfo->inHeap)
+ {
+ LWLockRelease(WaitLSNLock);
+ return;
+ }
+
+ pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+ procInfo->inHeap = false;
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack. It should be large enough that a single iteration suffices
+ * in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+void
+WaitLSNWakeup(XLogRecPtr currentLSN)
+{
+ int i;
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the pairing heap of waiting processes till we find LSN not
+ * yet replayed. Record the process numbers to wake up, but to avoid
+ * holding the lock for too long, send the wakeups only after
+ * releasing the lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) &&
+ procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+ procInfo->inHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN();
+
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already replayed.
+ * Since this is a time-consuming operation, we do it outside of
+ * WaitLSNLock. This is actually fine because procLatch isn't ever
+ * freed, so at worst we set the wrong process' (or no process')
+ * latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ /* Need to recheck if there were more waiters than static array size. */
+ }
+ while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+ /*
+ * We do a fast-path check of the 'inHeap' flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, a timeout happens, the
+ * replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed. Returns
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout was reached before the target LSN
+ * replayed. Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery is not in progress. Given that we detected this in the
+ * very first check, this procedure was mistakenly called on primary.
+ * However, it's possible that standby was promoted concurrently to
+ * the procedure call, while target LSN is replayed. So, we still
+ * check the last replay LSN before reporting an error.
+ */
+ if (PromoteIsTriggered() && targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* If target LSN is already replayed, exit immediately */
+ if (targetLSN <= GetXLogReplayRecPtr(NULL))
+ return WAIT_LSN_RESULT_SUCCESS;
+ }
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the pairing heap of waiters. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery has ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter();
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory pairing heap. We might
+ * already be deleted by the startup process. The 'inHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter();
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, the
+ * preparation should be complete.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 9fd48acb1f8..fd95f24fa74 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -302,7 +302,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -319,6 +319,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -671,7 +672,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -741,7 +741,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -785,7 +785,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1113,6 +1113,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16403,6 +16404,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ opt_with '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17882,6 +17903,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18051,6 +18073,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18497,6 +18520,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18708,6 +18732,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..eb77924c4be 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,7 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +356,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..df8202528b9
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,93 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for LSN replay. An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is replayed */
+ ProcNumber procno;
+
+ /*
+ * A flag indicating that this item is present in
+ * waitLSNState->waitersHeap
+ */
+ bool inHeap;
+
+ /*
+ * A pairing heap node for participation in
+ * waitLSNState->waitersHeap
+ */
+ pairingheap_node phNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN;
+
+ /*
+ * A pairing heap of waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f1706df58fd..997c72ab858 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4363,4 +4363,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..69a81e21fbb 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -269,6 +269,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -494,6 +495,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance. So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5a80b4359f..e6ff42b9ea0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3257,6 +3257,9 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
WalCloseMethod
WalCompression
--
2.51.0
Hi,
On Sat, Oct 4, 2025 at 9:35 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Álvaro,
Thanks for your review.
On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Sep-15, Alexander Korotkov wrote:
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.
I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.
Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like
WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
Shall make LSN an unreserved keyword as well.
Here's the updated v11. Many thanks Jian for off-list discussions and review.
v12 removed unused +WaitStmt +WaitStmtParam in pgindent/typedefs.list.
Hi, I’ve split the patch into multiple patch sets for easier review,
per Michael’s advice [1].
[1]: /messages/by-id/aOMsv9TszlB1n-W7@paquier.xyz
Best,
Xuneng
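For readers following the option-list discussion above, the "loop over a list of name/value elements, rejecting duplicates and unknown names" pattern can be sketched roughly as below. This is only a standalone illustration, not code from the patch: DemoOption, DemoWaitOptions, DemoParseStatus and demo_parse_wait_options() are hypothetical names standing in for DefElem and the real parsing in wait.c, and the timeout is assumed to be a plain non-negative number of milliseconds (no unit suffixes, unlike the patch, which uses parse_real()).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: a name with an optional value. */
typedef struct DemoOption
{
    const char *name;
    const char *value;          /* NULL when the option is a bare flag */
} DemoOption;

/* Hypothetical parsed form of the WITH (...) options. */
typedef struct DemoWaitOptions
{
    long        timeout_ms;     /* 0 means wait indefinitely */
    int         no_throw;       /* 1 if no_throw was given */
} DemoWaitOptions;

typedef enum DemoParseStatus
{
    DEMO_OK,
    DEMO_DUPLICATE,             /* "conflicting or redundant options" */
    DEMO_UNKNOWN,               /* option not recognized */
    DEMO_BAD_VALUE              /* missing, malformed or negative timeout */
} DemoParseStatus;

/*
 * Walk the option list once.  Each known option may appear at most once;
 * anything else is rejected.  Simplification: the timeout must be a plain
 * integer number of milliseconds.
 */
DemoParseStatus
demo_parse_wait_options(const DemoOption *opts, size_t nopts,
                        DemoWaitOptions *out)
{
    int         timeout_seen = 0;
    int         no_throw_seen = 0;

    assert(out != NULL);
    out->timeout_ms = 0;
    out->no_throw = 0;

    for (size_t i = 0; i < nopts; i++)
    {
        if (strcmp(opts[i].name, "timeout") == 0)
        {
            char       *end;
            long        val;

            if (timeout_seen)
                return DEMO_DUPLICATE;
            timeout_seen = 1;

            if (opts[i].value == NULL)
                return DEMO_BAD_VALUE;

            errno = 0;
            val = strtol(opts[i].value, &end, 10);
            if (end == opts[i].value || *end != '\0' ||
                errno == ERANGE || val < 0)
                return DEMO_BAD_VALUE;

            out->timeout_ms = val;
        }
        else if (strcmp(opts[i].name, "no_throw") == 0)
        {
            if (no_throw_seen)
                return DEMO_DUPLICATE;
            no_throw_seen = 1;
            out->no_throw = 1;
        }
        else
            return DEMO_UNKNOWN;
    }

    return DEMO_OK;
}
```

The point of the single loop is that the grammar stays minimal and all interpretation (duplicates, unknown names, value checks) lives in one place, which is the property of the VACUUM/EXPLAIN style mentioned in the review.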
Attachments:
v13-0003-Implement-WAIT-FOR-command.patch (application/x-patch)
From c3dd9972d8043c07247bb3e2b476026268ee1bad Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 14 Oct 2025 20:50:04 +0800
Subject: [PATCH v13 3/3] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this aspect.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 27 +-
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 ++++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 5 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/access/xlogwait.h | 3 +-
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 293 ++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 3 +
23 files changed, 951 insertions(+), 13 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <literal>NO_THROW</literal> is specified, the command reports
+ the outcome through its return value instead.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> may be given as an integer number of
+ milliseconds. It may also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained from
+ the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on the standby.
+ That is, after this command completes successfully, the value returned by
+ <function>pg_last_wal_replay_lsn</function> is greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an asynchronous replica for reads and
+ the primary for writes. In that case, the <acronym>LSN</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from a standby and <parameter>lsn</parameter>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from the primary. After that, the
+ changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example using <command>WAIT FOR</command> with the
+ <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..1859d2084e8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+ WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index a114738bddf..7c8134f1209 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -373,9 +373,10 @@ WaitLSNCleanup(void)
* or replica got promoted before the target LSN replayed.
*/
WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
{
XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
/* Shouldn't be called when shmem isn't initialized */
@@ -404,6 +405,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
return WAIT_LSN_RESULT_SUCCESS;
}
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
/*
* Add our process to the replay waiters heap. It might happen that
* target LSN gets replayed before we do. Another check at the beginning
@@ -438,6 +445,18 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
break;
}
+ /*
+ * If the timeout value is specified, calculate the number of
+ * milliseconds before the timeout. Exit if the timeout is already
+ * reached.
+ */
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
CHECK_FOR_INTERRUPTS();
rc = WaitLatch(MyLatch, wake_events, delay_ms,
@@ -464,6 +483,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
*/
deleteLSNWaiter(WAIT_LSN_REPLAY);
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
return WAIT_LSN_RESULT_SUCCESS;
}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We should first make sure that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. And we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 21caf2d43bf..1d016df1f6b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ opt_with '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..f8685fa9039 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -947,6 +947,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 441bf475b4d..2e33a1d22d0 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
* wait */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
} WaitLSNResult;
/*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
extern void WaitForLSNFlush(XLogRecPtr targetLSN);
#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dc09d1a3f03..c741099e186 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be well in advance, so that WAL records issued by
+# a concurrent autovacuum cannot make it reachable.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions, each waiting for a corresponding insertion. When the wait
+# is finished, a stored function logs whether as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
WalCloseMethod
WalCompression
--
2.51.0
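
As a reading aid for the patch above: the result handling at the end of ExecWaitStmt() boils down to a small mapping from WaitLSNResult to either a status string or an error. The sketch below is standalone C rather than PostgreSQL code. The helper wait_status_text(), its throw_errors flag, and wait_status_demo() are hypothetical names introduced only for illustration, and a NULL return stands in for the ereport(ERROR) paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the WaitLSNResult values declared in xlogwait.h by the patch. */
typedef enum
{
    WAIT_LSN_RESULT_SUCCESS,
    WAIT_LSN_RESULT_NOT_IN_RECOVERY,
    WAIT_LSN_RESULT_TIMEOUT
} WaitLSNResult;

/*
 * Hypothetical helper (not in the patch): return the text that WAIT FOR
 * would emit in its single "status" column, or NULL where the command
 * would raise an error instead of returning a row.
 */
static const char *
wait_status_text(WaitLSNResult res, bool throw_errors)
{
    switch (res)
    {
        case WAIT_LSN_RESULT_SUCCESS:
            /* Success never raises an error, regardless of no_throw. */
            return "success";
        case WAIT_LSN_RESULT_TIMEOUT:
            return throw_errors ? NULL : "timeout";
        case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
            return throw_errors ? NULL : "not in recovery";
    }
    return NULL;                /* unreachable for valid enum values */
}

/* Small usage demonstration; assumed for this sketch only. */
void
wait_status_demo(void)
{
    /* With no_throw set, a timeout is reported as a status row. */
    assert(strcmp(wait_status_text(WAIT_LSN_RESULT_TIMEOUT, false),
                  "timeout") == 0);
    /* Without no_throw, the same outcome would be an error. */
    assert(wait_status_text(WAIT_LSN_RESULT_TIMEOUT, true) == NULL);
}
```

The three strings match what the TAP test expects from the no_throw variants ("success", "timeout", "not in recovery").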
Attachment: v13-0002-Add-infrastructure-for-efficient-LSN-waiting.patch
From 32dab7ed64eecb62adce6b1d124b1fa389515e74 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 10 Oct 2025 16:35:38 +0800
Subject: [PATCH v13 2/3] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 525 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 112 ++++
src/include/storage/lwlocklist.h | 1 +
7 files changed, 647 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
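
The commit message above describes a min-ordered pairing heap of waiters plus a cached minimum awaited LSN that enables a fast-path check. A minimal single-process sketch of that idea follows. It is not the patch's code: it omits the WaitLSN lock, the atomics, and the latches, and every name in it (Waiter, WaitState, add_waiter, wakeup_waiters) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Waiter
{
    uint64_t    wait_lsn;       /* LSN this waiter needs */
    bool        woken;          /* stands in for setting the latch */
    struct Waiter *child;       /* first child in the pairing heap */
    struct Waiter *sibling;     /* next sibling */
} Waiter;

typedef struct
{
    Waiter     *heap;           /* root holds the smallest awaited LSN */
    uint64_t    min_waited_lsn; /* cached minimum; UINT64_MAX if empty */
} WaitState;

/* Merge two heaps; the root with the smaller LSN adopts the other. */
static Waiter *
heap_merge(Waiter *a, Waiter *b)
{
    if (a == NULL)
        return b;
    if (b == NULL)
        return a;
    if (b->wait_lsn < a->wait_lsn)
    {
        Waiter     *tmp = a;

        a = b;
        b = tmp;
    }
    b->sibling = a->child;
    a->child = b;
    return a;
}

/* Pairwise merge of a sibling list, used after removing the root. */
static Waiter *
merge_pairs(Waiter *first)
{
    Waiter     *second;
    Waiter     *rest;

    if (first == NULL || first->sibling == NULL)
        return first;
    second = first->sibling;
    rest = second->sibling;
    first->sibling = NULL;
    second->sibling = NULL;
    return heap_merge(heap_merge(first, second), merge_pairs(rest));
}

static void
update_min(WaitState *s)
{
    s->min_waited_lsn = s->heap ? s->heap->wait_lsn : UINT64_MAX;
}

static void
add_waiter(WaitState *s, Waiter *w, uint64_t lsn)
{
    assert(w != NULL);
    w->wait_lsn = lsn;
    w->woken = false;
    w->child = w->sibling = NULL;
    s->heap = heap_merge(s->heap, w);
    update_min(s);
}

/* Wake every waiter whose LSN is reached; returns how many were woken. */
static int
wakeup_waiters(WaitState *s, uint64_t current_lsn)
{
    int         n = 0;

    /* Fast path: nobody waits for an LSN at or below current_lsn. */
    if (s->min_waited_lsn > current_lsn)
        return 0;

    while (s->heap != NULL && s->heap->wait_lsn <= current_lsn)
    {
        Waiter     *w = s->heap;

        s->heap = merge_pairs(w->child);
        w->child = NULL;
        w->woken = true;
        n++;
    }
    update_min(s);
    return n;
}
```

In this sketch the common case, where no waiter is due, costs a single comparison against min_waited_lsn; the heap is only touched when at least one waiter can be released.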
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..4faed65765c
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,525 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ * Used by internal WAL reading operations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into pairing heaps (replayWaitersHeap and flushWaitersHeap), which
+ * allow for very fast finding of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN of each heap is cached as
+ * minWaitedReplayLSN or minWaitedFlushLSN. The waiter publishes itself
+ * in shared memory and waits on its latch until it is woken up by the
+ * appropriate process, the standby is promoted, or the postmaster dies.
+ * Then, it removes the information about itself from shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast path check minWaitedReplayLSN > replayLSN. If
+ * this check is negative, it checks replayWaitersHeap and wakes up the
+ * backends whose awaited LSNs have been reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+static int waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ /* Initialize replay heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+ /* Initialize flush heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+ }
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ Assert(!procInfo->inReplayHeap);
+ pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ Assert(!procInfo->inFlushHeap);
+ pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+ {
+ pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+ {
+ pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by wakeupWaiters(), allocated
+ * on the stack. It should be enough to take a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i;
+ pairingheap *heap;
+
+ /* Select appropriate heap */
+ heap = (operation == WAIT_LSN_REPLAY) ?
+ &waitLSNState->replayWaitersHeap :
+ &waitLSNState->flushWaitersHeap;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached.
+ * Record process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(heap))
+ {
+ pairingheap_node *node = pairingheap_first(heap);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ else
+ procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(heap);
+
+ /* Update appropriate flag */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo->inReplayHeap = false;
+ else
+ procInfo->inFlushHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(operation);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs have already been reached.
+ * Since setting latches is time-consuming, we do it outside of
+ * WaitLSNLock. This is actually fine because procLatch isn't ever
+ * freed, so at worst we set the wrong process' (or no
+ * process') latch.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+ if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ /*
+ * Add our process to the replay waiters heap. It might happen that
+ * target LSN gets replayed before we do so. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = 0;
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory replay heap. We might
+ * already be deleted by the startup process. The 'inReplayHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+ /* We can only wait for flush when we are not in recovery */
+ Assert(!RecoveryInProgress());
+
+ /* Quick exit if already flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ return;
+
+ /* Add to flush waiters */
+ addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+ /* Wait loop */
+ for (;;)
+ {
+ int rc;
+
+ /* Check if the waited LSN has been flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN flush")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory flush heap. We might
+ * already be deleted by the waker process. The 'inFlushHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+ return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..441bf475b4d
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+ WAIT_LSN_REPLAY, /* Waiting for replay on standby */
+ WAIT_LSN_FLUSH /* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Type-safe heap membership flags */
+ bool inReplayHeap; /* In replay waiters heap */
+ bool inFlushHeap; /* In flush waiters heap */
+
+ /* Separate heap nodes for type safety */
+ pairingheap_node replayHeapNode;
+ pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum replay LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedReplayLSN;
+
+ /*
+ * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap replayWaitersHeap;
+
+ /*
+ * The minimum flush LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after flushing
+ * WAL. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedFlushLSN;
+
+ /*
+ * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap flushWaitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
--
2.51.0
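For readers skimming the patch above, the waiter bookkeeping (a minimum waited LSN that is checked without a lock, plus batched removal of satisfied waiters) can be sketched in standalone C roughly as follows. This is only an illustration under stated assumptions: every demo_* name is hypothetical, a plain array stands in for the pairing heap, locking and latch setting are reduced to comments, and the "wake everyone" case for InvalidXLogRecPtr is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_MAX_WAITERS 64		/* hypothetical capacity */
#define DEMO_BATCH 16			/* plays the role of WAKEUP_PROC_STATIC_ARRAY_SIZE */

/* Hypothetical stand-in for WaitLSNState; not the structure from the patch. */
typedef struct demo_waiters
{
	_Atomic uint64_t min_waited;	/* UINT64_MAX when nobody is waiting */
	int			count;
	uint64_t	lsn[DEMO_MAX_WAITERS];	/* unsorted; the patch uses a heap */
} demo_waiters;

void
demo_init(demo_waiters *w)
{
	atomic_init(&w->min_waited, UINT64_MAX);
	w->count = 0;
}

/* Recompute the published minimum; the patch does this under WaitLSNLock. */
void
demo_update_min(demo_waiters *w)
{
	uint64_t	min = UINT64_MAX;

	for (int i = 0; i < w->count; i++)
		if (w->lsn[i] < min)
			min = w->lsn[i];
	atomic_store(&w->min_waited, min);
}

void
demo_add(demo_waiters *w, uint64_t lsn)
{
	assert(w->count < DEMO_MAX_WAITERS);
	w->lsn[w->count++] = lsn;
	demo_update_min(w);
}

/* Release every waiter whose LSN is <= current; returns how many were released. */
int
demo_wakeup(demo_waiters *w, uint64_t current)
{
	int			released = 0;
	int			batch;

	/* Fast path: nobody waits for an LSN this low, so skip the lock. */
	if (atomic_load(&w->min_waited) > current)
		return 0;

	do
	{
		batch = 0;
		/* (the lock would be acquired here) */
		for (int i = 0; i < w->count && batch < DEMO_BATCH;)
		{
			if (w->lsn[i] <= current)
			{
				/* Remove by moving the last entry into this slot. */
				w->lsn[i] = w->lsn[--w->count];
				batch++;
			}
			else
				i++;
		}
		demo_update_min(w);
		/* (the lock would be released and latches set here) */
		released += batch;
	} while (batch == DEMO_BATCH);	/* a full batch means more may remain */

	return released;
}
```

The point of the batch limit is that the array of processes to wake can live on the stack, while the loop still handles the rare case of more waiters than one batch holds.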
v13-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (application/x-patch)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v13 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
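As a rough illustration of the refactoring described above, the following standalone C sketch separates in-place initialization from allocation. It is not the PostgreSQL code: the demo_* types and functions are hypothetical stand-ins for pairingheap and its comparator, and malloc() stands in for palloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the pairingheap types; not PostgreSQL definitions. */
typedef struct demo_node demo_node;
typedef int (*demo_comparator) (const demo_node *a, const demo_node *b, void *arg);

struct demo_node
{
	demo_node  *first_child;
	demo_node  *next_sibling;
};

typedef struct demo_heap
{
	demo_comparator compare;
	void	   *arg;
	demo_node  *root;
} demo_heap;

/* Initialize caller-provided storage in place; nothing is allocated here. */
void
demo_heap_initialize(demo_heap *heap, demo_comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* The allocating variant becomes a thin wrapper around the initializer. */
demo_heap *
demo_heap_allocate(demo_comparator compare, void *arg)
{
	demo_heap  *heap = malloc(sizeof(demo_heap));

	if (heap != NULL)
		demo_heap_initialize(heap, compare, arg);
	return heap;
}

/* A struct that could live in shared memory, with the heap embedded by value. */
typedef struct demo_shared_state
{
	int			nwaiters;
	demo_heap	waiters;
} demo_shared_state;

/* Trivial comparator, only here so the sketch has something to pass. */
int
demo_cmp(const demo_node *a, const demo_node *b, void *arg)
{
	(void) a;
	(void) b;
	(void) arg;
	return 0;
}
```

With this split, a caller that owns the memory (for example a struct placed in shared memory) calls the initializer on the embedded member, while existing callers keep using the allocating function unchanged.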
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.51.0
Hi,
On Tue, Oct 14, 2025 at 9:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Sat, Oct 4, 2025 at 9:35 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Álvaro,
Thanks for your review.
On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Sep-15, Alexander Korotkov wrote:
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.
I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.
Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like
WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
Shall make LSN an unreserved keyword as well.
Here's the updated v11. Many thanks Jian for off-list discussions and review.
v12 removed unused +WaitStmt +WaitStmtParam in pgindent/typedefs.list.
Hi, I’ve split the patch into multiple patch sets for easier review,
per Michael’s advice [1].
Patch 2 in v13 is corrupted and patch 3 has an error. Sorry for the
noise. Here's v14.
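For anyone following the option-list suggestion quoted above, the "loop over name/value pairs" approach could look roughly like the standalone C sketch below. It is only a sketch under assumptions: demo_option is a hypothetical stand-in for DefElem, the function and struct names are invented for illustration, errors are returned as -1 rather than raised with ereport(), and the timeout is parsed as plain integer milliseconds without unit support.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: a name optionally followed by a value. */
typedef struct demo_option
{
	const char *name;			/* option name, assumed already lower-cased */
	const char *value;			/* NULL when the option was given bare */
} demo_option;

/* Hypothetical result of interpreting the WAIT FOR option list. */
typedef struct demo_wait_params
{
	long		timeout_ms;		/* 0 means wait indefinitely */
	bool		no_throw;
} demo_wait_params;

/*
 * Walk the option list once, matching each entry by name.
 * Returns 0 on success, -1 on an unknown option, a duplicate, or a bad value.
 */
int
demo_parse_wait_options(const demo_option *opts, size_t nopts,
						demo_wait_params *out)
{
	bool		timeout_seen = false;
	bool		no_throw_seen = false;

	out->timeout_ms = 0;
	out->no_throw = false;

	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "timeout") == 0)
		{
			char	   *end;

			if (timeout_seen || opts[i].value == NULL)
				return -1;
			timeout_seen = true;
			out->timeout_ms = strtol(opts[i].value, &end, 10);
			if (end == opts[i].value || *end != '\0' || out->timeout_ms < 0)
				return -1;
		}
		else if (strcmp(opts[i].name, "no_throw") == 0)
		{
			if (no_throw_seen)
				return -1;
			no_throw_seen = true;
			out->no_throw = true;
		}
		else
			return -1;			/* unrecognized option */
	}
	return 0;
}
```

The grammar then only needs to produce a generic list, and adding a new option later means adding one more branch to the loop rather than a new grammar rule.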
Best,
Xuneng
Attachments:
v14-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 40b49e1f21ab0af763e2875614a5105bad4fb2f6 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 14 Oct 2025 22:46:31 +0800
Subject: [PATCH v14] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this aspect.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 24 +-
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 ++++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/access/xlogwait.h | 3 +-
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 293 ++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 3 +
23 files changed, 948 insertions(+), 14 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ RESULT STATUS
+---------------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <literal>NO_THROW</literal> is specified, these conditions are
+ reported through the return value instead.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained from
+ the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> has been replayed on the standby.
+ That is, after this command completes successfully, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful for achieving
+ read-your-writes consistency when using an asynchronous replica for reads
+ and the primary for writes. In that case, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from a standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+ Then the application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from the primary. After that, the
+ changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example using <command>WAIT FOR</command> with the
+ <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Clean up waiting for LSN, if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery ended before reaching the target LSN.
+ */
+ WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..f848ac8a77d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for, then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+ WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 621f790bbdb..c5d269d6e06 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -364,9 +364,10 @@ WaitLSNCleanup(void)
* or replica got promoted before the target LSN replayed.
*/
WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
{
XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
/* Shouldn't be called when shmem isn't initialized */
@@ -375,6 +376,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
/* Should have a valid proc number */
Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+ if (timeout > 0) {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
/*
* Add our process to the replay waiters heap. It might happen that
* target LSN gets replayed before we do. Another check at the beginning
@@ -385,6 +391,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
for (;;)
{
int rc;
+ long delay_ms = 0;
currentLSN = GetXLogReplayRecPtr(NULL);
/* Recheck that recovery is still in-progress */
@@ -407,9 +414,16 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
break;
}
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
CHECK_FOR_INTERRUPTS();
- rc = WaitLatch(MyLatch, wake_events, -1,
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
/*
@@ -433,6 +447,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
*/
deleteLSNWaiter(WAIT_LSN_REPLAY);
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
return WAIT_LSN_RESULT_SUCCESS;
}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We should first make sure we
+ * don't hold a snapshot and, correspondingly, that our MyProc->xmin is
+ * invalid. Otherwise, our snapshot could prevent the replay of WAL
+ * records, implying a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check that there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, we
+ * should be done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 21caf2d43bf..1d016df1f6b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ opt_with '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Clean up waiting for LSN, if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 441bf475b4d..2e33a1d22d0 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
* wait */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
} WaitLSNResult;
/*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
extern void WaitForLSNFlush(XLogRecPtr targetLSN);
#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dc09d1a3f03..c741099e186 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be well ahead, so that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, the log_count() function logs whether at least as many rows are
+# visible as there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
WalCloseMethod
WalCompression
--
2.51.0
Attachment: v14-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch (application/octet-stream)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v14 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.51.0
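The split between pairingheap_allocate() and pairingheap_initialize() described in the commit message above can be illustrated outside PostgreSQL. What follows is only a minimal standalone sketch, not the patch's code: toy_heap, toy_heap_initialize(), toy_heap_allocate(), shared_state and toy_cmp_min_first() are hypothetical names, malloc() stands in for palloc(), and a plain struct stands in for a shared memory segment.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for pairingheap: a comparator, its argument, a root. */
typedef int (*toy_comparator) (const void *a, const void *b, void *arg);

typedef struct toy_heap
{
	toy_comparator compare;
	void	   *arg;
	void	   *root;
} toy_heap;

/* Initialize an already-allocated heap in place; nothing is allocated here. */
void
toy_heap_initialize(toy_heap *heap, toy_comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate from process-local memory, then reuse the in-place initializer. */
toy_heap *
toy_heap_allocate(toy_comparator compare, void *arg)
{
	toy_heap   *heap = malloc(sizeof(toy_heap));

	if (heap != NULL)
		toy_heap_initialize(heap, compare, arg);
	return heap;
}

/* Stand-in for a struct placed in shared memory, with the heap embedded. */
typedef struct shared_state
{
	unsigned long min_waited;
	toy_heap	waiters;
} shared_state;

/* Example comparator: the smaller key compares as "greater" (min on top). */
int
toy_cmp_min_first(const void *a, const void *b, void *arg)
{
	unsigned long x = *(const unsigned long *) a;
	unsigned long y = *(const unsigned long *) b;

	(void) arg;
	if (x < y)
		return 1;
	if (x > y)
		return -1;
	return 0;
}
```

The embedded heap never goes through the allocating path: a caller that owns a shared_state simply calls toy_heap_initialize(&state->waiters, ...), which appears to be the same shape as the WaitLSNShmemInit() usage in patch 2/3.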
Attachment: v14-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 645e19b2d0d522c16eb731da527baf18f73a7ec2 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 14 Oct 2025 22:12:23 +0800
Subject: [PATCH v14 2/3] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 503 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 112 ++++
src/include/storage/lwlocklist.h | 1 +
7 files changed, 625 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..621f790bbdb
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the appropriate
+ * process, the standby is promoted, or the postmaster dies. Then, it
+ * cleans up information about itself in the shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast path check minWaitedLSN > replayLSN. If this
+ * check is negative, it checks waitersHeap and wakes up the backends
+ * whose awaited LSNs have been reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+static int waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ /* Initialize replay heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+ /* Initialize flush heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+ }
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ Assert(!procInfo->inReplayHeap);
+ pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ Assert(!procInfo->inFlushHeap);
+ pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+ {
+ pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+ {
+ pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up in wakeupWaiters(), allocated
+ * on the stack. It should be enough to handle most cases in one iteration.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i;
+ pairingheap *heap;
+
+ /* Select appropriate heap */
+ heap = (operation == WAIT_LSN_REPLAY) ?
+ &waitLSNState->replayWaitersHeap :
+ &waitLSNState->flushWaitersHeap;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached.
+ * Record process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(heap))
+ {
+ pairingheap_node *node = pairingheap_first(heap);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ else
+ procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(heap);
+
+ /* Update appropriate flag */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo->inReplayHeap = false;
+ else
+ procInfo->inFlushHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(operation);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose awaited LSNs have been reached.
+ * Setting a latch is a comparatively expensive operation, so we do
+ * this outside of WaitLSNLock. That is safe because procLatch is never
+ * freed; at worst, we set the latch of the wrong process (or of no
+ * process at all).
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+ if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in recovery,
+ * or if the replica got promoted before the target LSN was replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ /*
+ * Add our process to the replay waiters heap. It might happen that the
+ * target LSN gets replayed before we do so. Another check at the
+ * beginning of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+ for (;;)
+ {
+ int rc;
+ currentLSN = GetXLogReplayRecPtr(NULL);
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory replay heap. We might
+ * already be deleted by the startup process. The 'inReplayHeap' flag
+ * protects us from double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+ /* We can only wait for flush when we are not in recovery */
+ Assert(!RecoveryInProgress());
+
+ /* Quick exit if already flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ return;
+
+ /* Add to flush waiters */
+ addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+ /* Wait loop */
+ for (;;)
+ {
+ int rc;
+
+ /* Check if the waited LSN has been flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN flush")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory flush heap. We might
+ * already be deleted by the waker process. The 'inFlushHeap' flag
+ * protects us from double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+ return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..441bf475b4d
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "postgres.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+ WAIT_LSN_REPLAY, /* Waiting for replay on standby */
+ WAIT_LSN_FLUSH /* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Type-safe heap membership flags */
+ bool inReplayHeap; /* In replay waiters heap */
+ bool inFlushHeap; /* In flush waiters heap */
+
+ /* Separate heap nodes for type safety */
+ pairingheap_node replayHeapNode;
+ pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum replay LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedReplayLSN;
+
+ /*
+ * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap replayWaitersHeap;
+
+ /*
+ * The minimum flush LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after flushing
+ * WAL. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedFlushLSN;
+
+ /*
+ * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap flushWaitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
--
2.51.0
Hi,
On Wed, Oct 15, 2025 at 8:23 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Tue, Oct 14, 2025 at 9:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Sat, Oct 4, 2025 at 9:35 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Álvaro,
Thanks for your review.
On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Sep-15, Alexander Korotkov wrote:
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.
I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.
Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like
WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
Shall make LSN an unreserved keyword as well.
Here's the updated v11. Many thanks Jian for off-list discussions and review.
v12 removed unused +WaitStmt +WaitStmtParam in pgindent/typedefs.list.
Hi, I’ve split the patch into multiple patch sets for easier review,
per Michael’s advice [1].
Patch 2 in v13 is corrupted and patch 3 has an error. Sorry for the
noise. Here's v14.
Made minor changes to #include of xlogwait.h in patch2 to calm CF-bots down.
Best,
Xuneng
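The review suggestion quoted above (collect options as a list of name/value pairs and loop over it) can be sketched outside the tree as follows. This is a standalone toy, not the code in wait.c: ToyOption is a hypothetical stand-in for DefElem, toy_parse_wait_options() is a made-up helper, and the option names "timeout" and "no_throw" are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a DefElem: a name plus optional value. */
typedef struct ToyOption
{
	const char *name;
	const char *value;			/* NULL when the option has no value */
} ToyOption;

/* Parsed result; field names are illustrative assumptions. */
typedef struct ToyWaitOptions
{
	const char *timeout;		/* raw text, NULL when not specified */
	bool		no_throw;
} ToyWaitOptions;

/*
 * Walk the option list once. Returns true on success. On failure returns
 * false and points *bad at the name of the unknown or duplicated option.
 */
static bool
toy_parse_wait_options(const ToyOption *opts, size_t nopts,
					   ToyWaitOptions *out, const char **bad)
{
	bool		timeout_seen = false;
	bool		no_throw_seen = false;

	out->timeout = NULL;
	out->no_throw = false;
	*bad = NULL;

	for (size_t i = 0; i < nopts; i++)
	{
		const char *name = opts[i].name;

		if (strcmp(name, "timeout") == 0)
		{
			if (timeout_seen)
			{
				*bad = name;	/* duplicate option */
				return false;
			}
			timeout_seen = true;
			out->timeout = opts[i].value;
		}
		else if (strcmp(name, "no_throw") == 0)
		{
			if (no_throw_seen)
			{
				*bad = name;	/* duplicate option */
				return false;
			}
			no_throw_seen = true;
			out->no_throw = true;
		}
		else
		{
			*bad = name;		/* unrecognized option */
			return false;
		}
	}
	return true;
}
```

The point of the pattern is that the grammar stays minimal while all validation (unknown names, duplicates) lives in one loop in C.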
Attachments:
v15-0001-Add-pairingheap_initialize-for-shared-memory-usag copy.patch
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v15 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.51.0
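For readers following along, the split between allocating and initializing in the patch above can be illustrated outside the tree. The sketch below is a standalone toy, not the PostgreSQL pairing heap: toyheap, toyheap_allocate(), toyheap_initialize(), FakeShmemState and dummy_cmp() are hypothetical names, and malloc() stands in for palloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for pairingheap_node and pairingheap. */
typedef struct toyheap_node
{
	struct toyheap_node *first_child;
	struct toyheap_node *next_sibling;
} toyheap_node;

typedef int (*toyheap_comparator) (const toyheap_node *a,
								   const toyheap_node *b, void *arg);

typedef struct toyheap
{
	toyheap_comparator compare;
	void	   *arg;
	toyheap_node *root;
} toyheap;

/* Initialize a heap whose storage the caller already owns. */
static void
toyheap_initialize(toyheap *heap, toyheap_comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate process-local memory, then reuse the in-place initializer. */
static toyheap *
toyheap_allocate(toyheap_comparator compare, void *arg)
{
	toyheap    *heap = malloc(sizeof(toyheap));

	if (heap == NULL)
		return NULL;
	toyheap_initialize(heap, compare, arg);
	return heap;
}

/* A struct meant to live in shared memory embeds the heap by value. */
typedef struct FakeShmemState
{
	toyheap		waiters;
} FakeShmemState;

/* Trivial comparator used only to have something to register. */
static int
dummy_cmp(const toyheap_node *a, const toyheap_node *b, void *arg)
{
	(void) a;
	(void) b;
	(void) arg;
	return 0;
}
```

With this split, a caller that owns a FakeShmemState can call toyheap_initialize(&state.waiters, dummy_cmp, NULL) without any allocation, which is the property the shared-memory use case needs.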
v15-0002-Add-infrastructure-for-efficient-LSN-waiting.patch
From 39857e15fac0a7b5b3105b730db4dfb271788cca Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 15:47:27 +0800
Subject: [PATCH v15] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 503 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 112 ++++
src/include/storage/lwlocklist.h | 1 +
7 files changed, 625 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..49dae7ac1c4
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-process array holding the awaited LSN of each process. The
+ * elements of that array are organized into two pairing heaps,
+ * replayWaitersHeap and flushWaitersHeap, which allow the least awaited
+ * LSN of each kind to be found very quickly.
+ *
+ * In addition, the least awaited LSNs are cached as minWaitedReplayLSN
+ * and minWaitedFlushLSN. A waiter publishes information about itself in
+ * shared memory and waits on its latch until it is woken up by the
+ * appropriate process, the standby is promoted, or the postmaster dies.
+ * Then, it removes the information about itself from shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast-path check minWaitedReplayLSN > replayLSN. If
+ * this check fails, it walks replayWaitersHeap and wakes up the backends
+ * whose awaited LSNs have been reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+static int waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ /* Initialize replay heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+ /* Initialize flush heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+ }
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ Assert(!procInfo->inReplayHeap);
+ pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ Assert(!procInfo->inFlushHeap);
+ pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+ {
+ pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+ {
+ pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the on-stack array of procs to wake up in wakeupWaiters(). It
+ * should be large enough for a single iteration to suffice in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i;
+ pairingheap *heap;
+
+ /* Select appropriate heap */
+ heap = (operation == WAIT_LSN_REPLAY) ?
+ &waitLSNState->replayWaitersHeap :
+ &waitLSNState->flushWaitersHeap;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached.
+ * Record process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(heap))
+ {
+ pairingheap_node *node = pairingheap_first(heap);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ else
+ procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(heap);
+
+ /* Update appropriate flag */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo->inReplayHeap = false;
+ else
+ procInfo->inFlushHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(operation);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose awaited LSNs have been reached.
+ * Since setting latches is time-consuming, we do this without holding
+ * WaitLSNLock. This is safe because procLatch is never freed, so at
+ * worst we set the latch of the wrong process (or of no process),
+ * which causes only a spurious wakeup.
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+ if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ /*
+ * Add our process to the replay waiters heap. It might happen that the
+ * target LSN gets replayed before we are added. The check at the
+ * beginning of the loop below handles that race condition.
+ */
+ addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+ for (;;)
+ {
+ int rc;
+ currentLSN = GetXLogReplayRecPtr(NULL);
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory replay heap. We might
+ * already be deleted by the startup process. The 'inReplayHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+ /* We can only wait for flush when we are not in recovery */
+ Assert(!RecoveryInProgress());
+
+ /* Quick exit if already flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ return;
+
+ /* Add to flush waiters */
+ addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+ /* Wait loop */
+ for (;;)
+ {
+ int rc;
+
+ /* Check if the waited LSN has been flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN flush")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory flush heap. We might
+ * already be deleted by the waker process. The 'inFlushHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+ return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..ada2a460ca4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay and flush waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+ WAIT_LSN_REPLAY, /* Waiting for replay on standby */
+ WAIT_LSN_FLUSH /* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Type-safe heap membership flags */
+ bool inReplayHeap; /* In replay waiters heap */
+ bool inFlushHeap; /* In flush waiters heap */
+
+ /* Separate heap nodes for type safety */
+ pairingheap_node replayHeapNode;
+ pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum replay LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedReplayLSN;
+
+ /*
+ * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap replayWaitersHeap;
+
+ /*
+ * The minimum flush LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after flushing
+ * WAL. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedFlushLSN;
+
+ /*
+ * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap flushWaitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
--
2.51.0
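The fast-path idea in the patch above (cache the minimum awaited LSN in an atomic and skip the locked heap walk when nothing can be woken) can be sketched as a standalone toy. This is not the xlogwait.c code: ToyWaitState and the toy_* functions are hypothetical names, a linear scan over a small array replaces the pairing heap, latches are reduced to a counter, and locking is omitted because the example is single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_WAITERS 8
#define TOY_NO_WAITERS UINT64_MAX	/* sentinel: nobody is waiting */

typedef struct ToyWaitState
{
	_Atomic uint64_t minWaitedLSN;	/* cached minimum awaited LSN */
	bool		waiting[TOY_MAX_WAITERS];
	uint64_t	waitLSN[TOY_MAX_WAITERS];
	int			slowPathCalls;	/* instrumentation for this example only */
} ToyWaitState;

static void
toy_init(ToyWaitState *st)
{
	atomic_init(&st->minWaitedLSN, TOY_NO_WAITERS);
	for (int i = 0; i < TOY_MAX_WAITERS; i++)
	{
		st->waiting[i] = false;
		st->waitLSN[i] = 0;
	}
	st->slowPathCalls = 0;
}

/* Recompute the cached minimum (a heap would give this in O(1)). */
static void
toy_update_min(ToyWaitState *st)
{
	uint64_t	min = TOY_NO_WAITERS;

	for (int i = 0; i < TOY_MAX_WAITERS; i++)
	{
		if (st->waiting[i] && st->waitLSN[i] < min)
			min = st->waitLSN[i];
	}
	atomic_store(&st->minWaitedLSN, min);
}

/* Register a waiter in the given slot and refresh the cached minimum. */
static void
toy_add_waiter(ToyWaitState *st, int slot, uint64_t lsn)
{
	assert(slot >= 0 && slot < TOY_MAX_WAITERS && !st->waiting[slot]);
	st->waiting[slot] = true;
	st->waitLSN[slot] = lsn;
	toy_update_min(st);
}

/* Release every waiter whose LSN is reached; returns how many were released. */
static int
toy_wakeup(ToyWaitState *st, uint64_t currentLSN)
{
	int			woken = 0;

	/* Fast path: the smallest awaited LSN is still ahead of currentLSN. */
	if (atomic_load(&st->minWaitedLSN) > currentLSN)
		return 0;

	st->slowPathCalls++;
	for (int i = 0; i < TOY_MAX_WAITERS; i++)
	{
		if (st->waiting[i] && st->waitLSN[i] <= currentLSN)
		{
			st->waiting[i] = false;
			woken++;
		}
	}
	toy_update_min(st);
	return woken;
}
```

Initializing the cached minimum to the maximum 64-bit value means the fast-path comparison is always true while no one waits, so the common case costs one atomic read per call.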
v15-0003-Implement-WAIT-FOR-command.patch
From 72b1c2063710693b1976268e8be99a74a8533956 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 16:03:49 +0800
Subject: [PATCH v15] Implement WAIT FOR command
WAIT FOR is to be used on a standby and waits for a specific WAL
location to be replayed. This command is useful when the user makes
some data changes on the primary and needs a guarantee that these
changes are visible on the standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 24 +-
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 ++++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/access/xlogwait.h | 3 +-
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 293 ++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 3 +
23 files changed, 948 insertions(+), 14 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..8df1f2ab953
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ [WITH] ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <literal>NO_THROW</literal> is specified, the outcome is reported
+ as the returned status instead of an error.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specifies that no error is thrown in the case of a timeout or when
+ running on the primary. In this case, the result status can be obtained
+ from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on the standby.
+ That is, after this command completes successfully, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful for achieving
+ read-your-writes consistency while using an asynchronous replica for reads
+ and the primary for writes. In that case, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from a standby and <parameter>lsn</parameter>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+ Then the application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from the primary. After that, the
+ changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/0306EE20 to be replayed; current replay LSN 0/0306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example using <command>WAIT FOR</command> with the
+ <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Clean up waiting for LSN, if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..f848ac8a77d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+ WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 49dae7ac1c4..2f5f8eaf583 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -364,9 +364,10 @@ WaitLSNCleanup(void)
* or replica got promoted before the target LSN replayed.
*/
WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
{
XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
/* Shouldn't be called when shmem isn't initialized */
@@ -375,6 +376,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
/* Should have a valid proc number */
Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+ if (timeout > 0) {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
/*
* Add our process to the replay waiters heap. It might happen that
* target LSN gets replayed before we do. Another check at the beginning
@@ -385,6 +391,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
for (;;)
{
int rc;
+ long delay_ms = 0;
currentLSN = GetXLogReplayRecPtr(NULL);
/* Recheck that recovery is still in-progress */
@@ -407,9 +414,16 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
break;
}
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
CHECK_FOR_INTERRUPTS();
- rc = WaitLatch(MyLatch, wake_events, -1,
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
/*
@@ -433,6 +447,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
*/
deleteLSNWaiter(WAIT_LSN_REPLAY);
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
return WAIT_LSN_RESULT_SUCCESS;
}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We should first make sure we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records,
+ * implying a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. Then we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dc0c2886674..c9e0738724b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ opt_with '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Clean up waiting for LSN, if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index ada2a460ca4..28aea61f6a2 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
* wait */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
} WaitLSNResult;
/*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
extern void WaitForLSNFlush(XLogRecPtr targetLSN);
#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4e445fe0cd7..75c41ad0cb9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..62fdc7cd06c
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,293 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "new data is visible on standby after WAIT FOR");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be far enough ahead that WAL records issued by a
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters. We start 5 background
+# psql sessions, each waiting for a corresponding insertion. When the wait
+# finishes, the log_count() function logs a message if at least as many rows
+# are visible as expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
WalCloseMethod
WalCompression
--
2.51.0
I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:
1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';
and I don't see why you need two ways to specify an option list.
So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.
So why not do just
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
which keeps options 1 and 3 of the list above.
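To make the effect concrete, here is a toy sketch (Python, not the actual bison grammar) of which of the three statements above would be accepted once WITH is mandatory before the option list. The regular expression and the accepted() helper are made up for illustration only; the real rules live in gram.y and accept far more than this.

```python
import re

# Toy model only: a regular expression standing in for the proposed
# opt_wait_with_clause production. It is NOT the PostgreSQL parser.
WAIT_FOR = re.compile(
    r"""^\s*WAIT\s+FOR\s+LSN\s+'[^']*'   # WAIT FOR LSN followed by a quoted literal
        (?:\s+WITH\s*\([^()]+\))?         # optional option list, WITH required before it
        \s*;?\s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def accepted(statement: str) -> bool:
    """Return True if the toy grammar accepts the statement."""
    return WAIT_FOR.match(statement) is not None


if __name__ == "__main__":
    for stmt in (
        "WAIT FOR LSN '123/456' WITH (opt = val);",  # form 1
        "WAIT FOR LSN '123/456' (opt = val);",       # form 2
        "WAIT FOR LSN '123/456';",                   # form 3
    ):
        print(accepted(stmt), stmt)
```

Under this model only forms 1 and 3 pass; form 2 is rejected because the parenthesized list has no WITH in front of it.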
Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.
[1]: /messages/by-id/202510101352.vvp4p3p2dblu@alvherre.pgsql
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)
Hi,
Thank you for the grammar review and the clear recommendation.
On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:
1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';
and I don't see why you need two ways to specify an option list.
I agree; unnecessary choices are confusing.
So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.
So why not do just
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
which keeps options 1 and 3 of the list above.
Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.
Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.
Yeah, since our grammar requires '(' immediately after WITH, the
lookahead mechanism will keep it as a regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
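As a rough illustration of that lookahead step, the sketch below models it over a list of token names. It assumes, as noted above, that WITH is relabeled as WITH_LA only when the next token is TIME or ORDINALITY; relabel_with() and the string tokens are hypothetical stand-ins, not the code in parser.c.

```python
from typing import List


def relabel_with(tokens: List[str]) -> List[str]:
    """Toy one-token lookahead: WITH becomes WITH_LA before TIME or ORDINALITY."""
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "WITH" and nxt in ("TIME", "ORDINALITY"):
            out.append("WITH_LA")  # special token, never matched by our rule
        else:
            out.append(tok)        # everything else passes through unchanged
    return out


if __name__ == "__main__":
    # WITH followed by '(' stays a plain WITH, so the option list rule applies.
    print(relabel_with(["WAIT", "FOR", "LSN", "SCONST", "WITH", "(", "IDENT", ")"]))
    # WITH followed by TIME is relabeled, which the WAIT FOR rule cannot match.
    print(relabel_with(["WAIT", "FOR", "LSN", "SCONST", "WITH", "TIME"]))
```

So in this model a rule written as WITH '(' ... ')' only ever sees the plain WITH token.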
Best,
Xuneng
Attachments:
v16-0001-Add-pairingheap_initialize-for-shared-memory-usag copy.patch (application/octet-stream)
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v16 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.51.0
v16-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 38971b2448786de5f58ba9be088d4e7e8fc11987 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 16:03:49 +0800
Subject: [PATCH v16 3/3] Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 24 +-
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 +++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/access/xlogwait.h | 3 +-
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 301 ++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 3 +
23 files changed, 956 insertions(+), 14 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <literal>NO_THROW</literal> is specified, the command instead
+ reports the outcome through its return value.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be obtained from
+ the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on the standby.
+ That is, after this command completes successfully, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an asynchronous replica for reads
+ and the primary for writes. In that case, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from a standby and <parameter>lsn</parameter>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from the primary. After that, the
+ changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example using <command>WAIT FOR</command> with
+ the <literal>NO_THROW</literal> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery ended before reaching the target LSN.
+ */
+ WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..f848ac8a77d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+ WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 49dae7ac1c4..2f5f8eaf583 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -364,9 +364,10 @@ WaitLSNCleanup(void)
* or replica got promoted before the target LSN replayed.
*/
WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
{
XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
/* Shouldn't be called when shmem isn't initialized */
@@ -375,6 +376,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
/* Should have a valid proc number */
Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+ if (timeout > 0) {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
/*
* Add our process to the replay waiters heap. It might happen that
* target LSN gets replayed before we do. Another check at the beginning
@@ -385,6 +391,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
for (;;)
{
int rc;
+ long delay_ms = 0;
currentLSN = GetXLogReplayRecPtr(NULL);
/* Recheck that recovery is still in-progress */
@@ -407,9 +414,16 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
break;
}
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
CHECK_FOR_INTERRUPTS();
- rc = WaitLatch(MyLatch, wake_events, -1,
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
/*
@@ -433,6 +447,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
*/
deleteLSNWaiter(WAIT_LSN_REPLAY);
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
return WAIT_LSN_RESULT_SUCCESS;
}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We should first make sure that
+ * we don't hold a snapshot and, correspondingly, that our MyProc->xmin is
+ * invalid. Otherwise, our snapshot could prevent the replay of WAL
+ * records, implying a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check that there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. With that, we should be
+ * done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dc0c2886674..bec885ea73e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index ada2a460ca4..28aea61f6a2 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
* wait */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
} WaitLSNResult;
/*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
extern void WaitForLSNFlush(XLogRecPtr targetLSN);
#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4e445fe0cd7..75c41ad0cb9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..cc709670e09
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,301 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby sees the new rows after WAIT FOR");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be well ahead of the current one, so that WAL records
+# issued by a concurrent autovacuum cannot make it reachable.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+ stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+ "get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, a stored function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
WalCloseMethod
WalCompression
--
2.51.0
v16-0002-Add-infrastructure-for-efficient-LSN-waiting.patch
From 39857e15fac0a7b5b3105b730db4dfb271788cca Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 15:47:27 +0800
Subject: [PATCH v16 2/3] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
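The fast-path idea described above can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: ToyWaitState and the toy_* functions are hypothetical names, a small unsorted array stands in for the pairing heap, and no locking is shown (the patch protects updates with WaitLSNLock and only reads the cached minimum lock-free).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_MAX_WAITERS 8
#define TOY_NO_WAITER   UINT64_MAX	/* sentinel: nobody is waiting */

/* Hypothetical stand-in for the shared state; not the patch's WaitLSNState. */
typedef struct ToyWaitState
{
	uint64_t	wait_lsn[TOY_MAX_WAITERS];	/* unsorted waiter targets */
	int			nwaiters;
	_Atomic uint64_t min_waited_lsn;	/* cached minimum target */
} ToyWaitState;

static void
toy_init(ToyWaitState *s)
{
	s->nwaiters = 0;
	atomic_init(&s->min_waited_lsn, TOY_NO_WAITER);
}

/* Recompute the cached minimum after the set of waiters has changed. */
static void
toy_update_min(ToyWaitState *s)
{
	uint64_t	min = TOY_NO_WAITER;

	for (int i = 0; i < s->nwaiters; i++)
		if (s->wait_lsn[i] < min)
			min = s->wait_lsn[i];
	atomic_store(&s->min_waited_lsn, min);
}

static void
toy_add_waiter(ToyWaitState *s, uint64_t lsn)
{
	assert(s->nwaiters < TOY_MAX_WAITERS);
	s->wait_lsn[s->nwaiters++] = lsn;
	toy_update_min(s);
}

/* Returns the number of waiters whose target LSN has been reached. */
static int
toy_wakeup(ToyWaitState *s, uint64_t current_lsn)
{
	int			woken = 0;

	/* Fast path: a single atomic read, no scan of the waiters. */
	if (atomic_load(&s->min_waited_lsn) > current_lsn)
		return 0;

	/* Slow path: drop every waiter whose target has been reached. */
	for (int i = 0; i < s->nwaiters;)
	{
		if (s->wait_lsn[i] <= current_lsn)
		{
			/* Overwrite with the last entry and re-examine slot i. */
			s->wait_lsn[i] = s->wait_lsn[--s->nwaiters];
			woken++;
		}
		else
			i++;
	}
	toy_update_min(s);
	return woken;
}
```

With no waiters the cached minimum stays at the sentinel, so the common case after each replayed or flushed record costs one atomic load and one comparison.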
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 503 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 112 ++++
src/include/storage/lwlocklist.h | 1 +
7 files changed, 625 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..49dae7ac1c4
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the appropriate
+ * process, the standby is promoted, or the postmaster dies. Then, it
+ * cleans up information about itself in the shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast-path check minWaitedLSN > replayLSN. If this
+ * check is negative, it checks waitersHeap and wakes up the backends
+ * whose awaited LSNs have been reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+static int waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ /* Initialize replay heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+ /* Initialize flush heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for replay waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Comparison function for flush waiters heaps. Waiting processes are
+ * ordered by LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+ }
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ Assert(!procInfo->inReplayHeap);
+ pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ Assert(!procInfo->inFlushHeap);
+ pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+ {
+ pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+ {
+ pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by wakeupWaiters(), allocated
+ * on the stack. It should be large enough for a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i;
+ pairingheap *heap;
+
+ /* Select appropriate heap */
+ heap = (operation == WAIT_LSN_REPLAY) ?
+ &waitLSNState->replayWaitersHeap :
+ &waitLSNState->flushWaitersHeap;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached.
+ * Record process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(heap))
+ {
+ pairingheap_node *node = pairingheap_first(heap);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ else
+ procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(heap);
+
+ /* Update appropriate flag */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo->inReplayHeap = false;
+ else
+ procInfo->inFlushHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(operation);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs are already reached.
+ * Since setting latches is time-consuming, we do this outside of
+ * WaitLSNLock. This is actually fine because procLatch isn't ever
+ * freed, so at worst we set the latch of the wrong process (or of
+ * no process).
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+ if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if not run in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ /*
+ * Add our process to the replay waiters heap. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+ for (;;)
+ {
+ int rc;
+ currentLSN = GetXLogReplayRecPtr(NULL);
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory replay heap. We might
+ * already be deleted by the startup process. The 'inReplayHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+ /* We can only wait for flush when we are not in recovery */
+ Assert(!RecoveryInProgress());
+
+ /* Quick exit if already flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ return;
+
+ /* Add to flush waiters */
+ addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+ /* Wait loop */
+ for (;;)
+ {
+ int rc;
+
+ /* Check if the waited LSN has been flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN flush")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory flush heap. We might
+ * already be deleted by the waker process. The 'inFlushHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+ return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..ada2a460ca4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN waiting routines (replay and flush).
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+ WAIT_LSN_REPLAY, /* Waiting for replay on standby */
+ WAIT_LSN_FLUSH /* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Type-safe heap membership flags */
+ bool inReplayHeap; /* In replay waiters heap */
+ bool inFlushHeap; /* In flush waiters heap */
+
+ /* Separate heap nodes for type safety */
+ pairingheap_node replayHeapNode;
+ pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum replay LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedReplayLSN;
+
+ /*
+ * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap replayWaitersHeap;
+
+ /*
+ * The minimum flush LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after flushing
+ * WAL. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedFlushLSN;
+
+ /*
+ * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap flushWaitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
--
2.51.0
Hi,
On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
Thank you for the grammar review and the clear recommendation.
On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:
1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';
and I don't see why you need two ways to specify an option list.
I agree with this as unnecessary choices are confusing.
So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.
So why not do just
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
which keeps options 1 and 3 of the list above.
Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.
Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.
Yeah, since we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as a regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
The filename of patch 1 was incorrect due to a copying mistake. This message just corrects it.
Best,
Xuneng
Attachments:
v16-0001-Add-pairingheap_initialize-for-shared-memory-usag.patch
From 48abb92fb33628f6eba5bbe865b3b19c24fb716d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v16 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
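The split between allocating and initializing can be sketched with a toy structure. This is a minimal illustration under stated assumptions: ToyHeap, ToyNode, ToySharedState and the toy_* functions are hypothetical names, and malloc() stands in for palloc(). Only the shape of the refactoring mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct ToyNode ToyNode;
typedef int (*toy_comparator) (const ToyNode *a, const ToyNode *b, void *arg);

struct ToyNode
{
	ToyNode    *first_child;
	ToyNode    *next_sibling;
};

typedef struct ToyHeap
{
	toy_comparator compare;
	void	   *arg;
	ToyNode    *root;
} ToyHeap;

/* Initialize a heap whose storage the caller already owns. */
static void
toy_heap_initialize(ToyHeap *heap, toy_comparator compare, void *arg)
{
	assert(heap != NULL);
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate from process-local memory, then reuse the in-place initializer. */
static ToyHeap *
toy_heap_allocate(toy_comparator compare, void *arg)
{
	ToyHeap    *heap = malloc(sizeof(ToyHeap));

	if (heap == NULL)
		return NULL;
	toy_heap_initialize(heap, compare, arg);
	return heap;
}

/* A heap embedded by value, as it would be inside a shared memory struct. */
typedef struct ToySharedState
{
	int			nwaiters;
	ToyHeap		waiters;
} ToySharedState;

/* Placeholder comparator so the sketch compiles; it treats all nodes as equal. */
static int
toy_cmp(const ToyNode *a, const ToyNode *b, void *arg)
{
	(void) a;
	(void) b;
	(void) arg;
	return 0;
}
```

A caller that embeds the heap would run toy_heap_initialize(&state->waiters, toy_cmp, NULL) on the embedded member, so the heap header lives wherever the enclosing struct lives.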
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.51.0
v16-0003-Implement-WAIT-FOR-command.patch
From 38971b2448786de5f58ba9be088d4e7e8fc11987 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 16:03:49 +0800
Subject: [PATCH v16 3/3] Implement WAIT FOR command
WAIT FOR is to be used on a standby and waits for a specific WAL
location to be replayed. This command is useful when the user makes
some data changes on the primary and needs a guarantee that these
changes are visible on the standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/access/transam/xlogwait.c | 24 +-
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 +++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/access/xlogwait.h | 3 +-
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 301 ++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 3 +
23 files changed, 956 insertions(+), 14 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> may be given as an integer number of
+ milliseconds. It may also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specifies that no error is thrown in the case of a timeout or when
+ running on the primary. In this case the result status can be obtained
+ from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on the standby.
+ That is, after this command completes, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an async replica for reads and
+ primary for writes. In that case, the <acronym>lsn</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ <command>WAIT FOR</command> command should be called on standby.
+ If a user runs <command>WAIT FOR</command> on primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after the
+ changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example, using <command>WAIT FOR</command> with the
+ <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..b5e07a724f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeupReplay(InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 52ff4d119e6..f848ac8a77d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+ WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 49dae7ac1c4..2f5f8eaf583 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -364,9 +364,10 @@ WaitLSNCleanup(void)
* or replica got promoted before the target LSN replayed.
*/
WaitLSNResult
-WaitForLSNReplay(XLogRecPtr targetLSN)
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
{
XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
/* Shouldn't be called when shmem isn't initialized */
@@ -375,6 +376,11 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
/* Should have a valid proc number */
Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+ if (timeout > 0) {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
/*
* Add our process to the replay waiters heap. It might happen that
* target LSN gets replayed before we do. Another check at the beginning
@@ -385,6 +391,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
for (;;)
{
int rc;
+ long delay_ms = 0;
currentLSN = GetXLogReplayRecPtr(NULL);
/* Recheck that recovery is still in-progress */
@@ -407,9 +414,16 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
break;
}
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
CHECK_FOR_INTERRUPTS();
- rc = WaitLatch(MyLatch, wake_events, -1,
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
/*
@@ -433,6 +447,12 @@ WaitForLSNReplay(XLogRecPtr targetLSN)
*/
deleteLSNWaiter(WAIT_LSN_REPLAY);
+ /*
+ * If we didn't reach the target LSN, we must have exited due to timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
return WAIT_LSN_RESULT_SUCCESS;
}
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..44db2d71164
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round
+ * into range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. First, make sure that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why
+ * WAIT FOR is a command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot if any. Then we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSNReplay(lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dc0c2886674..bec885ea73e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index ada2a460ca4..28aea61f6a2 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -27,6 +27,7 @@ typedef enum
WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
* wait */
+ WAIT_LSN_RESULT_TIMEOUT, /* Timeout occurred */
} WaitLSNResult;
/*
@@ -106,7 +107,7 @@ extern void WaitLSNShmemInit(void);
extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
-extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout);
extern void WaitForLSNFlush(XLogRecPtr targetLSN);
#endif /* XLOG_WAIT_H */
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4e445fe0cd7..75c41ad0cb9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..cc709670e09
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,301 @@
+# Checks waiting for LSN replay on standby using the
+# WAIT FOR LSN command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current replay LSN on standby is at least as big as the LSN
+# we observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby sees the new rows after WAIT FOR");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be far enough ahead that WAL records issued by a
+# concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/,
+ "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/, "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+ stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+ "get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok( $stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout", "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters. We start 5 background
+# psql sessions, each waiting for a corresponding insertion. When the wait
+# finishes, the log_count() function logs a message if at least as many rows
+# are visible as expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN, then promote. Check the log for the relevant
+# error message. Also, check that waiting for an already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby is promoted no earlier than the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit, the command can be sent to a session
+# that is already closed. So \q is in the initial script; here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..a2c93c0ef4e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3263,6 +3263,9 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNResult
+WaitLSNState
WaitPMResult
WalCloseMethod
WalCompression
--
2.51.0
Attachment: v16-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 39857e15fac0a7b5b3105b730db4dfb271788cca Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 15 Oct 2025 15:47:27 +0800
Subject: [PATCH v16 2/3] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion:
https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
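A note on the fast path described in the commit message above. The sketch
below is a minimal single-process model of the idea, not the code from this
patch. It assumes a plain array in place of the pairing heap and C11 atomics
in place of pg_atomic_*, and it leaves out WaitLSNLock and latches entirely.
The names Waiter, add_waiter(), wake_waiters() and slow_path_calls are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define NO_WAITERS  UINT64_MAX	/* sentinel: nobody is waiting */

/* Hypothetical stand-in for a per-backend slot. */
typedef struct
{
	uint64_t	wait_lsn;
	int			in_use;
	int			woken;
} Waiter;

static Waiter waiters[MAX_WAITERS];

/* Cached minimum awaited LSN, readable without scanning the waiters. */
static _Atomic uint64_t min_waited_lsn = NO_WAITERS;

/* Counts how often the waker had to look at the waiters at all. */
static int	slow_path_calls = 0;

/* Recompute the cached minimum (a linear scan here, heap top in the patch). */
static void
update_min(void)
{
	uint64_t	min = NO_WAITERS;

	for (int i = 0; i < MAX_WAITERS; i++)
		if (waiters[i].in_use && waiters[i].wait_lsn < min)
			min = waiters[i].wait_lsn;
	atomic_store(&min_waited_lsn, min);
}

/* Register a waiter for the given LSN; returns its slot, or -1 if full. */
static int
add_waiter(uint64_t lsn)
{
	assert(lsn != NO_WAITERS);
	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (!waiters[i].in_use)
		{
			waiters[i].wait_lsn = lsn;
			waiters[i].in_use = 1;
			waiters[i].woken = 0;
			update_min();
			return i;
		}
	}
	return -1;
}

/* Called after WAL progress reaches current_lsn; returns waiters woken. */
static int
wake_waiters(uint64_t current_lsn)
{
	int			n = 0;

	/* Fast path: even the smallest awaited LSN is still ahead of us. */
	if (atomic_load(&min_waited_lsn) > current_lsn)
		return 0;

	slow_path_calls++;
	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (waiters[i].in_use && waiters[i].wait_lsn <= current_lsn)
		{
			waiters[i].in_use = 0;
			waiters[i].woken = 1;	/* the patch sets the proc latch here */
			n++;
		}
	}
	update_min();
	return n;
}
```

For example, with waiters registered at LSNs 300, 100 and 200, a call to
wake_waiters(50) returns 0 after a single atomic read, while
wake_waiters(150) wakes one waiter and moves the cached minimum to 200.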
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 503 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 112 ++++
src/include/storage/lwlocklist.h | 1 +
7 files changed, 625 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..49dae7ac1c4
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into pairing heaps (replayWaitersHeap and flushWaitersHeap), which
+ * allow for very fast finding of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN of each heap is cached as
+ * minWaitedReplayLSN or minWaitedFlushLSN. A waiter publishes its
+ * information in shared memory and waits on its latch until it is woken
+ * up by the appropriate process, the standby is promoted, or the
+ * postmaster dies. Then it cleans up its information in shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast path check minWaitedReplayLSN > replayLSN. If
+ * that condition does not hold, it examines replayWaitersHeap and wakes
+ * up the backends whose awaited LSNs have been reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+static int waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ /* Initialize replay heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedReplayLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->replayWaitersHeap, waitlsn_replay_cmp, (void *)(uintptr_t)WAIT_LSN_REPLAY);
+
+ /* Initialize flush heap and tracking */
+ pg_atomic_init_u64(&waitLSNState->minWaitedFlushLSN, PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->flushWaitersHeap, waitlsn_flush_cmp, (void *)(uintptr_t)WAIT_LSN_FLUSH);
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for the replay waiters heap. The result is inverted
+ * so that the waiter with the smallest LSN ends up at the top of the heap.
+ */
+static int
+waitlsn_replay_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, replayHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Comparison function for the flush waiters heap. The result is inverted
+ * so that the waiter with the smallest LSN ends up at the top of the heap.
+ */
+static int
+waitlsn_flush_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, flushHeapNode, b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNOperation operation)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ if (!pairingheap_is_empty(&waitLSNState->replayWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->replayWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedReplayLSN, minWaitedLSN);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ if (!pairingheap_is_empty(&waitLSNState->flushWaitersHeap))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->flushWaitersHeap);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedFlushLSN, minWaitedLSN);
+ }
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ if (operation == WAIT_LSN_REPLAY)
+ {
+ Assert(!procInfo->inReplayHeap);
+ pairingheap_add(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else /* WAIT_LSN_FLUSH */
+ {
+ Assert(!procInfo->inFlushHeap);
+ pairingheap_add(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = true;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNOperation operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (operation == WAIT_LSN_REPLAY && procInfo->inReplayHeap)
+ {
+ pairingheap_remove(&waitLSNState->replayWaitersHeap, &procInfo->replayHeapNode);
+ procInfo->inReplayHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_REPLAY);
+ }
+ else if (operation == WAIT_LSN_FLUSH && procInfo->inFlushHeap)
+ {
+ pairingheap_remove(&waitLSNState->flushWaitersHeap, &procInfo->flushHeapNode);
+ procInfo->inFlushHeap = false;
+ updateMinWaitedLSN(WAIT_LSN_FLUSH);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the stack-allocated array of procs that wakeupWaiters() wakes up
+ * per iteration. A single iteration should suffice in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNOperation operation, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i;
+ pairingheap *heap;
+
+ /* Select appropriate heap */
+ heap = (operation == WAIT_LSN_REPLAY) ?
+ &waitLSNState->replayWaitersHeap :
+ &waitLSNState->flushWaitersHeap;
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached.
+ * Record process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(heap))
+ {
+ pairingheap_node *node = pairingheap_first(heap);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo = pairingheap_container(WaitLSNProcInfo, replayHeapNode, node);
+ else
+ procInfo = pairingheap_container(WaitLSNProcInfo, flushHeapNode, node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(heap);
+
+ /* Update appropriate flag */
+ if (operation == WAIT_LSN_REPLAY)
+ procInfo->inReplayHeap = false;
+ else
+ procInfo->inFlushHeap = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(operation);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose awaited LSNs have been reached.
+ * Since setting latches is relatively time-consuming, we do this
+ * outside of WaitLSNLock. This is safe because procLatch is never
+ * freed, so at worst we set the latch of the wrong process (or of
+ * no process).
+ */
+ for (i = 0; i < numWakeUpProcs; i++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[i])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for replay LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupReplay(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_REPLAY, currentLSN);
+}
+
+/*
+ * Wake up processes waiting for flush LSN to reach currentLSN
+ */
+void
+WaitLSNWakeupFlush(XLogRecPtr currentLSN)
+{
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedFlushLSN) > currentLSN)
+ return;
+
+ wakeupWaiters(WAIT_LSN_FLUSH, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only possible
+ * to get a false positive. But that will be eliminated by a recheck
+ * inside deleteLSNWaiter().
+ */
+ if (waitLSNState->procInfos[MyProcNumber].inReplayHeap)
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+ if (waitLSNState->procInfos[MyProcNumber].inFlushHeap)
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was replayed.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN replayed.
+ */
+WaitLSNResult
+WaitForLSNReplay(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ /*
+ * Add our process to the replay waiters heap. It might happen that
+ * target LSN gets replayed before we do. Another check at the beginning
+ * of the loop below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN, WAIT_LSN_REPLAY);
+
+ for (;;)
+ {
+ int rc;
+ currentLSN = GetXLogReplayRecPtr(NULL);
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * replayed. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been replayed */
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN replay")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory replay heap. We might
+ * already be deleted by the startup process. The 'inReplayHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_REPLAY);
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
+
+/*
+ * Wait until targetLSN has been flushed on a primary server.
+ * Returns only after the condition is satisfied or on FATAL exit.
+ */
+void
+WaitForLSNFlush(XLogRecPtr targetLSN)
+{
+ XLogRecPtr currentLSN;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends + NUM_AUXILIARY_PROCS);
+
+ /* We can only wait for flush when we are not in recovery */
+ Assert(!RecoveryInProgress());
+
+ /* Quick exit if already flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ return;
+
+ /* Add to flush waiters */
+ addLSNWaiter(targetLSN, WAIT_LSN_FLUSH);
+
+ /* Wait loop */
+ for (;;)
+ {
+ int rc;
+
+ /* Check if the waited LSN has been flushed */
+ currentLSN = GetFlushRecPtr(NULL);
+ if (targetLSN <= currentLSN)
+ break;
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, -1,
+ WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN flush")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory flush heap. We might
+ * already be deleted by the waker process. The 'inFlushHeap' flag prevents
+ * us from the double deletion.
+ */
+ deleteLSNWaiter(WAIT_LSN_FLUSH);
+
+ return;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..ada2a460ca4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSNReplay().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+} WaitLSNResult;
+
+/*
+ * Wait operation types for LSN waiting facility.
+ */
+typedef enum WaitLSNOperation
+{
+ WAIT_LSN_REPLAY, /* Waiting for replay on standby */
+ WAIT_LSN_FLUSH /* Waiting for flush on primary */
+} WaitLSNOperation;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Type-safe heap membership flags */
+ bool inReplayHeap; /* In replay waiters heap */
+ bool inFlushHeap; /* In flush waiters heap */
+
+ /* Separate heap nodes for type safety */
+ pairingheap_node replayHeapNode;
+ pairingheap_node flushHeapNode;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum replay LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedReplayLSN;
+
+ /*
+ * A pairing heap of replay waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap replayWaitersHeap;
+
+ /*
+ * The minimum flush LSN value some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after flushing
+ * WAL. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedFlushLSN;
+
+ /*
+ * A pairing heap of flush waiting processes ordered by LSN values (least LSN is
+ * on top). Protected by WaitLSNLock.
+ */
+ pairingheap flushWaitersHeap;
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeupReplay(XLogRecPtr currentLSN);
+extern void WaitLSNWakeupFlush(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSNReplay(XLogRecPtr targetLSN);
+extern void WaitForLSNFlush(XLogRecPtr targetLSN);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
--
2.51.0
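For readers skimming the header above, the lock-free fast path described in the minWaitedReplayLSN and minWaitedFlushLSN comments can be illustrated in isolation. The following is only a minimal, self-contained C11 sketch, not code from the patch: DemoLSN, DEMO_NO_WAITERS, demo_publish_min() and demo_needs_wakeup() are hypothetical names, and using the maximum 64-bit value as the "no waiters" sentinel is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr. */
typedef uint64_t DemoLSN;

/* Assumed sentinel meaning "nobody is waiting". */
#define DEMO_NO_WAITERS UINT64_MAX

/* Hypothetical stand-in for the shared minimum waited LSN. */
_Atomic DemoLSN demo_min_waited_lsn = DEMO_NO_WAITERS;

/*
 * Publish the smallest LSN any waiter is interested in.  In the sketch this
 * is a plain atomic store; a real implementation would do it while holding
 * the lock that protects the waiters heap.
 */
void
demo_publish_min(DemoLSN target)
{
	atomic_store(&demo_min_waited_lsn, target);
}

/*
 * Cheap check performed after each replayed (or flushed) position: only if
 * it returns true is it worth taking the lock and walking the heap.
 */
bool
demo_needs_wakeup(DemoLSN reached)
{
	return reached >= atomic_load(&demo_min_waited_lsn);
}
```

With no waiters published, demo_needs_wakeup() is false for any realistic LSN, so the replaying or flushing process pays only for one atomic read. Once a waiter publishes its target, the check turns true as soon as that target is reached.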
Hi!
On Thu, Oct 16, 2025 at 10:12 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
Thank you for the grammar review and the clear recommendation.
On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:
1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';
and I don't see why you need two ways to specify an option list.
I agree with this as unnecessary choices are confusing.
So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.
So why not do just
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
which keeps options 1 and 3 of the list above.
Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.
Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.
Yeah, we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
The filename of patch 1 was incorrect due to a copying mistake. I've just corrected it.
Thank you for rebasing the patch.
I've revised it. Significant changes have been made to 0002, where
I reduced the code duplication. Also, I ran pgindent and pgperltidy
and made other small improvements.
Please, check.
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v17-0001-Add-pairingheap_initialize-for-shared-memory-usa.patch (application/octet-stream)
From 18a1a51c7f7a1bedb23169bbbe8974a9f803b82a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Thu, 9 Oct 2025 10:29:05 +0800
Subject: [PATCH v17 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.39.5 (Apple Git-154)
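The split between allocation and in-place initialization that 0001 introduces can be sketched outside PostgreSQL. The following is only an illustrative, self-contained C example under stated assumptions: every demo_* name is hypothetical, malloc() stands in for palloc(), and an ordinary struct on the stack stands in for a shared memory struct. The comparator follows the convention that a positive result means the first node belongs nearer the top.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical heap node; callers embed it in their own structs. */
typedef struct demo_node
{
	struct demo_node *first_child;
	struct demo_node *next_sibling;
	int			key;
} demo_node;

/* Returns > 0 if a should be nearer the top than b. */
typedef int (*demo_comparator) (const demo_node *a, const demo_node *b,
								void *arg);

typedef struct demo_heap
{
	demo_comparator compare;
	void	   *arg;
	demo_node  *root;
} demo_heap;

/* Initialize a heap whose storage the caller already owns. */
void
demo_heap_initialize(demo_heap *heap, demo_comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate process-local storage, then reuse the in-place initializer. */
demo_heap *
demo_heap_allocate(demo_comparator compare, void *arg)
{
	demo_heap  *heap = malloc(sizeof(demo_heap));

	if (heap != NULL)
		demo_heap_initialize(heap, compare, arg);
	return heap;
}

/* Link two heap roots; the winner of the comparison becomes the parent. */
static demo_node *
demo_merge(demo_heap *heap, demo_node *a, demo_node *b)
{
	if (a == NULL)
		return b;
	if (b == NULL)
		return a;
	if (heap->compare(a, b, heap->arg) < 0)
	{
		demo_node  *tmp = a;

		a = b;
		b = tmp;
	}
	b->next_sibling = a->first_child;
	a->first_child = b;
	return a;
}

void
demo_heap_add(demo_heap *heap, demo_node *node)
{
	node->first_child = NULL;
	node->next_sibling = NULL;
	heap->root = demo_merge(heap, heap->root, node);
}

/* Smallest key on top. */
int
demo_min_first(const demo_node *a, const demo_node *b, void *arg)
{
	(void) arg;
	return (a->key < b->key) - (a->key > b->key);
}

/* Stand-in for a shared struct: the heap is embedded, not pointed to. */
typedef struct demo_shared_state
{
	demo_heap	waiters;
} demo_shared_state;

/* Build an embedded heap from three keys and report the top one. */
int
demo_smallest_key(void)
{
	demo_shared_state state;
	demo_node	n1 = {NULL, NULL, 30};
	demo_node	n2 = {NULL, NULL, 10};
	demo_node	n3 = {NULL, NULL, 20};

	demo_heap_initialize(&state.waiters, demo_min_first, NULL);
	demo_heap_add(&state.waiters, &n1);
	demo_heap_add(&state.waiters, &n2);
	demo_heap_add(&state.waiters, &n3);
	return state.waiters.root->key;
}
```

The point of the split is that demo_heap_allocate() becomes a thin wrapper, so an embedded heap and a separately allocated one go through the same initialization code.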
v17-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From a5db333b5b5b9e0c0c27f6f2bfbad8c4cf327f9b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 23 Oct 2025 12:47:02 +0300
Subject: [PATCH v17 3/3] Implement WAIT FOR command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this aspect.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 +++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 302 ++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
21 files changed, 931 insertions(+), 11 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..b3fafb8b48c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing
+ an integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ When specified, no error is thrown in the case of a timeout or
+ when running on the primary. In this case the result status can be
+ obtained from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> is replayed on the standby.
+ That is, after this command completes, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an async replica for reads and
+ the primary for writes. In that case, the <acronym>LSN</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on a primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from a standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from primary. After that the
+ changes made on primary should be guaranteed to be visible on replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example using <command>WAIT FOR</command> with
+ the <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..7c3a3541221 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6225,6 +6226,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..0e51148110f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..67068a92dbf
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round into
+ * range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We must first make sure that we
+ * don't hold a snapshot and, correspondingly, that our MyProc->xmin is
+ * invalid. Otherwise, our snapshot could prevent the replay of WAL records,
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is a
+ * command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. After that, we should
+ * be done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSN(). Throw an appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dc0c2886674..bec885ea73e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16453,6 +16454,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17940,6 +17961,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18110,6 +18132,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18556,6 +18579,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18767,6 +18791,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Clean up any pending wait for LSN.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 08791b8f75e..07b2e2fa67b 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1163,10 +1163,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1743,7 +1744,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4e445fe0cd7..75c41ad0cb9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4384,4 +4384,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..9796a36a2f6
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,302 @@
+# Checks waiting for LSN replay on a standby using the
+# WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "new data is visible on standby after WAIT FOR");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+ "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# Test parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/, "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/,
+ "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+ stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+ "get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok($stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success", "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout",
+ "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok( $stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 5. Also, check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions, each waiting for a corresponding insertion. When the wait
+# is finished, the log_count() function logs a message if at least as many
+# rows are visible as expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 6. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure the standby is promoted no earlier than the primary insert LSN
+# we have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we sent \q with $psql_session->quit, the command could reach a session
+# that is already closed. So \q is in the initial script; here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 38d346a3691..d92cb2e6a71 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3269,6 +3269,7 @@ WaitLSNState
WaitLSNProcInfo
WaitLSNResult
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5 (Apple Git-154)
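
For readers skimming the patch above, the way ExecWaitStmt() turns a wait result into either a status row or an error can be summarized by the small standalone model below. This is only a sketch: the SketchWaitResult enum and sketch_wait_status() are hypothetical names made up for illustration and are not part of the patch. A NULL return stands for "the command raises an error instead of returning a row".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the wait results handled by ExecWaitStmt(). */
typedef enum
{
	SKETCH_WAIT_SUCCESS,
	SKETCH_WAIT_TIMEOUT,
	SKETCH_WAIT_NOT_IN_RECOVERY
} SketchWaitResult;

/*
 * Return the text of the single "status" column for a given wait result,
 * or NULL when the command would raise an error instead.  Success always
 * yields a row; timeout and not-in-recovery yield a row only with no_throw.
 */
const char *
sketch_wait_status(SketchWaitResult res, bool no_throw)
{
	switch (res)
	{
		case SKETCH_WAIT_SUCCESS:
			return "success";
		case SKETCH_WAIT_TIMEOUT:
			return no_throw ? "timeout" : NULL;
		case SKETCH_WAIT_NOT_IN_RECOVERY:
			return no_throw ? "not in recovery" : NULL;
	}
	return NULL;				/* unreachable for valid enum values */
}
```

The three strings are the ones the TAP test compares against; everything else in the sketch is an assumption made to keep it self-contained.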
Attachment: v17-0002-Add-infrastructure-for-efficient-LSN-waiting.patch
From 2d3e55c71e69e3cf39be10e42a57ad03ebc28217 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 23 Oct 2025 11:58:17 +0300
Subject: [PATCH v17 2/3] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
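
The fast-path idea described above (a heap of awaited LSNs plus a cached minimum) can be illustrated with the standalone sketch below. It is a simplified model under stated assumptions, not the code in xlogwait.c: it uses a plain array-based binary min-heap instead of a pairing heap, it has no locking or atomics, and all names (SketchWaiters, sketch_add(), sketch_wakeup()) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_WAITERS 64

typedef struct
{
	uint64_t	lsns[SKETCH_MAX_WAITERS];	/* binary min-heap of awaited LSNs */
	size_t		count;
	uint64_t	min_waited;		/* cached minimum; UINT64_MAX when empty */
} SketchWaiters;

static void
sketch_swap(uint64_t *a, uint64_t *b)
{
	uint64_t	tmp = *a;

	*a = *b;
	*b = tmp;
}

void
sketch_init(SketchWaiters *w)
{
	w->count = 0;
	w->min_waited = UINT64_MAX;
}

/* Register a waiter for the given LSN and refresh the cached minimum. */
void
sketch_add(SketchWaiters *w, uint64_t lsn)
{
	size_t		i;

	assert(w->count < SKETCH_MAX_WAITERS);
	i = w->count++;
	w->lsns[i] = lsn;
	while (i > 0 && w->lsns[(i - 1) / 2] > w->lsns[i])
	{
		sketch_swap(&w->lsns[(i - 1) / 2], &w->lsns[i]);
		i = (i - 1) / 2;
	}
	w->min_waited = w->lsns[0];
}

/* Remove the waiter with the smallest LSN and refresh the cached minimum. */
static void
sketch_pop(SketchWaiters *w)
{
	size_t		i = 0;

	w->lsns[0] = w->lsns[--w->count];
	for (;;)
	{
		size_t		l = 2 * i + 1;
		size_t		r = l + 1;
		size_t		m = i;

		if (l < w->count && w->lsns[l] < w->lsns[m])
			m = l;
		if (r < w->count && w->lsns[r] < w->lsns[m])
			m = r;
		if (m == i)
			break;
		sketch_swap(&w->lsns[i], &w->lsns[m]);
		i = m;
	}
	w->min_waited = (w->count > 0) ? w->lsns[0] : UINT64_MAX;
}

/*
 * Called after WAL has advanced to current_lsn.  Returns the number of
 * waiters released.  The first comparison is the fast path: if even the
 * smallest awaited LSN is still ahead, the heap is not touched at all.
 */
size_t
sketch_wakeup(SketchWaiters *w, uint64_t current_lsn)
{
	size_t		woken = 0;

	if (w->min_waited > current_lsn)
		return 0;
	while (w->count > 0 && w->lsns[0] <= current_lsn)
	{
		sketch_pop(w);
		woken++;
	}
	return woken;
}
```

In the sketch, a process that advances WAL pays only one integer comparison when no waiter is ready, which is the property the cached minimum is meant to provide.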
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
---
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 409 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 98 +++++
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 4 +
8 files changed, 521 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..8276c2f0947
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the appropriate
+ * process, the standby is promoted, or the postmaster dies. Then, it
+ * cleans up the information about itself in the shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast-path check minWaitedLSN > replayLSN. If this
+ * check is negative, it checks waitersHeap and wakes up the backends
+ * whose awaited LSNs have been reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ int i;
+
+ /* Initialize heaps and tracking */
+ for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+ }
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for LSN waiters heaps. Waiting processes are ordered by
+ * LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ int i = (uintptr_t) arg;
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNType operation)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN[i], minWaitedLSN);
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNType operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ Assert(!procInfo->inHeap[i]);
+ pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+ procInfo->inHeap[i] = true;
+ updateMinWaitedLSN(operation);
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNType operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (procInfo->inHeap[i])
+ {
+ pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+ procInfo->inHeap[i] = false;
+ updateMinWaitedLSN(operation);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take single iteration for most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached. Record
+ * process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
+
+ /* Update appropriate flag */
+ procInfo->inHeap[i] = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(operation);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes, whose waited LSNs are already reached.
+ * As the time consuming operations, we do this outside of
+ * WaitLSNLock. This is actually fine because procLatch isn't ever
+ * freed, so we just can potentially set the wrong process' (or no
+ * process') latch.
+ */
+ for (int j = 0; j < numWakeUpProcs; j++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for LSN to reach currentLSN
+ */
+void
+WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+ return;
+
+ wakeupWaiters(operation, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ int i;
+
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only
+ * possible to get a false positive. But that will be eliminated by a
+ * recheck inside deleteLSNWaiter().
+ */
+
+ for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
+ {
+ if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
+ deleteLSNWaiter((WaitLSNType) i);
+ }
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is reached, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was reached.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN reached.
+ */
+WaitLSNResult
+WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the waiters heap. It might happen that target LSN
+ * gets reached before we do. Another check at the beginning of the loop
+ * below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN, operation);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = -1;
+
+ if (operation == WAIT_LSN_TYPE_REPLAY)
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ else
+ currentLSN = GetFlushRecPtr(NULL);
+
+ /* Recheck that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but recheck if target LSN was already
+ * reached. See the comment regarding deleteLSNWaiter() below.
+ */
+ deleteLSNWaiter(operation);
+
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been reached */
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ (operation == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN")));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory heap. We might already be
+ * deleted by the startup process. The 'inHeap' flags prevent us from
+ * double deletion.
+ */
+ deleteLSNWaiter(operation);
+
+ /*
+ * If we didn't reach the target LSN, we must be exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..d7aad6d8be4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSN().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+ WAIT_LSN_RESULT_TIMEOUT /* Timeout occurred */
+} WaitLSNResult;
+
+/*
+ * LSN type for waiting facility.
+ */
+typedef enum WaitLSNType
+{
+ WAIT_LSN_TYPE_REPLAY = 0, /* Waiting for replay on standby */
+ WAIT_LSN_TYPE_FLUSH = 1, /* Waiting for flush on primary */
+ WAIT_LSN_TYPE_COUNT = 2
+} WaitLSNType;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Heap membership flags for LSN types */
+ bool inHeap[WAIT_LSN_TYPE_COUNT];
+
+ /* Heap nodes for LSN types */
+ pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN values some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN[WAIT_LSN_TYPE_COUNT];
+
+ /*
+ * Pairing heaps of waiting processes ordered by LSN values (least LSN
+ * is on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap[WAIT_LSN_TYPE_COUNT];
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN,
+ int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 377a7946585..38d346a3691 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3264,6 +3264,10 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNType
+WaitLSNState
+WaitLSNProcInfo
+WaitLSNResult
WaitPMResult
WalCloseMethod
WalCompression
--
2.39.5 (Apple Git-154)
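The wakeup path in xlogwait.c above combines a lock-free fast-path check against the cached minimum with a chunked wakeup loop. The following is a minimal single-threaded sketch of that control flow only. It assumes a toy model with no LWLock, no latches, and a sorted array standing in for the pairing heap. All names (toy_waitqueue, toy_wakeup, TOY_CHUNK) are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CHUNK 4             /* hypothetical; the patch uses 16 */
#define TOY_MAX_WAITERS 32

/* Toy stand-in for one waiters heap; entries are kept in ascending LSN order. */
typedef struct toy_waitqueue
{
    uint64_t    min_waited;     /* cached minimum, UINT64_MAX when empty */
    size_t      head;           /* index of the first pending waiter */
    size_t      count;          /* number of entries in lsn[] */
    uint64_t    lsn[TOY_MAX_WAITERS];
    int         woken[TOY_MAX_WAITERS]; /* 1 once the "latch" was set */
    int         passes;         /* chunks processed, for illustration */
} toy_waitqueue;

/* Recompute the cached minimum from the head of the queue. */
static void
toy_update_min(toy_waitqueue *q)
{
    q->min_waited = (q->head < q->count) ? q->lsn[q->head] : UINT64_MAX;
}

static void
toy_queue_init(toy_waitqueue *q, const uint64_t *sorted_lsns, size_t n)
{
    assert(n <= TOY_MAX_WAITERS);
    q->head = 0;
    q->count = n;
    q->passes = 0;
    for (size_t k = 0; k < n; k++)
    {
        q->lsn[k] = sorted_lsns[k];
        q->woken[k] = 0;
    }
    toy_update_min(q);
}

/* Wake every waiter whose target is <= current_lsn; returns how many. */
static size_t
toy_wakeup(toy_waitqueue *q, uint64_t current_lsn)
{
    size_t      total = 0;
    size_t      n;

    /* Fast path: nobody waits for an LSN this low. */
    if (q->min_waited > current_lsn)
        return 0;

    do
    {
        size_t      batch[TOY_CHUNK];

        /* In the real code this part runs under the lock. */
        n = 0;
        while (q->head < q->count && q->lsn[q->head] <= current_lsn)
        {
            batch[n++] = q->head++;
            if (n == TOY_CHUNK)
                break;
        }
        toy_update_min(q);

        /*
         * Deliver wakeups after the "locked" section. The loop uses its own
         * variable so that no index needed by the next chunk is overwritten.
         */
        for (size_t j = 0; j < n; j++)
            q->woken[batch[j]] = 1;

        total += n;
        q->passes++;
    } while (n == TOY_CHUNK);

    return total;
}
```

With ten waiters and a chunk size of four, a single call that reaches nine of them takes three passes, while a call below the cached minimum returns without touching the queue at all.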
Hi,
On Thu, Oct 23, 2025 at 6:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi!
On Thu, Oct 16, 2025 at 10:12 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
Thank you for the grammar review and the clear recommendation.
On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:
1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';
and I don't see why you need two ways to specify an option list.
I agree with this as unnecessary choices are confusing.
So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.
So why not do just
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
which keeps options 1 and 3 of the list above.
Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.
Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.
Yeah, we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
The filename of patch 1 is incorrect due to copying. Just correct it.
Thank you for rebasing the patch.
I've revised it. Significant changes have been made to 0002, where
I reduced the code duplication. Also, I ran pgindent and pgperltidy
and made other small improvements.
Please, check.
Thanks for updating the patch set!
Patch 2 looks more elegant after the revision. I’ll review them soon.
Best,
Xuneng
Hi, Alexander!
On Thu, Oct 23, 2025 at 8:58 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Patch 2 looks more elegant after the revision. I’ll review them soon.
I’ve made a few minor updates to the comments and docs in patches 2
and 3. The patch set LGTM now.
Best,
Xuneng
Attachments:
v18-0001-Add-pairingheap_initialize-for-shared-memory-usa.patch (application/octet-stream)
From b0ee110622dacd2d4769da6915580e9c3220c09f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:56:53 +0800
Subject: [PATCH v18 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.51.0
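The split between an allocating constructor and an in-place initializer in the patch above can be illustrated outside PostgreSQL. This is a minimal sketch, assuming a toy heap that supports only insertion and reading the root. The names (toy_heap, toy_heap_initialize, toy_state and so on) are hypothetical stand-ins, not the pairingheap API, and malloc() stands in for palloc(). The comparator is inverted in the same way as a smallest-LSN-first ordering requires: the node that compares greater ends up at the root, so returning 1 for the smaller LSN keeps the minimum on top.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-ins for a heap node and heap header (not the PostgreSQL types). */
typedef struct toy_node
{
    struct toy_node *first_child;
    struct toy_node *next_sibling;
} toy_node;

typedef int (*toy_comparator) (const toy_node *a, const toy_node *b, void *arg);

typedef struct toy_heap
{
    toy_comparator compare;
    void       *arg;
    toy_node   *root;
} toy_heap;

/* Initialize caller-provided storage in place; nothing is allocated here. */
static void
toy_heap_initialize(toy_heap *heap, toy_comparator compare, void *arg)
{
    heap->compare = compare;
    heap->arg = arg;
    heap->root = NULL;
}

/* Allocating variant, expressed in terms of the in-place one. */
static toy_heap *
toy_heap_allocate(toy_comparator compare, void *arg)
{
    toy_heap   *heap = malloc(sizeof(toy_heap));

    if (heap != NULL)
        toy_heap_initialize(heap, compare, arg);
    return heap;
}

/* Insert a node: whichever of (root, node) compares greater becomes root. */
static void
toy_heap_add(toy_heap *heap, toy_node *node)
{
    toy_node   *root = heap->root;

    node->first_child = NULL;
    node->next_sibling = NULL;
    if (root == NULL)
    {
        heap->root = node;
        return;
    }
    if (heap->compare(root, node, heap->arg) < 0)
    {
        toy_node   *tmp = root;

        root = node;
        node = tmp;
    }
    node->next_sibling = root->first_child;
    root->first_child = node;
    heap->root = root;
}

/* A waiter embeds its heap node, so no per-node allocation is needed. */
typedef struct toy_waiter
{
    uint64_t    wait_lsn;
    toy_node    heap_node;
} toy_waiter;

static const toy_waiter *
toy_waiter_from_node(const toy_node *n)
{
    return (const toy_waiter *) ((const char *) n - offsetof(toy_waiter, heap_node));
}

/* Inverted comparison: the smaller LSN "wins" and rises to the root. */
static int
toy_waiter_cmp(const toy_node *a, const toy_node *b, void *arg)
{
    uint64_t    la = toy_waiter_from_node(a)->wait_lsn;
    uint64_t    lb = toy_waiter_from_node(b)->wait_lsn;

    (void) arg;
    if (la < lb)
        return 1;
    if (la > lb)
        return -1;
    return 0;
}

/* Stand-in for a shared-memory struct that embeds the heap header. */
typedef struct toy_state
{
    toy_heap    waiters;
} toy_state;

/* O(1) read of the smallest awaited LSN; UINT64_MAX when nobody waits. */
static uint64_t
toy_min_waited_lsn(const toy_state *state)
{
    if (state->waiters.root == NULL)
        return UINT64_MAX;
    return toy_waiter_from_node(state->waiters.root)->wait_lsn;
}
```

Because the header lives inside toy_state and each node lives inside its toy_waiter, the whole structure can sit in one preallocated region, which is the property the in-place initializer is meant to enable.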
v18-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From b611a90989aec7695349e47fd1fb89d7dd9b1872 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:59:42 +0800
Subject: [PATCH v18 2/3] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSN(), covering both replay and flush waits
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 409 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 98 +++++
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 5 +
8 files changed, 522 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..1f4b38a5114
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the appropriate
+ * process, standby is promoted, or the postmaster dies. Then, it cleans
+ * information about itself in the shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast path check minWaitedLSN > replayLSN. If this
+ * check is negative, it checks waitersHeap and wakes up the backends
+ * whose awaited LSNs are reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ int i;
+
+ /* Initialize heaps and tracking */
+ for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+ }
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for LSN waiters heaps. Waiting processes are ordered by
+ * LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ int i = (uintptr_t) arg;
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNType operation)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN[i], minWaitedLSN);
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNType operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ Assert(!procInfo->inHeap[i]);
+ pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+ procInfo->inHeap[i] = true;
+ updateMinWaitedLSN(operation);
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNType operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (procInfo->inHeap[i])
+ {
+ pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+ procInfo->inHeap[i] = false;
+ updateMinWaitedLSN(operation);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wakeup by WaitLSNWakeup() allocated
+ * on the stack. It should be enough to take single iteration for most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached. Record
+ * process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
+
+ /* Update appropriate flag */
+ procInfo->inHeap[i] = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(operation);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs have been reached.
+ * Since SetLatch() is a time-consuming operation, we do this outside
+ * of WaitLSNLock. This is safe because procLatch is never freed, so
+ * at worst we may set a latch for the wrong process or for no process
+ * at all, which is harmless.
+ */
+ for (int j = 0; j < numWakeUpProcs; j++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for LSN to reach currentLSN
+ */
+void
+WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ /* Fast path check; InvalidXLogRecPtr means "wake up all waiters" */
+ if (!XLogRecPtrIsInvalid(currentLSN) && pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+ return;
+
+ wakeupWaiters(operation, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ int i;
+
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only
+ * possible to get a false positive. But that will be eliminated by a
+ * recheck inside deleteLSNWaiter().
+ */
+
+ for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
+ {
+ if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
+ deleteLSNWaiter((WaitLSNType) i);
+ }
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is reached, the timeout expires,
+ * the replica gets promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was reached,
+ * WAIT_LSN_RESULT_TIMEOUT if the timeout expired first, and
+ * WAIT_LSN_RESULT_NOT_IN_RECOVERY if recovery was not running or ended first.
+ */
+WaitLSNResult
+WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the waiters heap. It might happen that target LSN
+ * gets reached before we do. The check at the beginning of the loop
+ * below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN, operation);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = -1;
+
+ if (operation == WAIT_LSN_TYPE_REPLAY)
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ else
+ currentLSN = GetFlushRecPtr(NULL);
+
+ /* Check that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery has ended, but check if the target LSN was already
+ * reached.
+ */
+ deleteLSNWaiter(operation);
+
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been reached */
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ (operation == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN"));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory heap. We might already be
+ * deleted by the startup process. The 'inHeap' flags prevent
+ * double deletion.
+ */
+ deleteLSNWaiter(operation);
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..d7aad6d8be4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSN().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+ WAIT_LSN_RESULT_TIMEOUT /* Timeout occurred */
+} WaitLSNResult;
+
+/*
+ * LSN type for waiting facility.
+ */
+typedef enum WaitLSNType
+{
+ WAIT_LSN_TYPE_REPLAY = 0, /* Waiting for replay on standby */
+ WAIT_LSN_TYPE_FLUSH = 1, /* Waiting for flush on primary */
+ WAIT_LSN_TYPE_COUNT = 2
+} WaitLSNType;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about a single process that may wait for an LSN. An item of the
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Heap membership flags for LSN types */
+ bool inHeap[WAIT_LSN_TYPE_COUNT];
+
+ /* Heap nodes for LSN types */
+ pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN values some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN[WAIT_LSN_TYPE_COUNT];
+
+ /*
+ * Pairing heaps of waiting processes ordered by LSN values (least LSN
+ * is on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap[WAIT_LSN_TYPE_COUNT];
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN,
+ int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..e34dcf97df8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3265,7 +3265,12 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNType
+WaitLSNState
+WaitLSNProcInfo
+WaitLSNResult
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.51.0
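
For readers following wakeupWaiters() in the preceding patch, the pattern it describes (collect a bounded batch of waiters while holding the lock, then signal them after releasing it, looping if the batch filled up) can be sketched in isolation. This is a single-threaded illustration under stated assumptions, not code from the patch: `Waiter`, `queue`, `lock_held`, `notify()`, `add_waiter()`, and `BATCH_SIZE` are hypothetical names, a sorted array stands in for the pairing heap, and a boolean flag stands in for WaitLSNLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 4			/* stand-in for WAKEUP_PROC_STATIC_ARRAY_SIZE */
#define MAX_WAITERS 16

typedef struct Waiter
{
	uint64_t	wait_lsn;		/* LSN this waiter is waiting for */
	int			id;				/* hypothetical waiter identifier */
} Waiter;

/* Waiters sorted by wait_lsn ascending; a stand-in for the pairing heap. */
static Waiter queue[MAX_WAITERS];
static size_t queue_len = 0;

/* Models "the lock is held" in this single-threaded sketch. */
static bool lock_held = false;

/* Records which waiters were notified, in order. */
static int	notified[MAX_WAITERS];
static size_t num_notified = 0;

/* Insert a waiter, keeping the queue ordered by LSN. */
static void
add_waiter(int id, uint64_t lsn)
{
	size_t		pos = queue_len;

	assert(queue_len < MAX_WAITERS);
	while (pos > 0 && queue[pos - 1].wait_lsn > lsn)
	{
		queue[pos] = queue[pos - 1];
		pos--;
	}
	queue[pos].wait_lsn = lsn;
	queue[pos].id = id;
	queue_len++;
}

/* Stand-in for SetLatch(); must never run while the lock is held. */
static void
notify(int id)
{
	assert(!lock_held);
	assert(num_notified < MAX_WAITERS);
	notified[num_notified++] = id;
}

/* Wake waiters whose LSN is reached; wake_all ignores current_lsn. */
static void
wakeup_waiters(uint64_t current_lsn, bool wake_all)
{
	int			batch[BATCH_SIZE];
	int			n;

	do
	{
		size_t		taken = 0;

		n = 0;
		lock_held = true;		/* "acquire" the lock */
		while (taken < queue_len && n < BATCH_SIZE)
		{
			if (!wake_all && queue[taken].wait_lsn > current_lsn)
				break;			/* first LSN not yet reached */
			batch[n++] = queue[taken++].id;
		}
		/* Drop the collected entries from the front of the queue. */
		for (size_t k = taken; k < queue_len; k++)
			queue[k - taken] = queue[k];
		queue_len -= taken;
		lock_held = false;		/* "release" the lock */

		/* Signal outside the lock, using a dedicated loop variable. */
		for (int j = 0; j < n; j++)
			notify(batch[j]);
	} while (n == BATCH_SIZE);	/* a full batch means more may remain */
}
```

With six waiters queued and a batch size of four, a single call that reaches all of their LSNs takes two passes through the outer loop: the first collects four entries, the second collects the remaining two and then stops because the batch is not full.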
Attachment: v18-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From f009c6a8bd305b50889366877cc7a8581fb40157 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 12:03:13 +0800
Subject: [PATCH v18 3/3] Implement WAIT FOR command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
WAIT FOR is to be used on a standby and waits for a specific WAL location
to be replayed. This is useful when the user makes some data changes on
the primary and needs a guarantee that these changes are visible on the
standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 +++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 302 ++++++++++++++++++++++
20 files changed, 930 insertions(+), 11 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..742deb037b7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn 
+---------------------------
+ 0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is raised
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <literal>NO_THROW</literal> is specified, the outcome is reported
+ through the command's return value instead.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number
+ of milliseconds. It can also be given as a string literal containing
+ an integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Do not throw an error in the case of a timeout or when
+ running on the primary. In this case, the result status can be
+ obtained from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> has been replayed on the standby.
+ That is, after this command completes, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or
+ equal to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an async replica for reads and
+ the primary for writes. In that case, the <acronym>LSN</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from a standby and <parameter>lsn</parameter>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ returns immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn 
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+ Then the application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from the primary. After that, the
+ changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example with the <literal>NO_THROW</literal> option returns
+ the <literal>timeout</literal> status instead of raising an error.
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd91bcd68ec..45a16bd1ec2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6227,6 +6228,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..0e51148110f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for, wake up the
+ * corresponding waiters by setting their latches (see
+ * WaitLSNWakeup()).
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..67068a92dbf
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * an LSN having been replayed on a replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round into
+ * range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We must first make sure that
+ * we don't hold a snapshot and, correspondingly, that our MyProc->xmin is
+ * invalid. Otherwise, our snapshot could prevent the replay of WAL records,
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is a
+ * command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot, if any. With that, the
+ * preparation should be complete.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As a result, we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSN(). Throw an appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a4b29c822e8..a4e6f80504b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16462,6 +16463,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17949,6 +17970,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18119,6 +18141,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18565,6 +18588,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18776,6 +18800,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Clean up any pending wait for LSN.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 74179139fa9..fde78c55160 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1158,10 +1158,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1738,7 +1739,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ecbddd12e1b..d14294a4ece 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,4 +4385,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..e0ddb06a2f0
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,302 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+ "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Check parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/, "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/,
+ "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+ stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+ "get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok($stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success", "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout",
+ "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok( $stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions, each waiting for a corresponding insertion. When the wait
+# finishes, a function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 7. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written, then wait for the standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
--
2.51.0
Hi,
On Sun, Nov 2, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi, Alexander!
On Thu, Oct 23, 2025 at 8:58 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Thu, Oct 23, 2025 at 6:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi!
On Thu, Oct 16, 2025 at 10:12 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Wed, Oct 15, 2025 at 8:48 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
Thank you for the grammar review and the clear recommendation.
On Wed, Oct 15, 2025 at 4:51 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
I didn't review the patch other than look at the grammar, but I disagree
with using opt_with in it. I think WITH should be a mandatory word, or
just not be there at all. The current formulation lets you do one of:
1. WAIT FOR LSN '123/456' WITH (opt = val);
2. WAIT FOR LSN '123/456' (opt = val);
3. WAIT FOR LSN '123/456';
and I don't see why you need two ways to specify an option list.
I agree with this as unnecessary choices are confusing.
So one option is to remove opt_wait_with_clause and just use
opt_utility_option_list, which would remove the WITH keyword from there
(ie. only keep 2 and 3 from the above list). But I think that's worse:
just look at the REPACK grammar[1], where we have to have additional
productions for the optional parenthesized option list.
So why not do just
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
which keeps options 1 and 3 of the list above.
Your suggested approach of making WITH mandatory when options are
present looks better.
I've implemented the change as you recommended. Please see patch 3 in v16.
Note: you don't need to worry about WITH_LA, because that's only going
to show up when the user writes WITH TIME or WITH ORDINALITY (see
parser.c), and that's a syntax error anyway.
Yeah, we require '(' immediately after WITH in our grammar, the
lookahead mechanism will keep it as regular WITH, and any attempt to
write "WITH TIME" or "WITH ORDINALITY" would be a syntax error anyway,
which is expected.
The filename of patch 1 was incorrect due to a copying mistake. Just corrected it.
Thank you for rebasing the patch.
I've revised it. Significant changes have been made to 0002, where
I reduced the code duplication. Also, I ran pgindent and pgperltidy
and made other small improvements.
Please, check.
Thanks for updating the patch set!
Patch 2 looks more elegant after the revision. I’ll review them soon.
I’ve made a few minor updates to the comments and docs in patches 2
and 3. The patch set LGTM now.
Fixed a minor issue in v18: WaitStmt was mistakenly added to
pgindent/typedefs.list in patch 2, but it should belong to patch 3.
Best,
Xuneng
Attachments:
v19-0002-Add-infrastructure-for-efficient-LSN-waiting.patch
From 65d6c1d497389925961738207422cd2bc69c95bd Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:59:42 +0800
Subject: [PATCH v19 2/3] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
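To make the fast-path idea above concrete, here is a minimal standalone sketch in C. It is not code from this patch: WaitState, Waiter, add_waiter(), wake_waiters(), MAX_WAITERS and NO_WAITERS are hypothetical names, a linear scan over a small array stands in for the pairing heap, and a boolean flag stands in for setting a process latch. It only illustrates how caching the minimum waited LSN lets the waker return without scanning when no waiter can be satisfied yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 8			/* hypothetical fixed capacity */
#define NO_WAITERS UINT64_MAX	/* sentinel: nobody is waiting */

typedef struct Waiter
{
	uint64_t	wait_lsn;		/* LSN this waiter needs to be reached */
	bool		in_use;			/* slot currently holds a registered waiter */
	bool		woken;			/* stand-in for setting the waiter's latch */
} Waiter;

typedef struct WaitState
{
	uint64_t	min_waited_lsn; /* cached minimum over registered waiters */
	Waiter		waiters[MAX_WAITERS];
} WaitState;

static void
wait_state_init(WaitState *st)
{
	st->min_waited_lsn = NO_WAITERS;
	for (size_t i = 0; i < MAX_WAITERS; i++)
		st->waiters[i] = (Waiter) {0, false, false};
}

/* Recompute the cached minimum; a linear scan replaces the heap here. */
static void
update_min(WaitState *st)
{
	uint64_t	min = NO_WAITERS;

	for (size_t i = 0; i < MAX_WAITERS; i++)
	{
		if (st->waiters[i].in_use && st->waiters[i].wait_lsn < min)
			min = st->waiters[i].wait_lsn;
	}
	st->min_waited_lsn = min;
}

/* Register a waiter; returns its slot index, or -1 if all slots are taken. */
static int
add_waiter(WaitState *st, uint64_t lsn)
{
	assert(lsn != NO_WAITERS);	/* the sentinel is not a valid target */

	for (size_t i = 0; i < MAX_WAITERS; i++)
	{
		if (!st->waiters[i].in_use)
		{
			st->waiters[i] = (Waiter) {lsn, true, false};
			update_min(st);
			return (int) i;
		}
	}
	return -1;
}

/*
 * Wake every waiter whose target is <= current_lsn and return how many
 * were woken. *scanned reports whether the slow path (a scan) was taken.
 */
static int
wake_waiters(WaitState *st, uint64_t current_lsn, bool *scanned)
{
	int			nwoken = 0;

	*scanned = false;

	/* Fast path: even the smallest awaited LSN has not been reached. */
	if (st->min_waited_lsn > current_lsn)
		return 0;

	*scanned = true;
	for (size_t i = 0; i < MAX_WAITERS; i++)
	{
		Waiter	   *w = &st->waiters[i];

		if (w->in_use && w->wait_lsn <= current_lsn)
		{
			w->in_use = false;
			w->woken = true;
			nwoken++;
		}
	}
	update_min(st);
	return nwoken;
}
```

For example, with waiters registered at LSNs 200 and 300, a call with current LSN 150 returns through the cached minimum without scanning, while a call with 250 wakes only the first waiter and moves the cached minimum to 300.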
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 409 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 98 +++++
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 4 +
8 files changed, 521 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..1f4b38a5114
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the appropriate
+ * process, standby is promoted, or the postmaster dies. Then, it cleans
+ * information about itself in the shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast path check minWaitedLSN > replayLSN. If this
+ * check is negative, it checks waitersHeap and wakes up the backends
+ * whose awaited LSNs have been reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ int i;
+
+ /* Initialize heaps and tracking */
+ for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+ }
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for LSN waiters heaps. Waiting processes are ordered by
+ * LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ int i = (uintptr_t) arg;
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified operation type
+ */
+static void
+updateMinWaitedLSN(WaitLSNType operation)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN[i], minWaitedLSN);
+}
+
+/*
+ * Add current process to appropriate waiters heap based on operation type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNType operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ Assert(!procInfo->inHeap[i]);
+ pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+ procInfo->inHeap[i] = true;
+ updateMinWaitedLSN(operation);
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on operation type
+ */
+static void
+deleteLSNWaiter(WaitLSNType operation)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (procInfo->inHeap[i])
+ {
+ pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+ procInfo->inHeap[i] = false;
+ updateMinWaitedLSN(operation);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of the on-stack array of procs that wakeupWaiters() wakes up per
+ * iteration. It should be enough for a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached. Record
+ * process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
+
+ /* Update appropriate flag */
+ procInfo->inHeap[i] = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(operation);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs have been reached.
+ * Since SetLatch() is a time-consuming operation, we do this outside
+ * of WaitLSNLock. This is safe because procLatch is never freed, so
+ * at worst we may set a latch for the wrong process or for no process
+ * at all, which is harmless.
+ */
+ for (int j = 0; j < numWakeUpProcs; j++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for LSN to reach currentLSN
+ */
+void
+WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN)
+{
+ int i = (int) operation;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ /* Fast path check; InvalidXLogRecPtr means "wake all", so never skip it */
+ if (!XLogRecPtrIsInvalid(currentLSN) && pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+ return;
+
+ wakeupWaiters(operation, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ int i;
+
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only
+ * possible to get a false positive. But that will be eliminated by a
+ * recheck inside deleteLSNWaiter().
+ */
+
+ for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
+ {
+ if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
+ deleteLSNWaiter((WaitLSNType) i);
+ }
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is reached, the replica gets
+ * promoted, the timeout expires, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if the target LSN was reached,
+ * WAIT_LSN_RESULT_TIMEOUT on timeout, WAIT_LSN_RESULT_NOT_IN_RECOVERY if not
+ * in recovery or if the replica got promoted before reaching the target LSN.
+ */
+WaitLSNResult
+WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the waiters heap. It might happen that target LSN
+ * gets reached before we do. The check at the beginning of the loop
+ * below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN, operation);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = -1;
+
+ if (operation == WAIT_LSN_TYPE_REPLAY)
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ else
+ currentLSN = GetFlushRecPtr(NULL);
+
+ /* Check that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but check if target LSN was already
+ * reached.
+ */
+ deleteLSNWaiter(operation);
+
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been reached */
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ (operation == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN"));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory heap. We might already be
+ * deleted by the startup process. The 'inHeap' flag prevents a double
+ * deletion.
+ */
+ deleteLSNWaiter(operation);
+
+ /*
+ * If we didn't reach the target LSN, we must have exited by timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..d7aad6d8be4
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSN().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+ WAIT_LSN_RESULT_TIMEOUT /* Timeout occurred */
+} WaitLSNResult;
+
+/*
+ * LSN type for waiting facility.
+ */
+typedef enum WaitLSNType
+{
+ WAIT_LSN_TYPE_REPLAY = 0, /* Waiting for replay on standby */
+ WAIT_LSN_TYPE_FLUSH = 1, /* Waiting for flush on primary */
+ WAIT_LSN_TYPE_COUNT = 2
+} WaitLSNType;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Heap membership flags for LSN types */
+ bool inHeap[WAIT_LSN_TYPE_COUNT];
+
+ /* Heap nodes for LSN types */
+ pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN values some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN[WAIT_LSN_TYPE_COUNT];
+
+ /*
+ * Pairing heaps of waiting processes, one per LSN type, ordered by LSN
+ * values (least LSN is on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap[WAIT_LSN_TYPE_COUNT];
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(WaitLSNType operation, XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSN(WaitLSNType operation, XLogRecPtr targetLSN,
+ int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..237d33c538c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3265,6 +3265,10 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNType
+WaitLSNState
+WaitLSNProcInfo
+WaitLSNResult
WaitPMResult
WalCloseMethod
WalCompression
--
2.51.0
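
For readers following along, the batched wake-up in wakeupWaiters() can be
modeled outside the backend. The sketch below is only an illustration under
loud assumptions: every name in it (ToyWaiter, toy_min_waited_lsn, toy_wakeup,
TOY_BATCH_SIZE) is hypothetical, a plain array scan stands in for the pairing
heap, and a boolean flag stands in for SetLatch(). It is not the patch's code.
It shows the fast-path check against the minimum waited LSN, the
collect-then-notify split, and the loop that repeats while a batch fills up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BATCH_SIZE 4			/* stand-in for WAKEUP_PROC_STATIC_ARRAY_SIZE */
#define TOY_NO_WAITERS UINT64_MAX	/* stand-in for PG_UINT64_MAX */

/* Hypothetical per-process wait record; not the patch's WaitLSNProcInfo. */
typedef struct ToyWaiter
{
	uint64_t	wait_lsn;	/* LSN this waiter is blocked on */
	bool		waiting;	/* still registered as a waiter? */
	bool		woken;		/* set where real code would call SetLatch() */
} ToyWaiter;

/* Smallest LSN any registered waiter is blocked on, or TOY_NO_WAITERS. */
static uint64_t
toy_min_waited_lsn(const ToyWaiter *waiters, size_t n)
{
	uint64_t	min = TOY_NO_WAITERS;

	for (size_t k = 0; k < n; k++)
		if (waiters[k].waiting && waiters[k].wait_lsn < min)
			min = waiters[k].wait_lsn;
	return min;
}

/*
 * Wake every registered waiter whose LSN is <= current_lsn, one bounded
 * batch at a time. Returns the number of waiters woken.
 */
static size_t
toy_wakeup(ToyWaiter *waiters, size_t n, uint64_t current_lsn)
{
	size_t		total = 0;
	size_t		batch_len;

	/* Fast path: nobody waits for an LSN at or below current_lsn. */
	if (toy_min_waited_lsn(waiters, n) > current_lsn)
		return 0;

	do
	{
		size_t		batch[TOY_BATCH_SIZE];

		batch_len = 0;

		/* Phase 1 (done under the lock in real code): collect and unregister. */
		for (size_t k = 0; k < n && batch_len < TOY_BATCH_SIZE; k++)
		{
			if (waiters[k].waiting && waiters[k].wait_lsn <= current_lsn)
			{
				waiters[k].waiting = false;
				batch[batch_len++] = k;
			}
		}

		/* Phase 2 (done after releasing the lock): notify the batch. */
		for (size_t j = 0; j < batch_len; j++)
			waiters[batch[j]].woken = true;

		total += batch_len;
	} while (batch_len == TOY_BATCH_SIZE);	/* a full batch may hide more */

	return total;
}
```

The notify loop uses its own index variable, so the state that selects which
waiters to scan survives into the next batch. A bounded batch keeps the stack
footprint fixed, at the cost of one extra pass when a batch fills up exactly.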
v19-0001-Add-pairingheap_initialize-for-shared-memory-usa.patch
From b0ee110622dacd2d4769da6915580e9c3220c09f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:56:53 +0800
Subject: [PATCH v19 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.51.0
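
The split between allocating and initializing in place, which this patch
introduces for pairingheap, can be sketched in isolation. This is a minimal
illustration with hypothetical names (ToyHeap, toy_heap_initialize,
toy_heap_allocate, ToySharedState, toy_cmp). It assumes malloc() as a stand-in
for palloc() and an ordinary struct as a stand-in for a shared memory segment,
and it is not the pairingheap API itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical comparator type, loosely modeled on pairingheap_comparator. */
typedef int (*toy_comparator) (const void *a, const void *b, void *arg);

/* Hypothetical heap header: comparator, opaque argument, and root pointer. */
typedef struct ToyHeap
{
	toy_comparator compare;
	void	   *arg;
	void	   *root;
} ToyHeap;

/* Initialize a heap header in memory the caller already owns. */
static void
toy_heap_initialize(ToyHeap *heap, toy_comparator compare, void *arg)
{
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate process-local memory, then reuse the in-place initializer. */
static ToyHeap *
toy_heap_allocate(toy_comparator compare, void *arg)
{
	ToyHeap    *heap = malloc(sizeof(ToyHeap));

	if (heap == NULL)
		return NULL;
	toy_heap_initialize(heap, compare, arg);
	return heap;
}

/* Stand-in for a shared memory struct that embeds its heaps by value. */
typedef struct ToySharedState
{
	ToyHeap		heaps[2];
} ToySharedState;

/* Trivial comparator so the example links; it treats all nodes as equal. */
static int
toy_cmp(const void *a, const void *b, void *arg)
{
	(void) a;
	(void) b;
	(void) arg;
	return 0;
}
```

Embedding the heap header by value and initializing it in place avoids
allocating it from process-local memory, which other processes could not
reach. The allocating variant stays as a thin wrapper, so existing callers are
unaffected.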
v19-0003-Implement-WAIT-FOR-command.patch
From b686e6126ac9eb5b54a2782ac8be3539454da49a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Mon, 3 Nov 2025 09:57:30 +0800
Subject: [PATCH v19 3/3] Implement WAIT FOR command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this respect.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 +++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 302 ++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
21 files changed, 931 insertions(+), 11 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..742deb037b7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ In that case, the command reports the outcome through its return
+ value instead of throwing an error.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> may be given as an integer number of
+ milliseconds, or as a string literal containing either an integer
+ number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case the result status can be
+ obtained from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> has been replayed on the standby.
+ That is, after this command completes successfully, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or
+ equal to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an async replica for reads and
+ the primary for writes. In that case, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on a primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
+(1 row)
+</programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from the primary. After that, the
+ changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example with the <parameter>NO_THROW</parameter> option
+ returns a status instead of throwing an error.
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Clean up waiting for LSN, if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd91bcd68ec..45a16bd1ec2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6227,6 +6228,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..0e51148110f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..67068a92dbf
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round into
+ * range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We should first ensure that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records,
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is a
+ * command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate a catalog snapshot if any. And we should be done
+ * with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As the result we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSN(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a4b29c822e8..a4e6f80504b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16462,6 +16463,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17949,6 +17970,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18119,6 +18141,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18565,6 +18588,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18776,6 +18800,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 74179139fa9..fde78c55160 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1158,10 +1158,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1738,7 +1739,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ecbddd12e1b..d14294a4ece 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,4 +4385,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..e0ddb06a2f0
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,302 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# unreachable LSN must be far enough ahead that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+ "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Check parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/, "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/,
+ "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+ stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+ "get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok($stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success", "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout",
+ "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok( $stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions, each waiting for a corresponding insertion. When waiting is
+# finished, a stored function logs if as many rows are visible as
+# there should be.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 7. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catch up.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 237d33c538c..e34dcf97df8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3270,6 +3270,7 @@ WaitLSNState
WaitLSNProcInfo
WaitLSNResult
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.51.0
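For readers skimming the patch above, the option loop in ExecWaitStmt() can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: the WaitOption, WaitParams and OptsStatus types and the parse_wait_options() function are hypothetical, option values are assumed to be already parsed into integers, and error reporting is reduced to a return code instead of ereport().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: a name plus an already-parsed value. */
typedef struct WaitOption
{
    const char *name;   /* option name, e.g. "timeout" or "no_throw" */
    int64_t     value;  /* milliseconds for timeout, 0/1 for no_throw */
} WaitOption;

/* Hypothetical result of option processing. */
typedef struct WaitParams
{
    int64_t timeout_ms;        /* same default as "int64 timeout = 0" */
    bool    throw_on_failure;  /* same default as "bool throw = true" */
} WaitParams;

typedef enum
{
    OPTS_OK,
    OPTS_CONFLICTING,       /* the same option was given twice */
    OPTS_UNRECOGNIZED,      /* unknown option name */
    OPTS_NEGATIVE_TIMEOUT   /* timeout below zero */
} OptsStatus;

/*
 * Walk the option list once, rejecting duplicates and unknown names,
 * in the same order of checks as the loop in the patch.
 */
static OptsStatus
parse_wait_options(const WaitOption *opts, size_t n, WaitParams *out)
{
    bool timeout_seen = false;
    bool no_throw_seen = false;

    out->timeout_ms = 0;
    out->throw_on_failure = true;

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(opts[i].name, "timeout") == 0)
        {
            if (timeout_seen)
                return OPTS_CONFLICTING;
            timeout_seen = true;

            if (opts[i].value < 0)
                return OPTS_NEGATIVE_TIMEOUT;
            out->timeout_ms = opts[i].value;
        }
        else if (strcmp(opts[i].name, "no_throw") == 0)
        {
            if (no_throw_seen)
                return OPTS_CONFLICTING;
            no_throw_seen = true;

            /* no_throw = true means "do not throw" */
            out->throw_on_failure = (opts[i].value == 0);
        }
        else
            return OPTS_UNRECOGNIZED;
    }
    return OPTS_OK;
}
```

The four status codes line up with the error cases the TAP test exercises: duplicate timeout or no_throw, an unknown option name, and a negative timeout. Unlike this sketch, the patch accepts a bare no_throw with no value.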
Hello, Xuneng!
On Mon, Nov 3, 2025 at 4:20 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Sun, Nov 2, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Thu, Oct 23, 2025 at 8:58 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
I’ve made a few minor updates to the comments and docs in patches 2
and 3. The patch set LGTM now.
Fix a minor issue in v18: WaitStmt was mistakenly added to
pgindent/typedefs.list in patch 2, but it should belong to patch 3.
Thank you. I also made some minor changes to 0002 renaming
"operation" => "lsnType".
I'd like to give this subject another chance for pg19. I'm going to
push this if there are no objections.
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v20-0002-Add-infrastructure-for-efficient-LSN-waiting.patch (application/octet-stream)
From 27d57234c169c6612e432bb5ff19acac2c5982d9 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 3 Nov 2025 13:31:13 +0200
Subject: [PATCH v20 2/3] Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.
The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.
Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring
This infrastructure can be used by features that need to wait for WAL
operations to complete.
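The fast-path idea described above can be illustrated with a minimal single-process sketch. It is not the patch's code and makes several simplifying assumptions: a small fixed array scanned linearly stands in for the pairing heap, there is no locking, and a boolean flag stands in for setting a latch. The names Waiter, add_waiter, wake_up_to, after_replay and NO_WAITERS are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define NO_WAITERS  UINT64_MAX  /* hypothetical sentinel: nobody is waiting */

/* Hypothetical per-backend slot. */
typedef struct Waiter
{
    uint64_t target;   /* LSN this waiter needs */
    bool     waiting;  /* slot is in use */
    bool     woken;    /* stands in for setting the waiter's latch */
} Waiter;

static Waiter waiters[MAX_WAITERS];

/* Cached minimum awaited LSN, readable without touching the waiter set. */
static _Atomic uint64_t min_waited = NO_WAITERS;

/* Recompute the cached minimum (a real heap would read its root instead). */
static void
recompute_min(void)
{
    uint64_t min = NO_WAITERS;

    for (int i = 0; i < MAX_WAITERS; i++)
        if (waiters[i].waiting && waiters[i].target < min)
            min = waiters[i].target;
    atomic_store(&min_waited, min);
}

/* A waiter publishes the LSN it needs. */
static void
add_waiter(int slot, uint64_t target)
{
    waiters[slot] = (Waiter) {.target = target, .waiting = true, .woken = false};
    recompute_min();
}

/* Slow path: wake every waiter whose target has been reached. */
static int
wake_up_to(uint64_t reached)
{
    int nwoken = 0;

    for (int i = 0; i < MAX_WAITERS; i++)
    {
        if (waiters[i].waiting && waiters[i].target <= reached)
        {
            waiters[i].waiting = false;
            waiters[i].woken = true;
            nwoken++;
        }
    }
    recompute_min();
    return nwoken;
}

/*
 * Called after each replayed (or flushed) position. Returns true only
 * when the slow path had to run; otherwise one atomic read is all the
 * work done.
 */
static bool
after_replay(uint64_t reached)
{
    if (reached < atomic_load(&min_waited))
        return false;           /* fast path: nobody can be satisfied yet */
    wake_up_to(reached);
    return true;
}
```

With waiters at 100 and 300, after_replay(50) returns without scanning anything, while after_replay(100) wakes the first waiter and moves the cached minimum to 300.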
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/xlogwait.c | 409 ++++++++++++++++++
src/backend/storage/ipc/ipci.c | 3 +
.../utils/activity/wait_event_names.txt | 3 +
src/include/access/xlogwait.h | 98 +++++
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 4 +
8 files changed, 521 insertions(+), 1 deletion(-)
create mode 100644 src/backend/access/transam/xlogwait.c
create mode 100644 src/include/access/xlogwait.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..a32f473e0a2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -36,7 +36,8 @@ OBJS = \
xlogreader.o \
xlogrecovery.o \
xlogstats.o \
- xlogutils.o
+ xlogutils.o \
+ xlogwait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index e8ae9b13c8e..74a62ab3eab 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'xlogrecovery.c',
'xlogstats.c',
'xlogutils.c',
+ 'xlogwait.c',
)
# used by frontend programs to build a frontend xlogreader
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
new file mode 100644
index 00000000000..e04567cfd67
--- /dev/null
+++ b/src/backend/access/transam/xlogwait.c
@@ -0,0 +1,409 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.c
+ * Implements waiting for WAL operations to reach specific LSNs.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/access/transam/xlogwait.c
+ *
+ * NOTES
+ * This file implements waiting for WAL operations to reach specific LSNs
+ * on both physical standby and primary servers. The core idea is simple:
+ * every process that wants to wait publishes the LSN it needs to the
+ * shared memory, and the appropriate process (startup on standby, or
+ * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ *
+ * The shared memory used by this module comprises a procInfos
+ * per-backend array with the information of the awaited LSN for each
+ * of the backend processes. The elements of that array are organized
+ * into a pairing heap waitersHeap, which allows for very fast finding
+ * of the least awaited LSN.
+ *
+ * In addition, the least-awaited LSN is cached as minWaitedLSN. The
+ * waiter process publishes information about itself to the shared
+ * memory and waits on the latch until it is woken up by the appropriate
+ * process, the standby is promoted, or the postmaster dies. Then, it
+ * removes its information from the shared memory.
+ *
+ * On standby servers: After replaying a WAL record, the startup process
+ * first performs a fast path check minWaitedLSN > replayLSN. If this
+ * check fails, it scans waitersHeap and wakes up the backends
+ * whose awaited LSNs have been reached.
+ *
+ * On primary servers: After flushing WAL, the WAL writer or backend
+ * process performs a similar check against the flush LSN and wakes up
+ * waiters whose target flush LSNs have been reached.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+ void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+ Size size;
+
+ size = offsetof(WaitLSNState, procInfos);
+ size = add_size(size, mul_size(MaxBackends + NUM_AUXILIARY_PROCS, sizeof(WaitLSNProcInfo)));
+ return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+ bool found;
+
+ waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+ WaitLSNShmemSize(),
+ &found);
+ if (!found)
+ {
+ int i;
+
+ /* Initialize heaps and tracking */
+ for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
+ {
+ pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
+ pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+ }
+
+ /* Initialize process info array */
+ memset(&waitLSNState->procInfos, 0,
+ (MaxBackends + NUM_AUXILIARY_PROCS) * sizeof(WaitLSNProcInfo));
+ }
+}
+
+/*
+ * Comparison function for LSN waiters heaps. Waiting processes are ordered by
+ * LSN, so that the waiter with smallest LSN is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+ int i = (uintptr_t) arg;
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+
+ if (aproc->waitLSN < bproc->waitLSN)
+ return 1;
+ else if (aproc->waitLSN > bproc->waitLSN)
+ return -1;
+ else
+ return 0;
+}
+
+/*
+ * Update minimum waited LSN for the specified LSN type
+ */
+static void
+updateMinWaitedLSN(WaitLSNType lsnType)
+{
+ XLogRecPtr minWaitedLSN = PG_UINT64_MAX;
+ int i = (int) lsnType;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+ minWaitedLSN = procInfo->waitLSN;
+ }
+ pg_atomic_write_u64(&waitLSNState->minWaitedLSN[i], minWaitedLSN);
+}
+
+/*
+ * Add current process to appropriate waiters heap based on LSN type
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn, WaitLSNType lsnType)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+ int i = (int) lsnType;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ procInfo->procno = MyProcNumber;
+ procInfo->waitLSN = lsn;
+
+ Assert(!procInfo->inHeap[i]);
+ pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+ procInfo->inHeap[i] = true;
+ updateMinWaitedLSN(lsnType);
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove current process from appropriate waiters heap based on LSN type
+ */
+static void
+deleteLSNWaiter(WaitLSNType lsnType)
+{
+ WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+ int i = (int) lsnType;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ if (procInfo->inHeap[i])
+ {
+ pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
+ procInfo->inHeap[i] = false;
+ updateMinWaitedLSN(lsnType);
+ }
+
+ LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Size of a static array of procs to wake up by WaitLSNWakeup(), allocated
+ * on the stack. It should be enough to take a single iteration in most cases.
+ */
+#define WAKEUP_PROC_STATIC_ARRAY_SIZE (16)
+
+/*
+ * Remove waiters whose LSN has been reached from the heap and set their
+ * latches. If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
+ *
+ * This function first accumulates waiters to wake up into an array, then
+ * wakes them up without holding a WaitLSNLock. The array size is static and
+ * equal to WAKEUP_PROC_STATIC_ARRAY_SIZE. That should be more than enough
+ * to wake up all the waiters at once in the vast majority of cases. However,
+ * if there are more waiters, this function will loop to process them in
+ * multiple chunks.
+ */
+static void
+wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
+{
+ ProcNumber wakeUpProcs[WAKEUP_PROC_STATIC_ARRAY_SIZE];
+ int numWakeUpProcs;
+ int i = (int) lsnType;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ do
+ {
+ numWakeUpProcs = 0;
+ LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+ /*
+ * Iterate the waiters heap until we find LSN not yet reached. Record
+ * process numbers to wake up, but send wakeups after releasing lock.
+ */
+ while (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
+ {
+ pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
+ WaitLSNProcInfo *procInfo;
+
+ /* Get procInfo using appropriate heap node */
+ procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+
+ if (!XLogRecPtrIsInvalid(currentLSN) && procInfo->waitLSN > currentLSN)
+ break;
+
+ Assert(numWakeUpProcs < WAKEUP_PROC_STATIC_ARRAY_SIZE);
+ wakeUpProcs[numWakeUpProcs++] = procInfo->procno;
+ (void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
+
+ /* Update appropriate flag */
+ procInfo->inHeap[i] = false;
+
+ if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
+ break;
+ }
+
+ updateMinWaitedLSN(lsnType);
+ LWLockRelease(WaitLSNLock);
+
+ /*
+ * Set latches for processes whose waited LSNs have been reached.
+ * Since SetLatch() is a time-consuming operation, we do this outside
+ * of WaitLSNLock. This is safe because procLatch is never freed, so
+ * at worst we may set a latch for the wrong process or for no process
+ * at all, which is harmless.
+ */
+ for (int j = 0; j < numWakeUpProcs; j++)
+ SetLatch(&GetPGProcByNumber(wakeUpProcs[j])->procLatch);
+
+ } while (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE);
+}
+
+/*
+ * Wake up processes waiting for LSN to reach currentLSN
+ */
+void
+WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN)
+{
+ int i = (int) lsnType;
+
+ Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
+
+ /* Fast path check */
+ if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+ return;
+
+ wakeupWaiters(lsnType, currentLSN);
+}
+
+/*
+ * Clean up LSN waiters for exiting process
+ */
+void
+WaitLSNCleanup(void)
+{
+ if (waitLSNState)
+ {
+ int i;
+
+ /*
+ * We do a fast-path check of the heap flags without the lock. These
+ * flags are set to true only by the process itself. So, it's only
+ * possible to get a false positive. But that will be eliminated by a
+ * recheck inside deleteLSNWaiter().
+ */
+
+ for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
+ {
+ if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
+ deleteLSNWaiter((WaitLSNType) i);
+ }
+ }
+}
+
+/*
+ * Wait using MyLatch till the given LSN is reached, the replica gets
+ * promoted, or the postmaster dies.
+ *
+ * Returns WAIT_LSN_RESULT_SUCCESS if target LSN was reached.
+ * Returns WAIT_LSN_RESULT_NOT_IN_RECOVERY if run not in recovery,
+ * or replica got promoted before the target LSN reached.
+ */
+WaitLSNResult
+WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
+{
+ XLogRecPtr currentLSN;
+ TimestampTz endtime = 0;
+ int wake_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
+
+ /* Shouldn't be called when shmem isn't initialized */
+ Assert(waitLSNState);
+
+ /* Should have a valid proc number */
+ Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+ if (timeout > 0)
+ {
+ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+ wake_events |= WL_TIMEOUT;
+ }
+
+ /*
+ * Add our process to the waiters heap. It might happen that target LSN
+ * gets reached before we do. The check at the beginning of the loop
+ * below prevents the race condition.
+ */
+ addLSNWaiter(targetLSN, lsnType);
+
+ for (;;)
+ {
+ int rc;
+ long delay_ms = -1;
+
+ if (lsnType == WAIT_LSN_TYPE_REPLAY)
+ currentLSN = GetXLogReplayRecPtr(NULL);
+ else
+ currentLSN = GetFlushRecPtr(NULL);
+
+ /* Check that recovery is still in-progress */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * Recovery was ended, but check if target LSN was already
+ * reached.
+ */
+ deleteLSNWaiter(lsnType);
+
+ if (PromoteIsTriggered() && targetLSN <= currentLSN)
+ return WAIT_LSN_RESULT_SUCCESS;
+ return WAIT_LSN_RESULT_NOT_IN_RECOVERY;
+ }
+ else
+ {
+ /* Check if the waited LSN has been reached */
+ if (targetLSN <= currentLSN)
+ break;
+ }
+
+ if (timeout > 0)
+ {
+ delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+ if (delay_ms <= 0)
+ break;
+ }
+
+ CHECK_FOR_INTERRUPTS();
+
+ rc = WaitLatch(MyLatch, wake_events, delay_ms,
+ (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ ereport(FATAL,
+ errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection due to unexpected postmaster exit"),
+ errcontext("while waiting for LSN"));
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+ }
+
+ /*
+ * Delete our process from the shared memory heap. We might already be
+ * deleted by the startup process. The 'inHeap' flags protect us from
+ * double deletion.
+ */
+ deleteLSNWaiter(lsnType);
+
+ /*
+ * If we didn't reach the target LSN, we must have exited due to timeout.
+ */
+ if (targetLSN > currentLSN)
+ return WAIT_LSN_RESULT_TIMEOUT;
+
+ return WAIT_LSN_RESULT_SUCCESS;
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..10ffce8d174 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -24,6 +24,7 @@
#include "access/twophase.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "commands/async.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, WaitLSNShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -343,6 +345,7 @@ CreateOrAttachShmemStructs(void)
WaitEventCustomShmemInit();
InjectionPointShmemInit();
AioShmemInit();
+ WaitLSNShmemInit();
}
/*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..c1ac71ff7f2 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,6 +89,8 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
@@ -355,6 +357,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
AioWorkerSubmissionQueue "Waiting to access AIO worker submission queue."
+WaitLSN "Waiting to read or update shared Wait-for-LSN state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
new file mode 100644
index 00000000000..4dc328b1b07
--- /dev/null
+++ b/src/include/access/xlogwait.h
@@ -0,0 +1,98 @@
+/*-------------------------------------------------------------------------
+ *
+ * xlogwait.h
+ * Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/xlogwait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef XLOG_WAIT_H
+#define XLOG_WAIT_H
+
+#include "access/xlogdefs.h"
+#include "lib/pairingheap.h"
+#include "port/atomics.h"
+#include "storage/procnumber.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * Result statuses for WaitForLSN().
+ */
+typedef enum
+{
+ WAIT_LSN_RESULT_SUCCESS, /* Target LSN is reached */
+ WAIT_LSN_RESULT_NOT_IN_RECOVERY, /* Recovery ended before or during our
+ * wait */
+ WAIT_LSN_RESULT_TIMEOUT /* Timeout occurred */
+} WaitLSNResult;
+
+/*
+ * LSN type for waiting facility.
+ */
+typedef enum WaitLSNType
+{
+ WAIT_LSN_TYPE_REPLAY = 0, /* Waiting for replay on standby */
+ WAIT_LSN_TYPE_FLUSH = 1, /* Waiting for flush on primary */
+ WAIT_LSN_TYPE_COUNT = 2
+} WaitLSNType;
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN operations. An item of
+ * waitLSNState->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+ /* LSN, which this process is waiting for */
+ XLogRecPtr waitLSN;
+
+ /* Process to wake up once the waitLSN is reached */
+ ProcNumber procno;
+
+ /* Heap membership flags for LSN types */
+ bool inHeap[WAIT_LSN_TYPE_COUNT];
+
+ /* Heap nodes for LSN types */
+ pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+ /*
+ * The minimum LSN values some process is waiting for. Used for the
+ * fast-path checking if we need to wake up any waiters after replaying a
+ * WAL record. Could be read lock-less. Update protected by WaitLSNLock.
+ */
+ pg_atomic_uint64 minWaitedLSN[WAIT_LSN_TYPE_COUNT];
+
+ /*
+ * Pairing heaps of waiting processes ordered by LSN values (least LSN
+ * is on top). Protected by WaitLSNLock.
+ */
+ pairingheap waitersHeap[WAIT_LSN_TYPE_COUNT];
+
+ /*
+ * An array with per-process information, indexed by the process number.
+ * Protected by WaitLSNLock.
+ */
+ WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
+ int64 timeout);
+
+#endif /* XLOG_WAIT_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..5b0ce383408 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, WaitLSN)
/*
* There also exist several built-in LWLock tranches. As with the predefined
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..237d33c538c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3265,6 +3265,10 @@ WaitEventIO
WaitEventIPC
WaitEventSet
WaitEventTimeout
+WaitLSNType
+WaitLSNState
+WaitLSNProcInfo
+WaitLSNResult
WaitPMResult
WalCloseMethod
WalCompression
--
2.39.5 (Apple Git-154)
Attachment: v20-0001-Add-pairingheap_initialize-for-shared-memory-usa.patch (application/octet-stream)
From 697d5aa28add566198bd1bccce5625bc35e1ea5a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Sun, 2 Nov 2025 11:56:53 +0800
Subject: [PATCH v20 1/3] Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.
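The split between allocating and initializing can be sketched in plain C. This is an illustration under stated assumptions, not the PostgreSQL code: malloc() stands in for palloc(), an ordinary struct stands in for a shared memory segment, and all names (SketchHeap, sketch_heap_initialize, sketch_heap_allocate, SketchSharedState) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef int (*sketch_comparator) (const void *a, const void *b, void *arg);

/* Hypothetical heap header, loosely modeled on a comparator/arg/root triple. */
typedef struct SketchHeap
{
	sketch_comparator compare;
	void	   *arg;
	void	   *root;			/* NULL means the heap is empty */
} SketchHeap;

/* Initialize a heap whose storage the caller already owns. */
static void
sketch_heap_initialize(SketchHeap *heap, sketch_comparator compare, void *arg)
{
	assert(heap != NULL);
	heap->compare = compare;
	heap->arg = arg;
	heap->root = NULL;
}

/* Allocate from process-local memory, then reuse the in-place initializer. */
static SketchHeap *
sketch_heap_allocate(sketch_comparator compare, void *arg)
{
	SketchHeap *heap = malloc(sizeof(SketchHeap));

	if (heap == NULL)
		return NULL;
	sketch_heap_initialize(heap, compare, arg);
	return heap;
}

/* A struct that embeds heaps directly, as a shared memory struct would. */
typedef struct SketchSharedState
{
	int			generation;
	SketchHeap	heaps[2];
} SketchSharedState;

/* Trivial comparator over ints, only so the sketch has something to pass. */
static int
sketch_cmp_int(const void *a, const void *b, void *arg)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	(void) arg;
	return (x > y) - (x < y);
}
```

A caller that owns a SketchSharedState can call sketch_heap_initialize(&state.heaps[0], sketch_cmp_int, NULL) without any allocation, which is the property the embedded heaps in shared memory rely on.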
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/lib/pairingheap.c | 18 ++++++++++++++++--
src/include/lib/pairingheap.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index 0aef8a88f1b..fa8431f7946 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
pairingheap *heap;
heap = (pairingheap *) palloc(sizeof(pairingheap));
+ pairingheap_initialize(heap, compare, arg);
+
+ return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory. Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+ void *arg)
+{
heap->ph_compare = compare;
heap->ph_arg = arg;
heap->ph_root = NULL;
-
- return heap;
}
/*
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 3c57d3fda1b..567586f2ecf 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+ pairingheap_comparator compare,
+ void *arg);
extern void pairingheap_free(pairingheap *heap);
extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
extern pairingheap_node *pairingheap_first(pairingheap *heap);
--
2.39.5 (Apple Git-154)
Attachment: v20-0003-Implement-WAIT-FOR-command.patch (application/octet-stream)
From 180a09f5d264b6b8ebd0db034ea89413751b23cd Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 3 Nov 2025 13:32:47 +0200
Subject: [PATCH v20 3/3] Implement WAIT FOR command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This command is useful when
the user makes some data changes on primary and needs a guarantee that
these changes are visible on standby.
WAIT FOR needs to wait without any snapshot held. Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why a separate utility command appears to be the most robust
way to implement this functionality. It's not possible to implement this as
a function. Previous experience shows that stored procedures also have
limitations in this aspect.
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
doc/src/sgml/high-availability.sgml | 54 ++++
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/wait_for.sgml | 234 +++++++++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xact.c | 6 +
src/backend/access/transam/xlog.c | 7 +
src/backend/access/transam/xlogrecovery.c | 11 +
src/backend/commands/Makefile | 3 +-
src/backend/commands/meson.build | 1 +
src/backend/commands/wait.c | 212 +++++++++++++++
src/backend/parser/gram.y | 33 ++-
src/backend/storage/lmgr/proc.c | 6 +
src/backend/tcop/pquery.c | 12 +-
src/backend/tcop/utility.c | 22 ++
src/include/commands/wait.h | 22 ++
src/include/nodes/parsenodes.h | 8 +
src/include/parser/kwlist.h | 2 +
src/include/tcop/cmdtaglist.h | 1 +
src/test/recovery/meson.build | 3 +-
src/test/recovery/t/049_wait_for_lsn.pl | 302 ++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
21 files changed, 931 insertions(+), 11 deletions(-)
create mode 100644 doc/src/sgml/ref/wait_for.sgml
create mode 100644 src/backend/commands/wait.c
create mode 100644 src/include/commands/wait.h
create mode 100644 src/test/recovery/t/049_wait_for_lsn.pl
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b47d8b4106e..742deb037b7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1376,6 +1376,60 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="read-your-writes-consistency">
+ <title>Read-Your-Writes Consistency</title>
+
+ <para>
+ In asynchronous replication, there is always a short window where changes
+ on the primary may not yet be visible on the standby due to replication
+ lag. This can lead to inconsistencies when an application writes data on
+ the primary and then immediately issues a read query on the standby.
+ However, it is possible to address this without switching to synchronous
+ replication.
+ </para>
+
+ <para>
+ To address this, PostgreSQL offers a mechanism for read-your-writes
+ consistency. The key idea is to ensure that a client sees its own writes
+ by synchronizing the WAL replay on the standby with the known point of
+ change on the primary.
+ </para>
+
+ <para>
+ This is achieved by the following steps. After performing write
+ operations, the application retrieves the current WAL location using a
+ function call like this.
+
+ <programlisting>
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+ </programlisting>
+ </para>
+
+ <para>
+ The <acronym>LSN</acronym> obtained from the primary is then communicated
+ to the standby server. This can be managed at the application level or
+ via the connection pooler. On the standby, the application issues the
+ <xref linkend="sql-wait-for"/> command to block further processing until
+ the standby's WAL replay process reaches (or exceeds) the specified
+ <acronym>LSN</acronym>.
+
+ <programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+ </programlisting>
+ Once the command returns a status of success, it guarantees that all
+ changes up to the provided <acronym>LSN</acronym> have been applied,
+ ensuring that subsequent read queries will reflect the latest updates.
+ </para>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous Archiving in Standby</title>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..e167406c744 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY update SYSTEM "update.sgml">
<!ENTITY vacuum SYSTEM "vacuum.sgml">
<!ENTITY values SYSTEM "values.sgml">
+<!ENTITY waitFor SYSTEM "wait_for.sgml">
<!-- applications and utilities -->
<!ENTITY clusterdb SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
new file mode 100644
index 00000000000..3b8e842d1de
--- /dev/null
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -0,0 +1,234 @@
+<!--
+doc/src/sgml/ref/wait_for.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait-for">
+ <indexterm zone="sql-wait-for">
+ <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle>WAIT FOR</refentrytitle>
+ <manvolnum>7</manvolnum>
+ <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>WAIT FOR</refname>
+ <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+
+ TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
+ NO_THROW
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+
+ <para>
+ Waits until recovery replays <parameter>lsn</parameter>.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
+ </para>
+
+ <para>
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Parameters</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><replaceable class="parameter">lsn</replaceable></term>
+ <listitem>
+ <para>
+ Specifies the target <acronym>LSN</acronym> to wait for.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
+ <listitem>
+ <para>
+ This clause specifies optional parameters for the wait operation.
+ The following parameters are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
+ <listitem>
+ <para>
+ When specified and <parameter>timeout</parameter> is greater than zero,
+ the command waits until <parameter>lsn</parameter> is reached or
+ the specified <parameter>timeout</parameter> has elapsed.
+ </para>
+ <para>
+ The <parameter>timeout</parameter> can be given as an integer number of
+ milliseconds. It can also be given as a string literal containing an
+ integer number of milliseconds or a number with a unit
+ (see <xref linkend="config-setting-names-values"/>).
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>NO_THROW</literal></term>
+ <listitem>
+ <para>
+ Specify to not throw an error in the case of timeout or
+ running on the primary. In this case, the result status can be obtained
+ from the return value.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Outputs</title>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>success</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that we have successfully reached
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>timeout</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the timeout happened before reaching
+ the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>not in recovery</literal></term>
+ <listitem>
+ <para>
+ This return value denotes that the database server is not in a recovery
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command, or it was promoted before
+ reaching the target <parameter>lsn</parameter>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </refsect1>
+
+ <refsect1>
+ <title>Notes</title>
+
+ <para>
+ The <command>WAIT FOR</command> command waits until
+ <parameter>lsn</parameter> has been replayed on the standby.
+ That is, after this command execution, the value returned by
+ <function>pg_last_wal_replay_lsn</function> should be greater than or equal
+ to the <parameter>lsn</parameter> value. This is useful to achieve
+ read-your-writes consistency while using an async replica for reads and
+ the primary for writes. In that case, the <acronym>lsn</acronym> of the last
+ modification should be stored on the client application side or the
+ connection pooler side.
+ </para>
+
+ <para>
+ The <command>WAIT FOR</command> command should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
+ However, if <command>WAIT FOR</command> is
+ called on a primary promoted from a standby and <literal>lsn</literal>
+ was already replayed, then the <command>WAIT FOR</command> command just
+ exits immediately.
+ </para>
+
+</refsect1>
+
+ <refsect1>
+ <title>Examples</title>
+
+ <para>
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
+
+ <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+</programlisting>
+
+ Then an application could run <command>WAIT FOR</command>
+ with the <parameter>lsn</parameter> obtained from the primary. After that, the
+ changes made on the primary are guaranteed to be visible on the replica.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20';
+ status
+--------
+ success
+(1 row)
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown.
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+ERROR: timed out while waiting for target LSN 0/0306EE20 to be replayed; current replay LSN 0/0306EA60
+</programlisting>
+ </para>
+
+ <para>
+ The same example using <command>WAIT FOR</command> with the
+ <parameter>NO_THROW</parameter> option:
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+ status
+--------
+ timeout
+(1 row)
+</programlisting>
+ </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2cf02c37b17 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
&update;
&vacuum;
&values;
+ &waitFor;
</reference>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2cf3d4e92b7..092e197eba3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -31,6 +31,7 @@
#include "access/xloginsert.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_enum.h"
@@ -2843,6 +2844,11 @@ AbortTransaction(void)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Clear wait information and command progress indicator */
pgstat_report_wait_end();
pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd91bcd68ec..45a16bd1ec2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -62,6 +62,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
@@ -6227,6 +6228,12 @@ StartupXLOG(void)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * Wake up all waiters for replay LSN. They need to report an error that
+ * recovery was ended before reaching the target LSN.
+ */
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+
/*
* Shutdown the recovery environment. This must occur after
* RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 3e3c4da01a2..0e51148110f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogreader.h"
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
#include "commands/tablespace.h"
@@ -1838,6 +1839,16 @@ PerformWalRecovery(void)
break;
}
+ /*
+ * If we replayed an LSN that someone was waiting for then walk
+ * over the shared memory array and set latches to notify the
+ * waiters.
+ */
+ if (waitLSNState &&
+ (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index cb2fbdc7c60..f99acfd2b4b 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -64,6 +64,7 @@ OBJS = \
vacuum.o \
vacuumparallel.o \
variable.o \
- view.o
+ view.o \
+ wait.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index dd4cde41d32..9f640ad4810 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -53,4 +53,5 @@ backend_sources += files(
'vacuumparallel.c',
'variable.c',
'view.c',
+ 'wait.c',
)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00000000000..67068a92dbf
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,212 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ * Implements WAIT FOR, which allows waiting for events such as
+ * time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
+#include "commands/defrem.h"
+#include "commands/wait.h"
+#include "executor/executor.h"
+#include "parser/parse_node.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+
+
+void
+ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
+{
+ XLogRecPtr lsn;
+ int64 timeout = 0;
+ WaitLSNResult waitLSNResult;
+ bool throw = true;
+ TupleDesc tupdesc;
+ TupOutputState *tstate;
+ const char *result = "<unset>";
+ bool timeout_specified = false;
+ bool no_throw_specified = false;
+
+ /* Parse and validate the mandatory LSN */
+ lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+ CStringGetDatum(stmt->lsn_literal)));
+
+ foreach_node(DefElem, defel, stmt->options)
+ {
+ if (strcmp(defel->defname, "timeout") == 0)
+ {
+ char *timeout_str;
+ const char *hintmsg;
+ double result;
+
+ if (timeout_specified)
+ errorConflictingDefElem(defel, pstate);
+ timeout_specified = true;
+
+ timeout_str = defGetString(defel);
+
+ if (!parse_real(timeout_str, &result, GUC_UNIT_MS, &hintmsg))
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeout value: \"%s\"", timeout_str),
+ hintmsg ? errhint("%s", _(hintmsg)) : 0);
+ }
+
+ /*
+ * Get rid of any fractional part in the input. This is so we
+ * don't fail on just-out-of-range values that would round into
+ * range.
+ */
+ result = rint(result);
+
+ /* Range check */
+ if (unlikely(isnan(result) || !FLOAT8_FITS_IN_INT64(result)))
+ ereport(ERROR,
+ errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("timeout value is out of range"));
+
+ if (result < 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("timeout cannot be negative"));
+
+ timeout = (int64) result;
+ }
+ else if (strcmp(defel->defname, "no_throw") == 0)
+ {
+ if (no_throw_specified)
+ errorConflictingDefElem(defel, pstate);
+
+ no_throw_specified = true;
+
+ throw = !defGetBoolean(defel);
+ }
+ else
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("option \"%s\" not recognized",
+ defel->defname),
+ parser_errposition(pstate, defel->location));
+ }
+ }
+
+ /*
+ * We are going to wait for the LSN replay. We should first care that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * Otherwise, our snapshot could prevent the replay of WAL records
+ * implying a kind of self-deadlock. This is the reason why WAIT FOR is a
+ * command, not a procedure or function.
+ *
+ * First, we should check there is no active snapshot. According to
+ * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+ * processed with a snapshot. Thankfully, we can pop this snapshot,
+ * because PortalRunUtility() can tolerate this.
+ */
+ if (ActiveSnapshotSet())
+ PopActiveSnapshot();
+
+ /*
+ * Second, invalidate the catalog snapshot if any. With that, we should be
+ * done with the preparation.
+ */
+ InvalidateCatalogSnapshot();
+
+ /* Give up if there is still an active or registered snapshot. */
+ if (HaveRegisteredOrActiveSnapshot())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAIT FOR must be only called without an active or registered snapshot"),
+ errdetail("WAIT FOR cannot be executed from a function or a procedure or within a transaction with an isolation level higher than READ COMMITTED."));
+
+ /*
+ * As the result we should hold no snapshot, and correspondingly our xmin
+ * should be unset.
+ */
+ Assert(MyProc->xmin == InvalidTransactionId);
+
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+
+ /*
+ * Process the result of WaitForLSN(). Throw appropriate error if
+ * needed.
+ */
+ switch (waitLSNResult)
+ {
+ case WAIT_LSN_RESULT_SUCCESS:
+ /* Nothing to do on success */
+ result = "success";
+ break;
+
+ case WAIT_LSN_RESULT_TIMEOUT:
+ if (throw)
+ ereport(ERROR,
+ errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ else
+ result = "timeout";
+ break;
+
+ case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
+ if (throw)
+ {
+ if (PromoteIsTriggered())
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ LSN_FORMAT_ARGS(lsn),
+ LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ }
+ else
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is not in progress"),
+ errhint("Waiting for the replay LSN can only be executed during recovery."));
+ }
+ else
+ result = "not in recovery";
+ break;
+ }
+
+ /* need a tuple descriptor representing a single TEXT column */
+ tupdesc = WaitStmtResultDesc(stmt);
+
+ /* prepare for projection of tuples */
+ tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsVirtual);
+
+ /* Send it */
+ do_text_output_oneline(tstate, result);
+
+ end_tup_output(tstate);
+}
+
+TupleDesc
+WaitStmtResultDesc(WaitStmt *stmt)
+{
+ TupleDesc tupdesc;
+
+ /* Need a tuple descriptor representing a single TEXT column */
+ tupdesc = CreateTemplateTupleDesc(1);
+ TupleDescInitEntry(tupdesc, (AttrNumber) 1, "status",
+ TEXTOID, -1, 0);
+ return tupdesc;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a4b29c822e8..a4e6f80504b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -308,7 +308,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
UnlistenStmt UpdateStmt VacuumStmt
VariableResetStmt VariableSetStmt VariableShowStmt
- ViewStmt CheckPointStmt CreateConversionStmt
+ ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
DeallocateStmt PrepareStmt ExecuteStmt
DropOwnedStmt ReassignOwnedStmt
AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -325,6 +325,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <boolean> opt_concurrently
%type <dbehavior> opt_drop_behavior
%type <list> opt_utility_option_list
+%type <list> opt_wait_with_clause
%type <list> utility_option_list
%type <defelt> utility_option_elem
%type <str> utility_option_name
@@ -678,7 +679,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
json_object_constructor_null_clause_opt
json_array_constructor_null_clause_opt
-
/*
* Non-keyword token types. These are hard-wired into the "flex" lexer.
* They must be listed first so that their numeric codes do not depend on
@@ -748,7 +748,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
- LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+ LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN_P
MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE MERGE_ACTION METHOD
MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -792,7 +792,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
VERBOSE VERSION_P VIEW VIEWS VIRTUAL VOLATILE
- WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+ WAIT WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1120,6 +1120,7 @@ stmt:
| VariableSetStmt
| VariableShowStmt
| ViewStmt
+ | WaitStmt
| /*EMPTY*/
{ $$ = NULL; }
;
@@ -16462,6 +16463,26 @@ xml_passing_mech:
| BY VALUE_P
;
+/*****************************************************************************
+ *
+ * WAIT FOR LSN
+ *
+ *****************************************************************************/
+
+WaitStmt:
+ WAIT FOR LSN_P Sconst opt_wait_with_clause
+ {
+ WaitStmt *n = makeNode(WaitStmt);
+ n->lsn_literal = $4;
+ n->options = $5;
+ $$ = (Node *) n;
+ }
+ ;
+
+opt_wait_with_clause:
+ WITH '(' utility_option_list ')' { $$ = $3; }
+ | /*EMPTY*/ { $$ = NIL; }
+ ;
/*
* Aggregate decoration clauses
@@ -17949,6 +17970,7 @@ unreserved_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18119,6 +18141,7 @@ unreserved_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHITESPACE_P
| WITHIN
| WITHOUT
@@ -18565,6 +18588,7 @@ bare_label_keyword:
| LOCK_P
| LOCKED
| LOGGED
+ | LSN_P
| MAPPING
| MATCH
| MATCHED
@@ -18776,6 +18800,7 @@ bare_label_keyword:
| VIEWS
| VIRTUAL
| VOLATILE
+ | WAIT
| WHEN
| WHITESPACE_P
| WORK
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 96f29aafc39..26b201eadb8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xlogutils.h"
+#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
@@ -947,6 +948,11 @@ ProcKill(int code, Datum arg)
*/
LWLockReleaseAll();
+ /*
+ * Cleanup waiting for LSN if any.
+ */
+ WaitLSNCleanup();
+
/* Cancel any pending condition variable sleep, too */
ConditionVariableCancelSleep();
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 74179139fa9..fde78c55160 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1158,10 +1158,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
MemoryContextSwitchTo(portal->portalContext);
/*
- * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
- * under us, so don't complain if it's now empty. Otherwise, our snapshot
- * should be the top one; pop it. Note that this could be a different
- * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+ * Some utility commands (e.g., VACUUM, WAIT FOR) pop the ActiveSnapshot
+ * stack from under us, so don't complain if it's now empty. Otherwise,
+ * our snapshot should be the top one; pop it. Note that this could be a
+ * different snapshot from the one we made above; see
+ * EnsurePortalSnapshotExists.
*/
if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
{
@@ -1738,7 +1739,8 @@ PlannedStmtRequiresSnapshot(PlannedStmt *pstmt)
IsA(utilityStmt, ListenStmt) ||
IsA(utilityStmt, NotifyStmt) ||
IsA(utilityStmt, UnlistenStmt) ||
- IsA(utilityStmt, CheckPointStmt))
+ IsA(utilityStmt, CheckPointStmt) ||
+ IsA(utilityStmt, WaitStmt))
return false;
return true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 918db53dd5e..082967c0a86 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
#include "commands/user.h"
#include "commands/vacuum.h"
#include "commands/view.h"
+#include "commands/wait.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
#include "postmaster/bgwriter.h"
@@ -266,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_PrepareStmt:
case T_UnlistenStmt:
case T_VariableSetStmt:
+ case T_WaitStmt:
{
/*
* These modify only backend-local state, so they're OK to run
@@ -1055,6 +1057,12 @@ standard_ProcessUtility(PlannedStmt *pstmt,
break;
}
+ case T_WaitStmt:
+ {
+ ExecWaitStmt(pstate, (WaitStmt *) parsetree, dest);
+ }
+ break;
+
default:
/* All other statement types have event trigger support */
ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2059,6 +2067,9 @@ UtilityReturnsTuples(Node *parsetree)
case T_VariableShowStmt:
return true;
+ case T_WaitStmt:
+ return true;
+
default:
return false;
}
@@ -2114,6 +2125,9 @@ UtilityTupleDescriptor(Node *parsetree)
return GetPGVariableResultDesc(n->name);
}
+ case T_WaitStmt:
+ return WaitStmtResultDesc((WaitStmt *) parsetree);
+
default:
return NULL;
}
@@ -3091,6 +3105,10 @@ CreateCommandTag(Node *parsetree)
}
break;
+ case T_WaitStmt:
+ tag = CMDTAG_WAIT;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
@@ -3689,6 +3707,10 @@ GetCommandLogLevel(Node *parsetree)
lev = LOGSTMT_DDL;
break;
+ case T_WaitStmt:
+ lev = LOGSTMT_ALL;
+ break;
+
/* already-planned queries */
case T_PlannedStmt:
{
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 00000000000..ce332134fb3
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,22 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ * prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+
+#include "nodes/parsenodes.h"
+#include "parser/parse_node.h"
+#include "tcop/dest.h"
+
+extern void ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest);
+extern TupleDesc WaitStmtResultDesc(WaitStmt *stmt);
+
+#endif /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ecbddd12e1b..d14294a4ece 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,4 +4385,12 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+typedef struct WaitStmt
+{
+ NodeTag type;
+ char *lsn_literal; /* LSN string from grammar */
+ List *options; /* List of DefElem nodes */
+} WaitStmt;
+
+
#endif /* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 84182eaaae2..5d4fe27ef96 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -270,6 +270,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -496,6 +497,7 @@ PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("virtual", VIRTUAL, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..c4606d65043 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 52993c32dbb..523a5cd5b52 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -56,7 +56,8 @@ tests += {
't/045_archive_restartpoint.pl',
't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
- 't/048_vacuum_horizon_floor.pl'
+ 't/048_vacuum_horizon_floor.pl',
+ 't/049_wait_for_lsn.pl',
],
},
}
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
new file mode 100644
index 00000000000..e0ddb06a2f0
--- /dev/null
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -0,0 +1,302 @@
+# Checks waiting for the LSN replay on standby using
+# the WAIT FOR command.
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# Add some content and take a backup
+$node_primary->safe_psql('postgres',
+ "CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby->append_conf(
+ 'postgresql.conf', qq[
+ recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# 1. Make sure that WAIT FOR works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn1}' WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on the primary before.
+ok((split("\n", $output))[-1] >= 0,
+ "standby reached the same LSN as primary after WAIT FOR");
+
+# 2. Check that new data is visible after calling WAIT FOR
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}';
+ SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok((split("\n", $output))[-1] eq 30,
+ "standby reached the same LSN as primary");
+
+# 3. Check that waiting for an unreachable LSN triggers the timeout. The
+# unreachable LSN must be far ahead, so that WAL records issued by
+# a concurrent autovacuum cannot reach it.
+my $lsn3 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+ "WAIT FOR LSN '${lsn2}' WITH (timeout '10ms');");
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${lsn3}' WITH (timeout '1000ms');",
+ stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+ "get timeout on waiting for unreachable LSN");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success",
+ "WAIT FOR returns correct status after successful waiting");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
+
+# 4. Check that WAIT FOR triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
+
+$node_primary->psql('postgres', "WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+ "get an error when running on the primary");
+
+$node_standby->psql(
+ 'postgres',
+ "BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; WAIT FOR LSN '${lsn3}';",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running in a transaction with an isolation level higher than READ COMMITTED"
+);
+
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+ BEGIN
+ EXECUTE format('WAIT FOR LSN %L;', target_lsn);
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
+ 'postgres',
+ "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+ stderr => \$stderr);
+ok( $stderr =~
+ /WAIT FOR must be only called without an active or registered snapshot/,
+ "get an error when running within another function");
+
+# 5. Check parameter validation error cases on standby before promotion
+my $test_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Test negative timeout
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '-1000ms');",
+ stderr => \$stderr);
+ok($stderr =~ /timeout cannot be negative/, "get error for negative timeout");
+
+# Test unknown parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (unknown_param 'value');",
+ stderr => \$stderr);
+ok($stderr =~ /option "unknown_param" not recognized/,
+ "get error for unknown parameter");
+
+# Test duplicate TIMEOUT parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout '1000', timeout '2000');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate TIMEOUT parameter");
+
+# Test duplicate NO_THROW parameter with WITH clause
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (no_throw, no_throw);",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate NO_THROW parameter");
+
+# Test syntax error - options without WITH keyword
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' (timeout '100ms');",
+ stderr => \$stderr);
+ok($stderr =~ /syntax error/,
+ "get syntax error when options specified without WITH keyword");
+
+# Test syntax error - missing LSN
+$node_standby->psql('postgres', "WAIT FOR TIMEOUT 1000;", stderr => \$stderr);
+ok($stderr =~ /syntax error/, "get syntax error for missing LSN");
+
+# Test invalid LSN format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN 'invalid_lsn';",
+ stderr => \$stderr);
+ok($stderr =~ /invalid input syntax for type pg_lsn/,
+ "get error for invalid LSN format");
+
+# Test invalid timeout format
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (timeout 'invalid');",
+ stderr => \$stderr);
+ok($stderr =~ /invalid timeout value/,
+ "get error for invalid timeout format");
+
+# Test new WITH clause syntax
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn2}' WITH (timeout '0.1s', no_throw);]);
+ok($output eq "success", "WAIT FOR WITH clause syntax works correctly");
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn3}' WITH (timeout 100, no_throw);]);
+ok($output eq "timeout",
+ "WAIT FOR WITH clause returns correct timeout status");
+
+# Test WITH clause error case - invalid option
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (invalid_option 'value');",
+ stderr => \$stderr);
+ok( $stderr =~ /option "invalid_option" not recognized/,
+ "get error for invalid WITH clause option");
+
+# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# psql sessions each waiting for a corresponding insertion. When waiting is
+# finished, a function logs whether as many rows are visible as
+# expected.
+$node_primary->safe_psql(
+ 'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+ DECLARE
+ count int;
+ BEGIN
+ SELECT count(*) FROM wait_test INTO count;
+ IF count >= 31 + i THEN
+ RAISE LOG 'count %', i;
+ END IF;
+ END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (${i});");
+ my $lsn =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $psql_sessions[$i] = $node_standby->background_psql('postgres');
+ $psql_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn}';
+ SELECT log_count(${i});
+ ]);
+}
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("count ${i}", $log_offset);
+ $psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 7. Check that the standby promotion terminates the wait on LSN. Start
+# waiting for an unreachable LSN then promote. Check the log for the relevant
+# error message. Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}';
+]);
+
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed. Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+
+ok(1, 'wait for already replayed LSN exits immediately even after promotion');
+
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ok($output eq "not in recovery",
+ "WAIT FOR returns correct status after standby promotion");
+
+
+$node_standby->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 237d33c538c..e34dcf97df8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3270,6 +3270,7 @@ WaitLSNState
WaitLSNProcInfo
WaitLSNResult
WaitPMResult
+WaitStmt
WalCloseMethod
WalCompression
WalInsertClass
--
2.39.5 (Apple Git-154)
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Hi,
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...
Greetings,
Andres Freund
Hi!
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...
The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v1-0001-Use-WAIT-FOR-LSN-in-PostgreSQL-Test-Cluster-wait_.patch (application/octet-stream)
From bb12721dc3efbd213416adb8c3563bb0c11c023b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 5 Nov 2025 11:10:04 +0200
Subject: [PATCH v1] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 27 ++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..85b5d9863cd 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,8 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3340,8 +3342,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
chomp($isrecovery);
if ($isrecovery eq 't')
{
@@ -3360,6 +3360,29 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when appropriate
+ if (($mode eq 'replay') && ($isrecovery eq 't'))
+ {
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' WITH (timeout '${PostgreSQL::Test::Utils::timeout_default}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+ ${output});
+ diag qq(Last pg_stat_replication contents:
+ ${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.39.5 (Apple Git-154)
Hi,
On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi!
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
Thanks for pushing them!
One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...
The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.
Interesting, could this approach be extended to the flush and other
modes as well? I might need to spend some time to understand it before
I can provide a meaningful review.
Best,
Xuneng
On Wed, Nov 5, 2025 at 4:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
Thanks for pushing them!
One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...
The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.
Interesting, could this approach be extended to the flush and other
modes as well? I might need to spend some time to understand it before
I can provide a meaningful review.
I think we might end up extending WaitLSNType enum. However, I hate
inHeap and heapNode arrays growing in WaitLSNProcInfo as they are
allocated per process. I found that we could optimize WaitLSNProcInfo
struct turning them into simple variables because a single process can
wait only for a single LSN at a time. Please, check the attached
patch.
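To make the per-process saving concrete, here is a rough standalone
sketch comparing the two layouts. Everything in it is a stand-in:
DemoHeapNode only mimics a three-pointer heap node, DEMO_LSN_TYPE_COUNT = 4
is a hypothetical value chosen for illustration (not the current
WAIT_LSN_TYPE_COUNT), and the struct and function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of LSN types, chosen only for illustration. */
#define DEMO_LSN_TYPE_COUNT 4

/* Stand-in for a pairing heap node: three links, as in a typical pairing heap. */
typedef struct DemoHeapNode
{
	struct DemoHeapNode *first_child;
	struct DemoHeapNode *next_sibling;
	struct DemoHeapNode *prev_or_parent;
} DemoHeapNode;

/* Layout before the patch: one flag and one heap node per LSN type. */
typedef struct ProcInfoArrays
{
	uint64_t	waitLSN;
	int			procno;
	bool		inHeap[DEMO_LSN_TYPE_COUNT];
	DemoHeapNode heapNode[DEMO_LSN_TYPE_COUNT];
} ProcInfoArrays;

/* Layout after the patch: a single flag and node, plus the LSN type waited for. */
typedef struct ProcInfoScalar
{
	uint64_t	waitLSN;
	int			lsnType;
	int			procno;
	bool		inHeap;
	DemoHeapNode heapNode;
} ProcInfoScalar;

/* Compile-time check that the scalar layout is the smaller one. */
static_assert(sizeof(ProcInfoScalar) < sizeof(ProcInfoArrays),
			  "scalar layout should be smaller than the array layout");

/* Bytes saved per process slot under the assumptions above. */
size_t
demo_bytes_saved_per_process(void)
{
	return sizeof(ProcInfoArrays) - sizeof(ProcInfoScalar);
}
```

In the array layout every additional WaitLSNType costs one more heap node
and one more flag in each process slot, while the scalar layout stays
constant in size at the price of one extra lsnType field.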
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v1-0001-Optimize-shared-memory-usage-for-WaitLSNProcInfo.patch (application/octet-stream)
From 89fde94dd74810d2bf349af33b7ca9585080c0f6 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 7 Nov 2025 23:49:47 +0200
Subject: [PATCH v1] Optimize shared memory usage for WaitLSNProcInfo
We need separate pairing heaps for different WaitLSNType's, because there
might be waiters for different LSN's at the same time. However, one process
can wait only for one type of LSN at a time. So, no need for inHeap
and heapNode fields to be arrays.
---
src/backend/access/transam/xlogwait.c | 40 ++++++++++++---------------
src/include/access/xlogwait.h | 7 +++--
2 files changed, 22 insertions(+), 25 deletions(-)
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 34fa41ed9b2..e1eb21be125 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -90,7 +90,7 @@ WaitLSNShmemInit(void)
for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
{
pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
- pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+ pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, NULL);
}
/* Initialize process info array */
@@ -106,9 +106,8 @@ WaitLSNShmemInit(void)
static int
waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
{
- int i = (uintptr_t) arg;
- const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
- const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode, b);
if (aproc->waitLSN < bproc->waitLSN)
return 1;
@@ -132,7 +131,7 @@ updateMinWaitedLSN(WaitLSNType lsnType)
if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
{
pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
- WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode, node);
minWaitedLSN = procInfo->waitLSN;
}
@@ -154,10 +153,11 @@ addLSNWaiter(XLogRecPtr lsn, WaitLSNType lsnType)
procInfo->procno = MyProcNumber;
procInfo->waitLSN = lsn;
+ procInfo->lsnType = lsnType;
- Assert(!procInfo->inHeap[i]);
- pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
- procInfo->inHeap[i] = true;
+ Assert(!procInfo->inHeap);
+ pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode);
+ procInfo->inHeap = true;
updateMinWaitedLSN(lsnType);
LWLockRelease(WaitLSNLock);
@@ -176,10 +176,10 @@ deleteLSNWaiter(WaitLSNType lsnType)
LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
- if (procInfo->inHeap[i])
+ if (procInfo->inHeap)
{
- pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
- procInfo->inHeap[i] = false;
+ pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode);
+ procInfo->inHeap = false;
updateMinWaitedLSN(lsnType);
}
@@ -228,7 +228,7 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
WaitLSNProcInfo *procInfo;
/* Get procInfo using appropriate heap node */
- procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+ procInfo = pairingheap_container(WaitLSNProcInfo, heapNode, node);
if (XLogRecPtrIsValid(currentLSN) && procInfo->waitLSN > currentLSN)
break;
@@ -238,7 +238,7 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
(void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
/* Update appropriate flag */
- procInfo->inHeap[i] = false;
+ procInfo->inHeap = false;
if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
break;
@@ -285,20 +285,14 @@ WaitLSNCleanup(void)
{
if (waitLSNState)
{
- int i;
-
/*
- * We do a fast-path check of the heap flags without the lock. These
- * flags are set to true only by the process itself. So, it's only
+ * We do a fast-path check of the inHeap flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only
* possible to get a false positive. But that will be eliminated by a
* recheck inside deleteLSNWaiter().
*/
-
- for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
- {
- if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
- deleteLSNWaiter((WaitLSNType) i);
- }
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter(waitLSNState->procInfos[MyProcNumber].lsnType);
}
}
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 4dc328b1b07..46bac74988b 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -50,14 +50,17 @@ typedef struct WaitLSNProcInfo
/* LSN, which this process is waiting for */
XLogRecPtr waitLSN;
+ /* The type of LSN to wait */
+ WaitLSNType lsnType;
+
/* Process to wake up once the waitLSN is reached */
ProcNumber procno;
/* Heap membership flags for LSN types */
- bool inHeap[WAIT_LSN_TYPE_COUNT];
+ bool inHeap;
/* Heap nodes for LSN types */
- pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+ pairingheap_node heapNode;
} WaitLSNProcInfo;
/*
--
2.39.5 (Apple Git-154)
Hi,
On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi!
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...
The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.
------
Regards,
Alexander Korotkov
Supabase
I tested the patch using make check-world, and it worked well. I also
made a few adjustments:
- Added an unconditional chomp($isrecovery) after querying
pg_is_in_recovery() to prevent newline mismatches when $target_lsn is
accidentally defined.
- Added chomp($output) to normalize the result from WAIT FOR LSN
before comparison.
At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or other modes) to complete in such cases? If not, and the
TAP tests are the only intended use, this approach might be a bit of
overkill.
--
Best,
Xuneng
Attachments:
v2-0001-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From e24a00603080d476087b8e327284d849f72d86a8 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 12 Nov 2025 13:32:05 +0800
Subject: [PATCH v2] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 32 +++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..41784553d4b 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+ chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
- chomp($isrecovery);
if ($isrecovery eq 't')
{
$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,32 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when appropriate
+ if (($mode eq 'replay') && ($isrecovery eq 't'))
+ {
+ my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' WITH (timeout '${timeout}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+ chomp($output);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+${output});
+ diag qq(Last pg_stat_replication contents:
+${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
+ # Polling for other modes or when WAIT FOR LSN is not applicable
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.51.0
On 11/5/25 10:51, Alexander Korotkov wrote:
Hi!
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
Hi,
The new TAP test 049_wait_for_lsn.pl introduced by this commit takes
a long time - about 65 seconds on my laptop. That's about 25%
of the whole src/test/recovery, more than any other test.
And most of the time there's nothing happening - these are the two log
messages showing the 60-second wait:
2025-11-13 21:12:39.949 CET checkpointer[562597] LOG: checkpoint
complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s)
added, 0 removed, 2 recycled; write=0.906 s, sync=0.001 s, total=0.907
s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB,
estimate=32768 kB; lsn=0/040000B8, redo lsn=0/04000060
2025-11-13 21:13:38.994 CET client backend[562727] 049_wait_for_lsn.pl
ERROR: recovery is not in progress
So there's a checkpoint, 60 seconds of nothing, and then a failure. I
haven't looked into why it waits for 1 minute exactly, but adding 60
seconds to check-world is somewhat annoying.
While at it, I noticed a couple of comments refer to WaitForLSNReplay,
but I think that got renamed simply to WaitForLSN.
regards
--
Tomas Vondra
Hi Tomas,
On Fri, Nov 14, 2025 at 4:32 AM Tomas Vondra <tomas@vondra.me> wrote:
On 11/5/25 10:51, Alexander Korotkov wrote:
Hi!
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
Hi,
The new TAP test 049_wait_for_lsn.pl introduced by this commit takes
a long time - about 65 seconds on my laptop. That's about 25%
of the whole src/test/recovery, more than any other test.
And most of the time there's nothing happening - these are the two log
messages showing the 60-second wait:
2025-11-13 21:12:39.949 CET checkpointer[562597] LOG: checkpoint
complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s)
added, 0 removed, 2 recycled; write=0.906 s, sync=0.001 s, total=0.907
s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB,
estimate=32768 kB; lsn=0/040000B8, redo lsn=0/04000060
2025-11-13 21:13:38.994 CET client backend[562727] 049_wait_for_lsn.pl
ERROR: recovery is not in progress
So there's a checkpoint, 60 seconds of nothing, and then a failure. I
haven't looked into why it waits for 1 minute exactly, but adding 60
seconds to check-world is somewhat annoying.
Thanks for looking into this!
I did a quick analysis for this prolonged waiting:
In WaitLSNWakeup() (xlogwait.c:267), the fast-path check incorrectly
handled InvalidXLogRecPtr:
/* Fast path check */
if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return; // Issue: Returns early when currentLSN = 0
When currentLSN = InvalidXLogRecPtr (0), meaning "wake all waiters",
the check compared:
- minWaitedLSN (e.g., 0x570CC048) > 0 → TRUE
- Result: function returned early without waking anyone
When It Happened
During standby promotion, xlog.c:6246 calls:
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
This should wake all LSN waiters, but the bug prevented it. WAIT FOR
LSN commands could wait indefinitely. Test 049_wait_for_lsn.pl took 68
seconds instead of ~9 seconds.
If the above analysis is sound, the fix could look like this:
Proposed fix:
Added a validity check before the comparison:
/*
* Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
* "wake all waiters" (e.g., during promotion when recovery ends).
*/
if (XLogRecPtrIsValid(currentLSN) &&
pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return;
Result:
Test time: 68s → 9s
WAIT FOR LSN exits immediately on promotion (62ms vs 60s)
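To illustrate the analysis above, here is a minimal standalone sketch of
the two fast-path conditions. It is not the PostgreSQL code: XLogRecPtr
and InvalidXLogRecPtr are redefined locally as stand-ins, and
skip_wakeup_old(), skip_wakeup_fixed() and demo_promotion_wakeup() are
hypothetical names that only model whether WaitLSNWakeup() would return
early.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the PostgreSQL definitions (assumption: 0 is invalid). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Old fast path: true means "return early, wake nobody". */
bool
skip_wakeup_old(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
	return minWaitedLSN > currentLSN;
}

/*
 * Fixed fast path: an invalid currentLSN means "wake all waiters", so the
 * early return is only allowed for a valid currentLSN.
 */
bool
skip_wakeup_fixed(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
	return currentLSN != InvalidXLogRecPtr && minWaitedLSN > currentLSN;
}

/* Models the promotion case, where the wakeup is requested with an invalid LSN. */
void
demo_promotion_wakeup(void)
{
	XLogRecPtr	minWaited = 0x570CC048;

	/* Old check: any waited LSN is > 0, so the wakeup is wrongly skipped. */
	assert(skip_wakeup_old(minWaited, InvalidXLogRecPtr));

	/* Fixed check: the wakeup proceeds and all waiters are woken. */
	assert(!skip_wakeup_fixed(minWaited, InvalidXLogRecPtr));

	/* For a valid currentLSN both versions agree. */
	assert(skip_wakeup_old(minWaited, 0x100) ==
		   skip_wakeup_fixed(minWaited, 0x100));
}
```

Calling demo_promotion_wakeup() from a main() exercises all three cases;
only the invalid-LSN case differs between the two versions.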
While at it, I noticed a couple of comments refer to WaitForLSNReplay,
but I think that got renamed simply to WaitForLSN.
Please check the attached patch for replacing them.
--
Best,
Xuneng
Attachments:
v1-0001-Fix-incorrect-function-name-in-comments.patch (application/octet-stream)
From ce6227035eab97b6a67d97fd58e88dc1392a47c7 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 14 Nov 2025 09:39:31 +0800
Subject: [PATCH v1] Fix incorrect function name in comments
Update comments to reference WaitForLSN() instead of the outdated
WaitForLSNReplay() function name.
---
src/backend/commands/wait.c | 2 +-
src/include/access/xlogwait.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 67068a92dbf..9c4764cf896 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -143,7 +143,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
/*
- * Process the result of WaitForLSNReplay(). Throw appropriate error if
+ * Process the result of WaitForLSN(). Throw appropriate error if
* needed.
*/
switch (waitLSNResult)
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 4dc328b1b07..f43e481c3b9 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -20,7 +20,7 @@
#include "tcop/dest.h"
/*
- * Result statuses for WaitForLSNReplay().
+ * Result statuses for WaitForLSN().
*/
typedef enum
{
--
2.51.0
v1-0001-Fix-WaitLSNWakeup-fast-path-check-for-InvalidXLog.patch (application/octet-stream)
From de673ec025074cd95ad4a4e53e2c26fcc14d5a4a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Fri, 14 Nov 2025 09:34:03 +0800
Subject: [PATCH v1] Fix WaitLSNWakeup() fast-path check for InvalidXLogRecPtr
WaitLSNWakeup() incorrectly returned early when called with
InvalidXLogRecPtr (meaning "wake all waiters"), because the fast-path
check compared minWaitedLSN > 0 without validating currentLSN first.
This caused WAIT FOR LSN commands to wait indefinitely during standby
promotion until random signals woke them.
Add XLogRecPtrIsValid() check before the comparison so InvalidXLogRecPtr
bypasses the fast-path and wakes all waiters immediately.
---
src/backend/access/transam/xlogwait.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 34fa41ed9b2..78de93db47f 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -270,8 +270,12 @@ WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN)
Assert(i >= 0 && i < (int) WAIT_LSN_TYPE_COUNT);
- /* Fast path check */
- if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
+ /*
+ * Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
+ * "wake all waiters" (e.g., during promotion when recovery ends).
+ */
+ if (XLogRecPtrIsValid(currentLSN) &&
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return;
wakeupWaiters(lsnType, currentLSN);
--
2.51.0
Hi, Xuneng!
On Fri, Nov 14, 2025 at 3:50 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Fri, Nov 14, 2025 at 4:32 AM Tomas Vondra <tomas@vondra.me> wrote:
On 11/5/25 10:51, Alexander Korotkov wrote:
Hi!
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
Hi,
The new TAP test 049_wait_for_lsn.pl introduced by this commit takes
a long time - about 65 seconds on my laptop. That's about 25%
of the whole src/test/recovery, more than any other test.
And most of the time there's nothing happening - these are the two log
messages showing the 60-second wait:
2025-11-13 21:12:39.949 CET checkpointer[562597] LOG: checkpoint
complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s)
added, 0 removed, 2 recycled; write=0.906 s, sync=0.001 s, total=0.907
s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB,
estimate=32768 kB; lsn=0/040000B8, redo lsn=0/04000060
2025-11-13 21:13:38.994 CET client backend[562727] 049_wait_for_lsn.pl
ERROR: recovery is not in progress
So there's a checkpoint, 60 seconds of nothing, and then a failure. I
haven't looked into why it waits for 1 minute exactly, but adding 60
seconds to check-world is somewhat annoying.
Thanks for looking into this!
I did a quick analysis for this prolonged waiting:
In WaitLSNWakeup() (xlogwait.c:267), the fast-path check incorrectly
handled InvalidXLogRecPtr:
/* Fast path check */
if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return; // Issue: Returns early when currentLSN = 0
When currentLSN = InvalidXLogRecPtr (0), meaning "wake all waiters",
the check compared:
- minWaitedLSN (e.g., 0x570CC048) > 0 → TRUE
- Result: function returned early without waking anyone
When It Happened
During standby promotion, xlog.c:6246 calls:
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
This should wake all LSN waiters, but the bug prevented it. WAIT FOR
LSN commands could wait indefinitely. Test 049_wait_for_lsn.pl took 68
seconds instead of ~9 seconds.
If the above analysis is sound, the fix could look like this:
Proposed fix:
Added a validity check before the comparison:
/*
* Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
* "wake all waiters" (e.g., during promotion when recovery ends).
*/
if (XLogRecPtrIsValid(currentLSN) &&
pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
return;
Result:
Test time: 68s → 9s
WAIT FOR LSN exits immediately on promotion (62ms vs 60s)
While at it, I noticed a couple of comments refer to WaitForLSNReplay,
but I think that got renamed simply to WaitForLSN.
Please check the attached patch for replacing them.
Thank you so much for your patches!
Pushed with minor corrections.
------
Regards,
Alexander Korotkov
Supabase
On Sat, Nov 8, 2025 at 12:02 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Nov 5, 2025 at 4:03 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
Thanks for pushing them!
One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...
The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.
Interesting, could this approach be extended to the flush and other
modes as well? I might need to spend some time to understand it before
I can provide a meaningful review.
I think we might end up extending WaitLSNType enum. However, I hate
inHeap and heapNode arrays growing in WaitLSNProcInfo as they are
allocated per process. I found that we could optimize WaitLSNProcInfo
struct turning them into simple variables because a single process can
wait only for a single LSN at a time. Please, check the attached
patch.
Here is the updated patch integrating minor corrections provided by
Xuneng Zhou off-list. I'm going to push this if no objections.
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v3-0001-Optimize-shared-memory-usage-for-WaitLSNProcInfo.patch (application/octet-stream)
From 09c5f97fac0b14d82d2108d4b31777c7a639608e Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sun, 16 Nov 2025 14:06:50 +0200
Subject: [PATCH v3] Optimize shared memory usage for WaitLSNProcInfo
We need separate pairing heaps for different WaitLSNType's, because there
might be waiters for different LSN's at the same time. However, one process
can wait only for one type of LSN at a time. So, no need for inHeap
and heapNode fields to be arrays.
Discussion: https://postgr.es/m/CAPpHfdsBR-7sDtXFJ1qpJtKiohfGoj%3DvqzKVjWxtWsWidx7G_A%40mail.gmail.com
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
---
src/backend/access/transam/xlogwait.c | 42 ++++++++++++---------------
src/include/access/xlogwait.h | 14 ++++++---
2 files changed, 29 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 78de93db47f..98aa5f1e4a2 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -90,7 +90,7 @@ WaitLSNShmemInit(void)
for (i = 0; i < WAIT_LSN_TYPE_COUNT; i++)
{
pg_atomic_init_u64(&waitLSNState->minWaitedLSN[i], PG_UINT64_MAX);
- pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, (void *) (uintptr_t) i);
+ pairingheap_initialize(&waitLSNState->waitersHeap[i], waitlsn_cmp, NULL);
}
/* Initialize process info array */
@@ -106,9 +106,8 @@ WaitLSNShmemInit(void)
static int
waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
{
- int i = (uintptr_t) arg;
- const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], a);
- const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode[i], b);
+ const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, heapNode, a);
+ const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, heapNode, b);
if (aproc->waitLSN < bproc->waitLSN)
return 1;
@@ -132,7 +131,7 @@ updateMinWaitedLSN(WaitLSNType lsnType)
if (!pairingheap_is_empty(&waitLSNState->waitersHeap[i]))
{
pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap[i]);
- WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+ WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, heapNode, node);
minWaitedLSN = procInfo->waitLSN;
}
@@ -154,10 +153,11 @@ addLSNWaiter(XLogRecPtr lsn, WaitLSNType lsnType)
procInfo->procno = MyProcNumber;
procInfo->waitLSN = lsn;
+ procInfo->lsnType = lsnType;
- Assert(!procInfo->inHeap[i]);
- pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
- procInfo->inHeap[i] = true;
+ Assert(!procInfo->inHeap);
+ pairingheap_add(&waitLSNState->waitersHeap[i], &procInfo->heapNode);
+ procInfo->inHeap = true;
updateMinWaitedLSN(lsnType);
LWLockRelease(WaitLSNLock);
@@ -176,10 +176,12 @@ deleteLSNWaiter(WaitLSNType lsnType)
LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
- if (procInfo->inHeap[i])
+ Assert(procInfo->lsnType == lsnType);
+
+ if (procInfo->inHeap)
{
- pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode[i]);
- procInfo->inHeap[i] = false;
+ pairingheap_remove(&waitLSNState->waitersHeap[i], &procInfo->heapNode);
+ procInfo->inHeap = false;
updateMinWaitedLSN(lsnType);
}
@@ -228,7 +230,7 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
WaitLSNProcInfo *procInfo;
/* Get procInfo using appropriate heap node */
- procInfo = pairingheap_container(WaitLSNProcInfo, heapNode[i], node);
+ procInfo = pairingheap_container(WaitLSNProcInfo, heapNode, node);
if (XLogRecPtrIsValid(currentLSN) && procInfo->waitLSN > currentLSN)
break;
@@ -238,7 +240,7 @@ wakeupWaiters(WaitLSNType lsnType, XLogRecPtr currentLSN)
(void) pairingheap_remove_first(&waitLSNState->waitersHeap[i]);
/* Update appropriate flag */
- procInfo->inHeap[i] = false;
+ procInfo->inHeap = false;
if (numWakeUpProcs == WAKEUP_PROC_STATIC_ARRAY_SIZE)
break;
@@ -289,20 +291,14 @@ WaitLSNCleanup(void)
{
if (waitLSNState)
{
- int i;
-
/*
- * We do a fast-path check of the heap flags without the lock. These
- * flags are set to true only by the process itself. So, it's only
+ * We do a fast-path check of the inHeap flag without the lock. This
+ * flag is set to true only by the process itself. So, it's only
* possible to get a false positive. But that will be eliminated by a
* recheck inside deleteLSNWaiter().
*/
-
- for (i = 0; i < (int) WAIT_LSN_TYPE_COUNT; i++)
- {
- if (waitLSNState->procInfos[MyProcNumber].inHeap[i])
- deleteLSNWaiter((WaitLSNType) i);
- }
+ if (waitLSNState->procInfos[MyProcNumber].inHeap)
+ deleteLSNWaiter(waitLSNState->procInfos[MyProcNumber].lsnType);
}
}
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index f43e481c3b9..e607441d618 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -50,14 +50,20 @@ typedef struct WaitLSNProcInfo
/* LSN, which this process is waiting for */
XLogRecPtr waitLSN;
+ /* The type of LSN to wait */
+ WaitLSNType lsnType;
+
/* Process to wake up once the waitLSN is reached */
ProcNumber procno;
- /* Heap membership flags for LSN types */
- bool inHeap[WAIT_LSN_TYPE_COUNT];
+ /*
+ * Heap membership flag. A process can wait for only one LSN type at a
+ * time, so a single flag suffices (tracked by the lsnType field).
+ */
+ bool inHeap;
- /* Heap nodes for LSN types */
- pairingheap_node heapNode[WAIT_LSN_TYPE_COUNT];
+ /* Pairing heap node for the waiters' heap (one per process) */
+ pairingheap_node heapNode;
} WaitLSNProcInfo;
/*
--
2.39.5 (Apple Git-154)
On Wed, Nov 12, 2025 at 9:20 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Wed, Nov 5, 2025 at 5:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
One thing that'd be nice to do once we have WAIT FOR is to make the common
case of wait_for_catchup() use this facility, instead of polling...
The draft patch for that is attached. WAIT FOR doesn't handle all the
possible use cases of wait_for_catchup(), but I've added usage when
it's appropriate.
I tested the patch using make check-world, and it worked well. I also
made a few adjustments:
- Added an unconditional chomp($isrecovery) after querying
pg_is_in_recovery() to prevent newline mismatches when $target_lsn is
accidentally defined.
- Added chomp($output) to normalize the result from WAIT FOR LSN
before comparison.
At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.
I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for a certain transaction to be
flushed at the replica (without delaying the commit at the primary).
------
Regards,
Alexander Korotkov
Supabase
Hi Alexander,
On Sat, Nov 15, 2025 at 6:29 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Xuneng!
On Fri, Nov 14, 2025 at 3:50 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Fri, Nov 14, 2025 at 4:32 AM Tomas Vondra <tomas@vondra.me> wrote:
On 11/5/25 10:51, Alexander Korotkov wrote:
Hi!
On Mon, Nov 3, 2025 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-11-03 16:06:58 +0100, Álvaro Herrera wrote:
On 2025-Nov-03, Alexander Korotkov wrote:
I'd like to give this subject another chance for pg19. I'm going to
push this if no objections.
Sure. I don't understand why patches 0002 and 0003 are separate though.
FWIW, I appreciate such splits. Even if the functionality isn't usable
independently, it's still different type of code that's affected. And the
patches are each big enough to make that worthwhile for easier review.
Thank you for the feedback, pushed.
Hi,
The new TAP test 049_wait_for_lsn.pl introduced by this commit takes a
long time - about 65 seconds on my laptop. That's about 25%
of the whole src/test/recovery, more than any other test.
And most of the time there's nothing happening - these are the two log
messages showing the 60-second wait:
2025-11-13 21:12:39.949 CET checkpointer[562597] LOG: checkpoint
complete: wrote 9 buffers (7.0%), wrote 3 SLRU buffers; 0 WAL file(s)
added, 0 removed, 2 recycled; write=0.906 s, sync=0.001 s, total=0.907
s; sync files=0, longest=0.000 s, average=0.000 s; distance=32768 kB,
estimate=32768 kB; lsn=0/040000B8, redo lsn=0/04000060
2025-11-13 21:13:38.994 CET client backend[562727] 049_wait_for_lsn.pl
ERROR: recovery is not in progress
So there's a checkpoint, 60 seconds of nothing, and then a failure. I
haven't looked into why it waits for 1 minute exactly, but adding 60
seconds to check-world is somewhat annoying.
Thanks for looking into this!
I did a quick analysis for this prolonged waiting:
In WaitLSNWakeup() (xlogwait.c:267), the fast-path check incorrectly
handled InvalidXLogRecPtr:
/* Fast path check */
if (pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
    return; // Issue: Returns early when currentLSN = 0
When currentLSN = InvalidXLogRecPtr (0), meaning "wake all waiters",
the check compared:
- minWaitedLSN (e.g., 0x570CC048) > 0 → TRUE
- Result: function returned early without waking anyone
When It Happened
During standby promotion, xlog.c:6246 calls:
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
This should wake all LSN waiters, but the bug prevented it. WAIT FOR
LSN commands could wait indefinitely. Test 049_wait_for_lsn.pl took 68
seconds instead of ~9 seconds.
If the above analysis is sound, the fix could be like:
Proposed fix:
Added a validity check before the comparison:
/*
 * Fast path check. Skip if currentLSN is InvalidXLogRecPtr, which means
 * "wake all waiters" (e.g., during promotion when recovery ends).
 */
if (XLogRecPtrIsValid(currentLSN) &&
    pg_atomic_read_u64(&waitLSNState->minWaitedLSN[i]) > currentLSN)
    return;
Result:
Test time: 68s → 9s
WAIT FOR LSN exits immediately on promotion (62ms vs 60s)
While at it, I noticed a couple comments refer to WaitForLSNReplay,
but I think that got renamed simply to WaitForLSN.
Please check the attached patch for replacing them.
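The fast-path problem analysed above can be reproduced in isolation. The sketch below is only an illustration under simplifying assumptions: XLogRecPtr is modelled as a plain 64-bit integer, and skip_wakeup_old(), skip_wakeup_fixed() and demo_promotion_wakeup() are hypothetical helper names, not functions from xlogwait.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Old fast path: returns true when the wakeup pass is skipped. */
bool
skip_wakeup_old(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
	return minWaitedLSN > currentLSN;
}

/*
 * Fixed fast path: an invalid currentLSN means "wake all waiters",
 * so the shortcut must never be taken in that case.
 */
bool
skip_wakeup_fixed(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
	return currentLSN != InvalidXLogRecPtr && minWaitedLSN > currentLSN;
}

/* Walks through the promotion case from the analysis. */
void
demo_promotion_wakeup(void)
{
	XLogRecPtr	minWaited = 0x570CC048;

	/* Promotion passes InvalidXLogRecPtr: the old check skips everyone. */
	assert(skip_wakeup_old(minWaited, InvalidXLogRecPtr));

	/* The fixed check proceeds to the heap scan and wakes waiters. */
	assert(!skip_wakeup_fixed(minWaited, InvalidXLogRecPtr));
}
```

For ordinary (valid) LSNs both variants behave the same, so the fix only changes the "wake everyone" case.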
Thank you so much for your patches!
Pushed with minor corrections.
Thanks for pushing! It appears I should be running pgindent more regularly :).
--
Best,
Xuneng
Hi!
On Sun, Nov 16, 2025 at 8:09 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Here is the updated patch integrating minor corrections provided by
Xuneng Zhou off-list. I'm going to push this if no objections.
LGTM. Thanks.
--
Best,
Xuneng
Hi!
On Sun, Nov 16, 2025 at 8:37 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
I would say that adding mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for certain transaction to be
flushed at the replica (without delaying the commit at the primary).
Makes sense. I'll play with it and try to prepare a follow-up patch.
--
Best,
Xuneng
On Sun, Nov 16, 2025 at 3:25 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Sat, Nov 15, 2025 at 6:29 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Thank you so much for your patches!
Pushed with minor corrections.
Thanks for pushing! It appears I should be running pgindent more regularly :).
Thank you. pgindent is not a problem for me, because I run it anyway
every time before pushing a patch. But yes, if you make it a habit to
run pgindent every time before publishing a patch, your patches would
become cleaner.
------
Regards,
Alexander Korotkov
Supabase
Hi Alexander, Hackers,
On Sun, Nov 16, 2025 at 10:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Makes sense. I'll play with it and try to prepare a follow-up patch.
In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.
Syntax Option A: Mode in the WITH Clause
WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');
With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.
Syntax Option B: Mode as Part of the Main Command
WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';
Or a more concise variant using keywords:
WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;
This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.
I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.
--
Best,
Xuneng
Hi!
I've implemented a patch that adds MODE support to WAIT FOR LSN.
The new grammar looks like:
-------
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
-------
Two modes are added: flush and write.
Design decisions:
1. MODE as a separate keyword (not in WITH clause) - This follows the
pattern used by LOCK command. It also makes the common case more
concise.
2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.
3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`
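As a rough illustration of design decisions 2 and 3 (this is not the patch's grammar code), the sketch below maps an optional mode keyword to an enum value, falls back to REPLAY when no mode is given, and matches keywords case-insensitively. WaitMode, keyword_equals() and parse_wait_mode() are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical enum for the three modes named in the proposed grammar. */
typedef enum WaitMode
{
	WAIT_MODE_REPLAY,
	WAIT_MODE_WRITE,
	WAIT_MODE_FLUSH
} WaitMode;

/* Case-insensitive comparison of two NUL-terminated keywords. */
static bool
keyword_equals(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return false;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/*
 * Resolve an optional MODE keyword.  A NULL keyword stands for "MODE was
 * not specified" and yields the REPLAY default.  Returns false for an
 * unknown keyword and leaves *mode untouched in that case.
 */
bool
parse_wait_mode(const char *keyword, WaitMode *mode)
{
	assert(mode != NULL);

	if (keyword == NULL || keyword_equals(keyword, "replay"))
		*mode = WAIT_MODE_REPLAY;
	else if (keyword_equals(keyword, "write"))
		*mode = WAIT_MODE_WRITE;
	else if (keyword_equals(keyword, "flush"))
		*mode = WAIT_MODE_FLUSH;
	else
		return false;

	return true;
}
```

In a real grammar the keywords would be resolved by the parser rather than by string comparison; the sketch only shows the defaulting behaviour.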
The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types
Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
enum (the existing WAIT_LSN_TYPE_FLUSH becomes
WAIT_LSN_TYPE_FLUSH_PRIMARY), along with corresponding wait events and
pairing heaps. Introduces GetCurrentLSNForWaitType() to retrieve the
appropriate LSN based on wait type, and adds wakeup calls in walreceiver
for write/flush events.
-------
0002 - Add pg_last_wal_write_lsn() SQL function
Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.
-------
0003 - Add MODE parameter to WAIT FOR LSN command
Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.
-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter
Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.
-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()
Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.
The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.
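To make the idea of mixed-mode concurrent waiters concrete, here is a deliberately simplified sketch. It swaps the per-type pairing heaps for a linear scan over a small fixed array, so it only illustrates the rule that a wakeup for one wait type touches waiters of that type alone; WaitType, Waiter, add_waiter() and wake_waiters() are hypothetical names, not the xlogwait.c API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical wait types, mirroring the three standby modes. */
typedef enum WaitType
{
	WAIT_REPLAY,
	WAIT_WRITE,
	WAIT_FLUSH,
	WAIT_TYPE_COUNT
} WaitType;

typedef struct Waiter
{
	XLogRecPtr	waitLSN;		/* LSN this waiter needs */
	WaitType	type;			/* which LSN kind it waits on */
	bool		active;			/* slot in use */
} Waiter;

#define MAX_WAITERS 8
static Waiter waiters[MAX_WAITERS];

/* Register a waiter; returns its slot index, or -1 if the table is full. */
int
add_waiter(WaitType type, XLogRecPtr lsn)
{
	assert(type < WAIT_TYPE_COUNT);

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (!waiters[i].active)
		{
			waiters[i].waitLSN = lsn;
			waiters[i].type = type;
			waiters[i].active = true;
			return i;
		}
	}
	return -1;
}

/*
 * Release every waiter of the given type whose target LSN has been
 * reached.  Waiters of other types are left alone.  Returns the number
 * of waiters released.
 */
int
wake_waiters(WaitType type, XLogRecPtr currentLSN)
{
	int			woken = 0;

	for (int i = 0; i < MAX_WAITERS; i++)
	{
		if (waiters[i].active && waiters[i].type == type &&
			waiters[i].waitLSN <= currentLSN)
		{
			waiters[i].active = false;
			woken++;
		}
	}
	return woken;
}
```

A heap keyed on waitLSN, as used by the real infrastructure, avoids scanning waiters whose LSN is still ahead of the current position; the linear scan here is chosen only for brevity.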
Feedback welcome.
--
Best,
Xuneng
Attachments:
v1-0002-Add-pg_last_wal_write_lsn-SQL-function.patch
From 7227bca84a9233fb2d7c130294511d48d8458e2f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:07:52 +0800
Subject: [PATCH v1 2/5] Add pg_last_wal_write_lsn() SQL function
Returns the current WAL write position on a standby server using
GetWalRcvWriteRecPtr(). This enables verification of WAIT FOR LSN MODE WRITE
and operational monitoring of standby WAL write progress.
---
doc/src/sgml/func/func-admin.sgml | 19 +++++++++++++++++++
src/backend/access/transam/xlogfuncs.c | 19 +++++++++++++++++++
src/include/catalog/pg_proc.dat | 4 ++++
3 files changed, 42 insertions(+)
diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index 1b465bc8ba7..ed4e77d12ba 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -688,6 +688,25 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_last_wal_write_lsn</primary>
+ </indexterm>
+ <function>pg_last_wal_write_lsn</function> ()
+ <returnvalue>pg_lsn</returnvalue>
+ </para>
+ <para>
+ Returns the last write-ahead log location that has been received and
+ written to disk by streaming replication, but not necessarily synced.
+ While streaming replication is in progress this will increase
+ monotonically. If recovery has completed then this will remain static
+ at the location of the last WAL record written during recovery. If
+ streaming replication is disabled, or if it has not yet started, the
+ function returns <literal>NULL</literal>.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e45fce43ed..46cd4a7ce2f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -347,6 +347,25 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
PG_RETURN_LSN(recptr);
}
+/*
+ * Report the last WAL write location (same format as pg_backup_start etc)
+ *
+ * This is useful for determining how much of WAL has been received and
+ * written to disk by walreceiver, but not necessarily synced/flushed.
+ */
+Datum
+pg_last_wal_write_lsn(PG_FUNCTION_ARGS)
+{
+ XLogRecPtr recptr;
+
+ recptr = GetWalRcvWriteRecPtr();
+
+ if (!XLogRecPtrIsValid(recptr))
+ PG_RETURN_NULL();
+
+ PG_RETURN_LSN(recptr);
+}
+
/*
* Report the last WAL replay location (same format as pg_backup_start etc)
*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66431940700..fcb674c05b3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6782,6 +6782,10 @@
proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
prorettype => 'pg_lsn', proargtypes => '',
prosrc => 'pg_last_wal_receive_lsn' },
+{ oid => '6434', descr => 'current wal write location',
+ proname => 'pg_last_wal_write_lsn', provolatile => 'v',
+ prorettype => 'pg_lsn', proargtypes => '',
+ prosrc => 'pg_last_wal_write_lsn' },
{ oid => '3821', descr => 'last wal replay location',
proname => 'pg_last_wal_replay_lsn', provolatile => 'v',
prorettype => 'pg_lsn', proargtypes => '',
--
2.51.0
v1-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch
From ca10a52bd7a835b2873268236a4553fc911e2de3 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 17:58:28 +0800
Subject: [PATCH v1 1/5] Extend xlogwait infrastructure with write and flush
wait types
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.
Key changes:
- Add WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
src/backend/access/transam/xlogwait.c | 79 ++++++++++++++-----
.../utils/activity/wait_event_names.txt | 3 +-
src/include/access/xlogwait.h | 7 +-
3 files changed, 67 insertions(+), 22 deletions(-)
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 98aa5f1e4a2..86709e0df63 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
* This file implements waiting for WAL operations to reach specific LSNs
* on both physical standby and primary servers. The core idea is simple:
* every process that wants to wait publishes the LSN it needs to the
- * shared memory, and the appropriate process (startup on standby, or
- * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ * shared memory, and the appropriate process (startup on standby,
+ * walreceiver on standby, or WAL writer/backend on primary) wakes it
+ * once that LSN has been reached.
*
* The shared memory used by this module comprises a procInfos
* per-backend array with the information of the awaited LSN for each
* of the backend processes. The elements of that array are organized
- * into a pairing heap waitersHeap, which allows for very fast finding
- * of the least awaited LSN.
+ * into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ * allows for very fast finding of the least awaited LSN for each type.
*
- * In addition, the least-awaited LSN is cached as minWaitedLSN. The
- * waiter process publishes information about itself to the shared
- * memory and waits on the latch until it is woken up by the appropriate
- * process, standby is promoted, or the postmaster dies. Then, it cleans
- * information about itself in the shared memory.
+ * In addition, the least-awaited LSN for each type is cached in the
+ * minWaitedLSN array. The waiter process publishes information about
+ * itself to the shared memory and waits on the latch until it is woken
+ * up by the appropriate process, standby is promoted, or the postmaster
+ * dies. Then, it cleans information about itself in the shared memory.
*
- * On standby servers: After replaying a WAL record, the startup process
- * first performs a fast path check minWaitedLSN > replayLSN. If this
- * check is negative, it checks waitersHeap and wakes up the backend
- * whose awaited LSNs are reached.
+ * On standby servers:
+ * - After replaying a WAL record, the startup process performs a fast
+ * path check minWaitedLSN[REPLAY] > replayLSN. If this check is
+ * negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ * whose awaited LSNs are reached.
+ * - After receiving WAL, the walreceiver process performs similar checks
+ * against the flush and write LSNs, waking up waiters in the FLUSH
+ * and WRITE heaps respectively.
*
* On primary servers: After flushing WAL, the WAL writer or backend
* process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -62,6 +68,43 @@ static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
struct WaitLSNState *waitLSNState = NULL;
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+ [WAIT_LSN_TYPE_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+ [WAIT_LSN_TYPE_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+ [WAIT_LSN_TYPE_FLUSH_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+ [WAIT_LSN_TYPE_FLUSH_PRIMARY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+ switch (lsnType)
+ {
+ case WAIT_LSN_TYPE_REPLAY:
+ return GetXLogReplayRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_WRITE:
+ return GetWalRcvWriteRecPtr();
+
+ case WAIT_LSN_TYPE_FLUSH_STANDBY:
+ return GetWalRcvFlushRecPtr(NULL, NULL);
+
+ case WAIT_LSN_TYPE_FLUSH_PRIMARY:
+ return GetFlushRecPtr(NULL);
+
+ default:
+ elog(ERROR, "invalid LSN wait type: %d", lsnType);
+ return InvalidXLogRecPtr; /* keep compiler quiet */
+ }
+}
+
/* Report the amount of shared memory space needed for WaitLSNState. */
Size
WaitLSNShmemSize(void)
@@ -341,13 +384,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
int rc;
long delay_ms = -1;
- if (lsnType == WAIT_LSN_TYPE_REPLAY)
- currentLSN = GetXLogReplayRecPtr(NULL);
- else
- currentLSN = GetFlushRecPtr(NULL);
+ /* Get current LSN for the wait type */
+ currentLSN = GetCurrentLSNForWaitType(lsnType);
/* Check that recovery is still in-progress */
- if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+ if (lsnType != WAIT_LSN_TYPE_FLUSH_PRIMARY && !RecoveryInProgress())
{
/*
* Recovery was ended, but check if target LSN was already
@@ -376,7 +417,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
CHECK_FOR_INTERRUPTS();
rc = WaitLatch(MyLatch, wake_events, delay_ms,
- (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+ WaitLSNWaitEvents[lsnType]);
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..fbcdb92dcfb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary or standby."
WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE "Waiting for WAL write to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index e607441d618..64a2fb02eac 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -36,8 +36,10 @@ typedef enum
typedef enum WaitLSNType
{
WAIT_LSN_TYPE_REPLAY = 0, /* Waiting for replay on standby */
- WAIT_LSN_TYPE_FLUSH = 1, /* Waiting for flush on primary */
- WAIT_LSN_TYPE_COUNT = 2
+ WAIT_LSN_TYPE_FLUSH_STANDBY = 1, /* Waiting for flush on standby */
+ WAIT_LSN_TYPE_WRITE = 2, /* Waiting for write on standby */
+ WAIT_LSN_TYPE_FLUSH_PRIMARY = 3, /* Waiting for flush on primary */
+ WAIT_LSN_TYPE_COUNT = 4
} WaitLSNType;
/*
@@ -96,6 +98,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
extern Size WaitLSNShmemSize(void);
extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
--
2.51.0
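
A note on the WaitLSNWaitEvents[] lookup introduced above: the table is indexed directly by the enum value, so it is only safe as long as every WaitLSNType has an entry. The following is a minimal standalone sketch of that pattern, not code from the patch. All Demo*/DEMO_* names and the numeric event values are hypothetical stand-ins, and the compile-time size check is a suggestion rather than something the patch contains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's WaitLSNType. */
typedef enum DemoWaitType
{
    DEMO_WAIT_REPLAY = 0,
    DEMO_WAIT_FLUSH_STANDBY = 1,
    DEMO_WAIT_WRITE = 2,
    DEMO_WAIT_FLUSH_PRIMARY = 3,
    DEMO_WAIT_COUNT = 4
} DemoWaitType;

/* Hypothetical wait-event identifiers (values are arbitrary, nonzero). */
enum
{
    DEMO_EVENT_REPLAY = 101,
    DEMO_EVENT_WRITE = 102,
    DEMO_EVENT_FLUSH = 103
};

/* Designated initializers keep each entry next to its enum value. */
static const uint32_t demo_wait_events[] = {
    [DEMO_WAIT_REPLAY] = DEMO_EVENT_REPLAY,
    [DEMO_WAIT_WRITE] = DEMO_EVENT_WRITE,
    [DEMO_WAIT_FLUSH_STANDBY] = DEMO_EVENT_FLUSH,
    [DEMO_WAIT_FLUSH_PRIMARY] = DEMO_EVENT_FLUSH,
};

/*
 * Catches a missing trailing entry at compile time. A gap in the middle
 * would still be zero-initialized silently, so this is only a partial check.
 */
static_assert(sizeof(demo_wait_events) / sizeof(demo_wait_events[0]) ==
              DEMO_WAIT_COUNT,
              "demo_wait_events must cover every DemoWaitType");

/* Look up the wait event for a wait type, with a bounds check. */
static uint32_t
demo_wait_event_for(DemoWaitType type)
{
    assert((int) type >= 0 && (int) type < DEMO_WAIT_COUNT);
    return demo_wait_events[type];
}
```

Both flush variants deliberately map to the same event here, mirroring how the patch reuses WAIT_EVENT_WAIT_FOR_WAL_FLUSH for primary and standby.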
Attachment: v1-0005-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 6229917d4802a82bb63ac41ec32a7ca357701c67 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:20:18 +0800
Subject: [PATCH v1 5/5] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
Replace polling-based catchup waiting with the WAIT FOR LSN command
when running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication, because WAIT FOR uses a latch-based wakeup
mechanism instead of polling.
The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
1 file changed, 32 insertions(+), 3 deletions(-)
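
For readers who do not read Perl, the dispatch rule described in the commit message can be sketched in C as follows. This is an illustration only: demo_build_wait_query() and its parameters are hypothetical names, and the sketch assumes the mode string has already been validated against the four known modes, as wait_for_catchup() does before reaching this point.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a WAIT FOR LSN query when it applies.  Returns false when the
 * caller should fall back to polling pg_stat_replication ('sent' mode or
 * a node that is not in recovery), or when the buffer is too small.
 */
static bool
demo_build_wait_query(char *buf, size_t buflen, const char *mode,
                      bool in_recovery, const char *target_lsn,
                      int timeout_secs)
{
    char    upper[16];
    size_t  i;
    int     n;

    if (!in_recovery || strcmp(mode, "sent") == 0)
        return false;
    if (strlen(mode) >= sizeof(upper))
        return false;

    /* MODE keywords are written in upper case, like uc($mode) in Perl. */
    for (i = 0; mode[i] != '\0'; i++)
        upper[i] = (char) toupper((unsigned char) mode[i]);
    upper[i] = '\0';

    n = snprintf(buf, buflen,
                 "WAIT FOR LSN '%s' MODE %s WITH (timeout '%ds', no_throw);",
                 target_lsn, upper, timeout_secs);
    return n >= 0 && (size_t) n < buflen;
}
```

With no_throw in the option list, the caller is expected to compare the returned status against 'success' and report anything else as a failure, which is what the Perl code in the patch does.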
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..b2a4e2e2253 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+ chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
- chomp($isrecovery);
if ($isrecovery eq 't')
{
$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,35 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+ # This is more efficient than polling pg_stat_replication
+ if (($mode ne 'sent') && ($isrecovery eq 't'))
+ {
+ my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+ # Map mode names to WAIT FOR LSN MODE values (uppercase)
+ my $wait_mode = uc($mode);
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+ chomp($output);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+${output});
+ diag qq(Last pg_stat_replication contents:
+${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
+ # Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.51.0
Attachment: v1-0003-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From 071f67d1fae98e397c071dce0b9993b3be0c0e9f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:11:54 +0800
Subject: [PATCH v1 3/5] Add MODE parameter to WAIT FOR LSN command
Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN
The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by the LOCK command,
where the mode parameter is optional with a sensible default.
The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.
Also includes:
- Documentation updates for the new syntax and refactoring
of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
doc/src/sgml/ref/wait_for.sgml | 184 ++++++++++++++++------
src/backend/access/transam/xlog.c | 6 +-
src/backend/commands/wait.c | 59 +++++--
src/backend/parser/gram.y | 21 ++-
src/backend/replication/walreceiver.c | 19 +++
src/include/nodes/parsenodes.h | 16 ++
src/include/parser/kwlist.h | 2 +
src/test/recovery/t/049_wait_for_lsn.pl | 201 +++++++++++++++++++++---
8 files changed, 422 insertions(+), 86 deletions(-)
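
This patch converts the parse-time WaitLSNMode to the runtime WaitLSNType with a plain cast and relies on a comment to keep the two enums in sync. One way to enforce that at compile time is sketched below. It is a standalone illustration, not part of the patch: the Demo*/DEMO_* names are hypothetical, and in-tree code would presumably use PostgreSQL's own static assertion macro (StaticAssertDecl, as far as I know) rather than C11 static_assert.

```c
#include <assert.h>

/* Hypothetical stand-in for the runtime enum (WaitLSNType). */
typedef enum DemoLSNType
{
    DEMO_TYPE_REPLAY = 0,
    DEMO_TYPE_FLUSH_STANDBY = 1,
    DEMO_TYPE_WRITE = 2,
    DEMO_TYPE_FLUSH_PRIMARY = 3
} DemoLSNType;

/* Hypothetical stand-in for the parse-time enum (WaitLSNMode). */
typedef enum DemoLSNMode
{
    DEMO_MODE_REPLAY = 0,
    DEMO_MODE_FLUSH = 1,
    DEMO_MODE_WRITE = 2
} DemoLSNMode;

/* The build breaks if someone renumbers one enum but not the other. */
static_assert((int) DEMO_MODE_REPLAY == (int) DEMO_TYPE_REPLAY,
              "REPLAY values must match");
static_assert((int) DEMO_MODE_FLUSH == (int) DEMO_TYPE_FLUSH_STANDBY,
              "FLUSH values must match");
static_assert((int) DEMO_MODE_WRITE == (int) DEMO_TYPE_WRITE,
              "WRITE values must match");

/* With the assertions above in place, the cast is safe by construction. */
static DemoLSNType
demo_mode_to_type(DemoLSNMode mode)
{
    DemoLSNType type = (DemoLSNType) mode;

    /* Standby-side modes only; the primary flush type is never parsed. */
    assert((int) type >= DEMO_TYPE_REPLAY &&
           (int) type < DEMO_TYPE_FLUSH_PRIMARY);
    return type;
}
```

The runtime assert plays the same role as the Assert() in ExecWaitStmt(): it documents that the grammar can only produce the three standby-side modes.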
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..efd851149c0 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
<refnamediv>
<refname>WAIT FOR</refname>
- <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
</refnamediv>
<refsynopsisdiv>
<synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+ [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Description</title>
<para>
- Waits until recovery replays <parameter>lsn</parameter>.
- If no <parameter>timeout</parameter> is specified or it is set to
- zero, this command waits indefinitely for the
- <parameter>lsn</parameter>.
- On timeout, or if the server is promoted before
- <parameter>lsn</parameter> is reached, an error is emitted,
- unless <literal>NO_THROW</literal> is specified in the WITH clause.
- If <parameter>NO_THROW</parameter> is specified, then the command
- doesn't throw errors.
+ Waits until the specified <parameter>lsn</parameter> is reached
+ according to the specified <parameter>mode</parameter>,
+ which determines whether to wait for WAL to be written, flushed, or replayed.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
</para>
<para>
- The possible return values are <literal>success</literal>,
- <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
</para>
</refsect1>
@@ -64,6 +67,53 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>MODE</literal></term>
+ <listitem>
+ <para>
+ Specifies the type of LSN processing to wait for. If not specified,
+ the default is <literal>REPLAY</literal>. The valid modes are:
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>REPLAY</literal></term>
+ <listitem>
+ <para>
+ Wait for the LSN to be replayed (applied to the database).
+ After successful completion, <function>pg_last_wal_replay_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>FLUSH</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and flushed to durable storage on the replica. This
+ provides a durability guarantee without waiting for the WAL
+ to be applied.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WRITE</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and written to the operating system on the replica.
+ This is faster than <literal>FLUSH</literal> but provides weaker
+ durability guarantees since the data may still be in OS buffers.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
<listitem>
@@ -135,9 +185,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<listitem>
<para>
This return value denotes that the database server is not in a recovery
- state. This might mean either the database server was not in recovery
- at the moment of receiving the command, or it was promoted before
- reaching the target <parameter>lsn</parameter>.
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command (i.e., executed on a primary),
+ or it was promoted before reaching the target <parameter>lsn</parameter>.
+ In the promotion case, this status indicates a timeline change occurred,
+ and the application should re-evaluate whether the target LSN is still
+ relevant.
</para>
</listitem>
</varlistentry>
@@ -148,25 +201,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Notes</title>
<para>
- <command>WAIT FOR</command> command waits till
- <parameter>lsn</parameter> to be replayed on standby.
- That is, after this command execution, the value returned by
- <function>pg_last_wal_replay_lsn</function> should be greater or equal
- to the <parameter>lsn</parameter> value. This is useful to achieve
- read-your-writes-consistency, while using async replica for reads and
- primary for writes. In that case, the <acronym>lsn</acronym> of the last
- modification should be stored on the client application side or the
- connection pooler side.
+ <command>WAIT FOR</command> waits until the specified
+ <parameter>lsn</parameter> is reached according to the specified
+ <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+ for the LSN to be replayed (applied to the database), which is useful
+ to achieve read-your-writes consistency while using an async replica
+ for reads and the primary for writes. The <literal>FLUSH</literal> mode
+ waits for the WAL to be flushed to durable storage on the replica,
+ providing a durability guarantee without waiting for replay. The
+ <literal>WRITE</literal> mode waits for the WAL to be written to the
+ operating system, which is faster than flush but provides weaker
+ durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or
+ the connection pooler side.
</para>
<para>
- <command>WAIT FOR</command> command should be called on standby.
- If a user runs <command>WAIT FOR</command> on primary, it
- will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
- However, if <command>WAIT FOR</command> is
- called on primary promoted from standby and <literal>lsn</literal>
- was already replayed, then the <command>WAIT FOR</command> command just
- exits immediately.
+ <command>WAIT FOR</command> should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified
+ in the WITH clause. However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already reached, then the <command>WAIT FOR</command> command
+ just exits immediately. If the replica is promoted while waiting,
+ the command will return <literal>not in recovery</literal> (or throw
+ an error if <literal>NO_THROW</literal> is not specified). Promotion
+ creates a new timeline, and the LSN being waited for may refer to
+ WAL from the old timeline.
</para>
</refsect1>
@@ -175,21 +236,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Examples</title>
<para>
- You can use <command>WAIT FOR</command> command to wait for
- the <type>pg_lsn</type> value. For example, an application could update
- the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
- changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
- on primary server to get the <acronym>lsn</acronym> given that
- <varname>synchronous_commit</varname> could be set to
- <literal>off</literal>.
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> of
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>lsn</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
<programlisting>
postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
UPDATE 100
postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
(1 row)
</programlisting>
@@ -198,9 +259,9 @@ pg_current_wal_insert_lsn
changes made on primary should be guaranteed to be visible on replica.
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
status
---------
+---------
success
(1 row)
postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +272,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
</para>
<para>
- If the target LSN is not reached before the timeout, the error is thrown.
+ Wait for flush (data durable on replica):
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
</programlisting>
</para>
<para>
The same example uses <command>WAIT FOR</command> with
- <parameter>NO_THROW</parameter> option.
+ <parameter>NO_THROW</parameter> option:
+
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
status
---------
+---------
timeout
(1 row)
</programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..a4c7a7c2b38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6240,10 +6240,12 @@ StartupXLOG(void)
LWLockRelease(ControlFileLock);
/*
- * Wake up all waiters for replay LSN. They need to report an error that
- * recovery was ended before reaching the target LSN.
+ * Wake up all waiters. They need to report an error that recovery was
+ * ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_WRITE, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..73876ca5c7c 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
*
* wait.c
* Implements WAIT FOR, which allows waiting for events such as
- * time passing or LSN having been replayed on replica.
+ * time passing or LSN having been replayed, flushed, or written.
*
* Portions Copyright (c) 2025, PostgreSQL Global Development Group
*
@@ -15,6 +15,7 @@
#include <math.h>
+#include "access/xlog.h"
#include "access/xlogrecovery.h"
#include "access/xlogwait.h"
#include "commands/defrem.h"
@@ -28,12 +29,29 @@
#include "utils/snapmgr.h"
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+ const char *noun; /* "replay", "flush", "write" */
+ const char *verb; /* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+ [WAIT_LSN_TYPE_REPLAY] = {"replay", "replayed"},
+ [WAIT_LSN_TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
+ [WAIT_LSN_TYPE_WRITE] = {"write", "written"},
+};
+
+
void
ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
{
XLogRecPtr lsn;
int64 timeout = 0;
WaitLSNResult waitLSNResult;
+ WaitLSNType lsnType;
bool throw = true;
TupleDesc tupdesc;
TupOutputState *tstate;
@@ -41,6 +59,16 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
bool timeout_specified = false;
bool no_throw_specified = false;
+ /*
+ * Convert parse-time WaitLSNMode to runtime WaitLSNType. Values are
+ * designed to match, so a simple cast is safe.
+ */
+ lsnType = (WaitLSNType) stmt->mode;
+
+ /* Validate mode value (should never fail if grammar is correct) */
+ Assert(lsnType >= WAIT_LSN_TYPE_REPLAY &&
+ lsnType < WAIT_LSN_TYPE_FLUSH_PRIMARY);
+
/* Parse and validate the mandatory LSN */
lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
CStringGetDatum(stmt->lsn_literal)));
@@ -107,8 +135,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
}
/*
- * We are going to wait for the LSN replay. We should first care that we
- * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * We are going to wait for the LSN. We must first ensure that we don't
+ * hold a snapshot and, correspondingly, that our MyProc->xmin is invalid.
* Otherwise, our snapshot could prevent the replay of WAL records
* implying a kind of self-deadlock. This is the reason why WAIT FOR is a
* command, not a procedure or function.
@@ -140,7 +168,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
@@ -154,11 +182,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_TIMEOUT:
if (throw)
+ {
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
- errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
+ }
else
result = "timeout";
break;
@@ -166,20 +201,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errhint("Waiting for the replay LSN can only be executed during recovery."));
+ errhint("Waiting for the %s LSN can only be executed during recovery.",
+ desc->noun));
}
else
result = "not in recovery";
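
To make the generalized timeout message in the wait.c changes above easier to review, here is a small standalone sketch of how the noun/verb descriptor and the two LSNs combine into the final string. It is an approximation under stated assumptions: DemoTypeDesc and demo_format_timeout() are hypothetical names, and the high/low split of the 64-bit LSN is written out by hand where the patch uses LSN_FORMAT_ARGS().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counterpart of WaitLSNTypeDesc. */
typedef struct DemoTypeDesc
{
    const char *noun;           /* "replay", "flush", "write" */
    const char *verb;           /* "replayed", "flushed", "written" */
} DemoTypeDesc;

/*
 * Format the timeout message into buf.  An LSN is printed as the high
 * 32 bits, a slash, and the low 32 bits zero-padded to 8 hex digits.
 * Returns snprintf()'s result.
 */
static int
demo_format_timeout(char *buf, size_t buflen, const DemoTypeDesc *desc,
                    uint64_t target, uint64_t current)
{
    return snprintf(buf, buflen,
                    "timed out while waiting for target LSN %X/%08X to be %s; "
                    "current %s LSN %X/%08X",
                    (unsigned int) (target >> 32),
                    (unsigned int) (target & 0xFFFFFFFFu),
                    desc->verb,
                    desc->noun,
                    (unsigned int) (current >> 32),
                    (unsigned int) (current & 0xFFFFFFFFu));
}
```

Note that with the %08X format the low half is always eight hex digits wide, so an LSN written as 0/306EE20 in a query comes back as 0/0306EE20 in the message.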
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c3a0a354a9c..57b3e91893c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -640,6 +640,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <windef> window_definition over_clause window_specification
opt_frame_clause frame_extent frame_bound
%type <ival> null_treatment opt_window_exclusion_clause
+%type <ival> opt_wait_lsn_mode
%type <str> opt_existing_window_name
%type <boolean> opt_if_not_exists
%type <boolean> opt_unique_null_treatment
@@ -729,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
EXPRESSION EXTENSION EXTERNAL EXTRACT
- FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+ FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -770,7 +771,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -16489,15 +16490,23 @@ xml_passing_mech:
*****************************************************************************/
WaitStmt:
- WAIT FOR LSN_P Sconst opt_wait_with_clause
+ WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
{
WaitStmt *n = makeNode(WaitStmt);
n->lsn_literal = $4;
- n->options = $5;
+ n->mode = $5;
+ n->options = $6;
$$ = (Node *) n;
}
;
+opt_wait_lsn_mode:
+ MODE REPLAY { $$ = WAIT_LSN_MODE_REPLAY; }
+ | MODE FLUSH { $$ = WAIT_LSN_MODE_FLUSH; }
+ | MODE WRITE { $$ = WAIT_LSN_MODE_WRITE; }
+ | /*EMPTY*/ { $$ = WAIT_LSN_MODE_REPLAY; }
+ ;
+
opt_wait_with_clause:
WITH '(' utility_option_list ')' { $$ = $3; }
| /*EMPTY*/ { $$ = NIL; }
@@ -17937,6 +17946,7 @@ unreserved_keyword:
| FILTER
| FINALIZE
| FIRST_P
+ | FLUSH
| FOLLOWING
| FORCE
| FORMAT
@@ -18071,6 +18081,7 @@ unreserved_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESPECT_P
@@ -18524,6 +18535,7 @@ bare_label_keyword:
| FINALIZE
| FIRST_P
| FLOAT_P
+ | FLUSH
| FOLLOWING
| FORCE
| FOREIGN
@@ -18706,6 +18718,7 @@ bare_label_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..818049599ed 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "catalog/pg_authid.h"
#include "funcapi.h"
#include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
/* Update shared-memory status */
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ /*
+ * If we wrote an LSN that someone was waiting for then walk over the
+ * shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Write >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_WRITE])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_WRITE, LogstreamResult.Write);
+
/*
* Close the current segment if it's fully written up in the last cycle of
* the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
}
SpinLockRelease(&walrcv->mutex);
+ /*
+ * If we flushed an LSN that someone was waiting for then walk over
+ * the shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Flush >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, LogstreamResult.Flush);
+
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
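
The two walreceiver hooks above share one idea: read the smallest LSN anyone is waiting for, and only call into the wakeup machinery when the newly written or flushed position has reached it. A standalone sketch of that fast-path check follows, using C11 atomics in place of pg_atomic_read_u64(). DemoWaitState and demo_should_wake() are hypothetical names, and the sketch assumes that the minimum is kept at UINT64_MAX while nobody waits, which is my reading of xlogwait.c rather than something shown in this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, single-type slice of the shared wait state. */
typedef struct DemoWaitState
{
    /* Smallest LSN any backend waits for; UINT64_MAX when there is none. */
    _Atomic uint64_t min_waited_lsn;
} DemoWaitState;

/*
 * Cheap check done on every write/flush: returns true only when some
 * waiter's target has been reached, so the common no-waiter case costs a
 * single atomic read and no lock.
 */
static bool
demo_should_wake(DemoWaitState *state, uint64_t reached_lsn)
{
    if (state == NULL)
        return false;           /* shared state not set up yet */
    return reached_lsn >= atomic_load(&state->min_waited_lsn);
}
```

The check can race with a backend that registers a new waiter just after the read, which is presumably why the waiter side re-checks the current LSN itself after adding itself to the heap.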
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d14294a4ece..68dc49dc2da 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,10 +4385,26 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ *
+ * These values are defined to match WaitLSNType in access/xlogwait.h
+ * so that conversion is a plain cast. The values must be kept
+ * in sync with WaitLSNType.
+ */
+typedef enum WaitLSNMode
+{
+ WAIT_LSN_MODE_REPLAY = 0, /* Wait for LSN replay on standby */
+ WAIT_LSN_MODE_FLUSH = 1, /* Wait for LSN flush to disk on standby */
+ WAIT_LSN_MODE_WRITE = 2 /* Wait for LSN write to the OS (not yet
+ * flushed) on standby */
+} WaitLSNMode;
+
typedef struct WaitStmt
{
NodeTag type;
char *lsn_literal; /* LSN string from grammar */
+ WaitLSNMode mode; /* Wait mode: REPLAY/FLUSH/WRITE */
List *options; /* List of DefElem nodes */
} WaitStmt;
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5d4fe27ef96..7ad8b11b725 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -378,6 +379,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..e579b98f019 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
# the WAIT FOR command.
use strict;
use warnings FATAL => 'all';
@@ -62,7 +62,40 @@ $output = $node_standby->safe_psql(
ok((split("\n", $output))[-1] eq 30,
"standby reached the same LSN as primary");
-# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$node_standby->safe_psql('postgres',
+ "WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');");
+
+# Verify via pg_stat_replication that standby reported the write
+my $standby_write_lsn = $node_primary->safe_psql(
+ 'postgres', qq[
+ SELECT write_lsn FROM pg_stat_replication
+ WHERE application_name = 'standby';
+]);
+
+ok( $node_primary->safe_psql('postgres',
+ "SELECT '${standby_write_lsn}'::pg_lsn >= '${lsn_write}'::pg_lsn") eq
+ 't',
+ "standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout. The
# unreachable LSN must be well in advance. So WAL records issued by
# the concurrent autovacuum could not affect that.
my $lsn3 =
@@ -88,7 +121,7 @@ $output = $node_standby->safe_psql(
WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
# within another function, or inside a transaction with an isolation level
# higher than READ COMMITTED.
@@ -125,7 +158,7 @@ ok( $stderr =~
/WAIT FOR must be only called without an active or registered snapshot/,
"get an error when running within another function");
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
my $test_lsn =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
@@ -208,7 +241,7 @@ $node_standby->psql(
ok( $stderr =~ /option "invalid_option" not recognized/,
"get error for invalid WITH clause option");
-# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters. We make 5 background
# psql sessions each waiting for a corresponding insertion. When waiting is
# finished, stored procedures logs if there are visible as many rows as
# should be.
@@ -239,7 +272,7 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->query_until(
qr/start/, qq[
\\echo start
- WAIT FOR LSN '${lsn}';
+ WAIT FOR LSN '${lsn}' MODE REPLAY;
SELECT log_count(${i});
]);
}
@@ -251,23 +284,138 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->quit;
}
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+my @write_sessions;
+my @write_lsns;
+for (my $i = 0; $i < 3; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (100 + ${i});");
+ $write_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $write_sessions[$i] = $node_standby->background_psql('postgres');
+ $write_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+ ]);
+}
+
+# Wait for all WAIT FOR LSN commands to complete
+for (my $i = 0; $i < 3; $i++)
+{
+ $write_sessions[$i]->{run}->finish;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '$write_lsns[2]'::pg_lsn);");
-# 7. Check that the standby promotion terminates the wait on LSN. Start
-# waiting for an unreachable LSN then promote. Check the log for the relevant
-# error message. Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+ok($output >= 0,
+ "multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+my @flush_sessions;
+my @flush_lsns;
+for (my $i = 0; $i < 3; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (200 + ${i});");
+ $flush_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+ $flush_sessions[$i] = $node_standby->background_psql('postgres');
+ $flush_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+ ]);
+}
+
+# Wait for all WAIT FOR LSN commands to complete
+for (my $i = 0; $i < 3; $i++)
+{
+ $flush_sessions[$i]->{run}->finish;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[2]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently. We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN. When all complete, we verify that the replay LSN
+# (the slowest to advance due to recovery_min_apply_delay) has reached the
+# target. Since REPLAY waiters block until replay completes, and WRITE/FLUSH
+# complete earlier, successful completion of all sessions proves proper
+# coordination.
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+ $mixed_sessions[$i] = $node_standby->background_psql('postgres');
+ $mixed_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+ ]);
+}
+
+# Resume replay so REPLAY waiters can complete
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+
+# Wait for all sessions to complete - this blocks until WAIT FOR LSN returns
+for (my $i = 0; $i < 6; $i++)
+{
+ $mixed_sessions[$i]->{run}->finish;
+}
+
+# Verify: if all waiters completed, then the slowest (REPLAY) must have
+# reached the target LSN, which implies WRITE and FLUSH also succeeded
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "mixed mode waiters: all modes completed, replay reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes. Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote. Check the log for the relevant error messages. Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
my $lsn4 =
$node_primary->safe_psql('postgres',
"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
my $lsn5 =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
- qr/start/, qq[
- \\echo start
- WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i] = $node_standby->background_psql('postgres');
+ $wait_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+ ]);
+}
# Make sure standby will be promoted at least at the primary insert LSN we
# have just observed. Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +425,23 @@ $node_primary->wait_for_catchup($node_standby);
$log_offset = -s $node_standby->logfile;
$node_standby->promote;
+
+# Wait for at least one "recovery is not in progress" error to appear
$node_standby->wait_for_log('recovery is not in progress', $log_offset);
-ok(1, 'got error after standby promote');
+# Verify all three sessions got the error by checking the log contains
+# the error message at least three times (from the promotion point)
+my $log_contents = slurp_file($node_standby->logfile, $log_offset);
+my $error_count = () = $log_contents =~ /recovery is not in progress/g;
+ok($error_count >= 3, 'promotion interrupted all wait modes');
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
ok(1, 'wait for already replayed LSN exits immediately even after promotion');
$output = $node_standby->safe_psql(
'postgres', qq[
- WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
ok($output eq "not in recovery",
"WAIT FOR returns correct status after standby promotion");
@@ -295,8 +449,11 @@ ok($output eq "not in recovery",
$node_standby->stop;
$node_primary->stop;
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
# already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i]->{run}->finish;
+}
done_testing();
--
2.51.0
v1-0004-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch
From 1bb41bdd83b37f9ef7237095a368ea21e589d262 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:18:06 +0800
Subject: [PATCH v1 4/5] Add tab completion for WAIT FOR LSN MODE parameter
Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 11 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 20d7a65c614..fcb9f19faef 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5313,10 +5313,11 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
/*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
* where option can be:
* TIMEOUT '<timeout>'
* NO_THROW
+ * MODE defaults to REPLAY if not specified.
*/
else if (Matches("WAIT"))
COMPLETE_WITH("FOR");
@@ -5325,25 +5326,41 @@ match_previous_words(int pattern_id,
else if (Matches("WAIT", "FOR", "LSN"))
/* No completion for LSN value - user must provide manually */
;
+
+ /*
+ * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+ * REPLAY
+ */
else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+ COMPLETE_WITH("MODE", "WITH");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+ COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
COMPLETE_WITH("WITH");
+ /* WITH directly after LSN (using default REPLAY mode) */
else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
COMPLETE_WITH("(");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+ COMPLETE_WITH("(");
+
+ /*
+ * Handle parenthesized option list (both with and without explicit MODE).
+ * These branches fire when we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as one
+ * word, so the tests below are correct. timeout takes a string value,
+ * no_throw takes no value. We don't offer completions for these values.
+ */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
- /*
- * This fires if we're in an unfinished parenthesized option list.
- * get_previous_words treats a completed parenthesized option list as
- * one word, so the above test is correct.
- */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("timeout", "no_throw");
-
- /*
- * timeout takes a string value, no_throw takes no value. We don't
- * offer completions for these values.
- */
+ }
+ else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+ !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+ {
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("timeout", "no_throw");
}
/* WITH [RECURSIVE] */
--
2.51.0
Hi hackers,
On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi!
At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or other) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of
overkill.
I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine a clustering
solution using it to wait for a certain transaction to be flushed at
the replica (without delaying the commit at the primary).
------
Regards,
Alexander Korotkov
Supabase
Makes sense. I'll play with it and try to prepare a follow-up patch.
--
Best,
Xuneng
In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.
Syntax Option A: Mode in the WITH Clause
WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');
With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.
Syntax Option B: Mode as Part of the Main Command
WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';
Or a more concise variant using keywords:
WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;
This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.
I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.
I've implemented a patch that adds MODE support to WAIT FOR LSN.
The new grammar looks like:
------
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
------
Two modes are added: flush and write.
Design decisions:
1. MODE as a separate keyword (not in WITH clause) - This follows the
pattern used by the LOCK command. It also makes the common case more
concise.
2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.
3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`
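To make decisions 2 and 3 concrete, here is a minimal standalone C sketch of how an optional mode keyword could be resolved, with REPLAY as the fallback when the MODE clause is omitted. It is not taken from the patch: SketchWaitMode, sketch_resolve_mode() and sketch_mode_name() are hypothetical names, and the patch itself presumably resolves the keywords in the grammar rather than by comparing strings at run time.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the wait modes; not the patch's enum. */
typedef enum
{
    SKETCH_MODE_REPLAY,
    SKETCH_MODE_WRITE,
    SKETCH_MODE_FLUSH,
    SKETCH_MODE_INVALID
} SketchWaitMode;

/* Case-insensitive string equality using only standard C. */
static int
sketch_ieq(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/*
 * Resolve the optional MODE keyword.  A NULL keyword means the MODE
 * clause was omitted, in which case the mode defaults to REPLAY.
 */
SketchWaitMode
sketch_resolve_mode(const char *keyword)
{
    if (keyword == NULL)
        return SKETCH_MODE_REPLAY;
    if (sketch_ieq(keyword, "replay"))
        return SKETCH_MODE_REPLAY;
    if (sketch_ieq(keyword, "write"))
        return SKETCH_MODE_WRITE;
    if (sketch_ieq(keyword, "flush"))
        return SKETCH_MODE_FLUSH;
    return SKETCH_MODE_INVALID;
}

/* Return the keyword spelling for a valid mode. */
const char *
sketch_mode_name(SketchWaitMode mode)
{
    static const char *const names[] = {"REPLAY", "WRITE", "FLUSH"};

    assert((int) mode >= 0 && (int) mode < (int) SKETCH_MODE_INVALID);
    return names[mode];
}
```

With this sketch, sketch_resolve_mode(NULL) and sketch_resolve_mode("REPLAY") give the same result, which mirrors the intent that existing WAIT FOR LSN statements keep their current behavior.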
The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types
Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.
-------
0002 - Add pg_last_wal_write_lsn() SQL function
Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.
-------
0003 - Add MODE parameter to WAIT FOR LSN command
Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.
-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter
Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.
-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()
Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.
The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.
Feedback welcome.
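The wakeup logic that 0001 extends can be pictured with the following standalone C sketch. It is an illustration under simplifying assumptions, not code from the patch: a small fixed-size array stands in for the pairing heaps, all Sketch* and sketch_* names are hypothetical, and no locking or latches are modelled. It only shows the idea of one waiter queue per wait type, each with a cached minimum awaited LSN that is checked as a fast path before any waiter is examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLSN;

#define SKETCH_NO_WAITER   UINT64_MAX   /* cached minimum when queue is empty */
#define SKETCH_MAX_WAITERS 8

/* Hypothetical wait types; one independent queue exists per type. */
typedef enum
{
    SKETCH_REPLAY,
    SKETCH_WRITE,
    SKETCH_FLUSH,
    SKETCH_TYPE_COUNT
} SketchWaitType;

typedef struct
{
    SketchLSN target[SKETCH_MAX_WAITERS];   /* awaited LSN per slot */
    bool      waiting[SKETCH_MAX_WAITERS];  /* is the slot in use? */
    SketchLSN min_waited;                   /* cached least awaited LSN */
} SketchWaitQueue;

static SketchWaitQueue sketch_queues[SKETCH_TYPE_COUNT];

/* Recompute the cached minimum by scanning the slots still waiting. */
static void
sketch_recompute_min(SketchWaitQueue *q)
{
    q->min_waited = SKETCH_NO_WAITER;
    for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
        if (q->waiting[i] && q->target[i] < q->min_waited)
            q->min_waited = q->target[i];
}

/* Reset every queue to "no waiters". */
void
sketch_init(void)
{
    for (int t = 0; t < SKETCH_TYPE_COUNT; t++)
    {
        for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
            sketch_queues[t].waiting[i] = false;
        sketch_queues[t].min_waited = SKETCH_NO_WAITER;
    }
}

/* Register a waiter in the queue that belongs to its wait type. */
void
sketch_add_waiter(SketchWaitType type, int slot, SketchLSN target)
{
    assert((int) type >= 0 && (int) type < SKETCH_TYPE_COUNT);
    assert(slot >= 0 && slot < SKETCH_MAX_WAITERS);

    sketch_queues[type].target[slot] = target;
    sketch_queues[type].waiting[slot] = true;
    sketch_recompute_min(&sketch_queues[type]);
}

/*
 * Called when the LSN of the given type has advanced to "current".
 * Returns the number of waiters released.
 */
int
sketch_wakeup(SketchWaitType type, SketchLSN current)
{
    SketchWaitQueue *q = &sketch_queues[type];
    int         woken = 0;

    /* Fast path: nobody waits for an LSN this low, so do nothing. */
    if (q->min_waited > current)
        return 0;

    for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
    {
        if (q->waiting[i] && q->target[i] <= current)
        {
            q->waiting[i] = false;
            woken++;
        }
    }
    sketch_recompute_min(q);
    return woken;
}
```

Because each type has its own queue and cached minimum, an advance of the write position in this sketch never touches the replay or flush waiters, which is the separation the per-type heaps in 0001 are meant to provide.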
Here is the updated v2 patch set. Most of the updates are in patch 3.
Changes from v1:
Patch 1 (Extend wait types in xlogwait infra):
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)
Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description
Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed the timing issue when waiting for all three sessions to get the
errors after promotion in TAP test 8.
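As an illustration of the switch-based conversion mentioned for patch 3, a minimal standalone C sketch follows. The enum names and orderings are hypothetical stand-ins (the actual WaitLSNMode definition is not shown in this message). The point is only that an explicit switch keeps the mapping correct even when the two enums are ordered differently, which a direct cast would not.

```c
#include <assert.h>

/* Hypothetical user-facing modes, deliberately ordered REPLAY, FLUSH, WRITE. */
typedef enum
{
    SKETCH_MODE_REPLAY,
    SKETCH_MODE_FLUSH,
    SKETCH_MODE_WRITE
} SketchWaitLSNMode;

/* Hypothetical internal wait types, in a different order. */
typedef enum
{
    SKETCH_TYPE_REPLAY_STANDBY,
    SKETCH_TYPE_WRITE_STANDBY,
    SKETCH_TYPE_FLUSH_STANDBY,
    SKETCH_TYPE_FLUSH_PRIMARY,
    SKETCH_TYPE_COUNT
} SketchWaitLSNType;

/*
 * Convert a mode to the matching standby wait type.  Every mode is
 * listed explicitly, so reordering either enum cannot silently change
 * the mapping the way a numeric cast would.
 */
SketchWaitLSNType
sketch_mode_to_type(SketchWaitLSNMode mode)
{
    switch (mode)
    {
        case SKETCH_MODE_REPLAY:
            return SKETCH_TYPE_REPLAY_STANDBY;
        case SKETCH_MODE_FLUSH:
            return SKETCH_TYPE_FLUSH_STANDBY;
        case SKETCH_MODE_WRITE:
            return SKETCH_TYPE_WRITE_STANDBY;
    }

    /* Only reached if the caller passes a value outside the enum. */
    assert(0 && "invalid wait mode");
    return SKETCH_TYPE_COUNT;
}
```

In this sketch a cast of SKETCH_MODE_FLUSH (value 1) would land on SKETCH_TYPE_WRITE_STANDBY (also value 1), while the switch returns SKETCH_TYPE_FLUSH_STANDBY as intended.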
--
Best,
Xuneng
Attachments:
v2-0005-Use-WAIT-FOR-LSN-in.patch
From 02b633402db35770fd70ace6c1e6301f3dd6741b Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:20:18 +0800
Subject: [PATCH v2 5/5] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.
The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..b2a4e2e2253 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+ chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
- chomp($isrecovery);
if ($isrecovery eq 't')
{
$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,35 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+ # This is more efficient than polling pg_stat_replication
+ if (($mode ne 'sent') && ($isrecovery eq 't'))
+ {
+ my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+ # Map mode names to WAIT FOR LSN MODE values (uppercase)
+ my $wait_mode = uc($mode);
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+ chomp($output);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+${output});
+ diag qq(Last pg_stat_replication contents:
+${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
+ # Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.51.0
v2-0004-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch
From 7fcaab3d495ccc42c3f9731d1de9a15c33c01ee8 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:18:06 +0800
Subject: [PATCH v2 4/5] Add tab completion for WAIT FOR LSN MODE parameter
Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 11 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 20d7a65c614..fcb9f19faef 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5313,10 +5313,11 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
/*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
* where option can be:
* TIMEOUT '<timeout>'
* NO_THROW
+ * MODE defaults to REPLAY if not specified.
*/
else if (Matches("WAIT"))
COMPLETE_WITH("FOR");
@@ -5325,25 +5326,41 @@ match_previous_words(int pattern_id,
else if (Matches("WAIT", "FOR", "LSN"))
/* No completion for LSN value - user must provide manually */
;
+
+ /*
+ * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+ * REPLAY
+ */
else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+ COMPLETE_WITH("MODE", "WITH");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+ COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
COMPLETE_WITH("WITH");
+ /* WITH directly after LSN (using default REPLAY mode) */
else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
COMPLETE_WITH("(");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+ COMPLETE_WITH("(");
+
+ /*
+ * Handle parenthesized option list (both with and without explicit MODE).
+ * These branches fire when we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as one
+ * word, so the tests below are correct. timeout takes a string value,
+ * no_throw takes no value. We don't offer completions for these values.
+ */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
- /*
- * This fires if we're in an unfinished parenthesized option list.
- * get_previous_words treats a completed parenthesized option list as
- * one word, so the above test is correct.
- */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("timeout", "no_throw");
-
- /*
- * timeout takes a string value, no_throw takes no value. We don't
- * offer completions for these values.
- */
+ }
+ else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+ !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+ {
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("timeout", "no_throw");
}
/* WITH [RECURSIVE] */
--
2.51.0
v2-0002-Add-pg_last_wal_write_lsn-SQL-function.patch
From 9d22e09d378e8f6c52aa95bc4a0e1650f4621a39 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:07:52 +0800
Subject: [PATCH v2 2/5] Add pg_last_wal_write_lsn() SQL function
Returns the current WAL write position on a standby server using
GetWalRcvWriteRecPtr(). This enables verification of WAIT FOR LSN MODE WRITE
and operational monitoring of standby WAL write progress.
---
doc/src/sgml/func/func-admin.sgml | 22 ++++++++++++++++++++++
src/backend/access/transam/xlogfuncs.c | 20 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 4 ++++
3 files changed, 46 insertions(+)
diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index 1b465bc8ba7..9ff196c4be4 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -688,6 +688,28 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_last_wal_write_lsn</primary>
+ </indexterm>
+ <function>pg_last_wal_write_lsn</function> ()
+ <returnvalue>pg_lsn</returnvalue>
+ </para>
+ <para>
+ Returns the last write-ahead log location that has been received and
+ passed to the operating system by streaming replication, but not
+ necessarily synced to durable storage. This location can be ahead of
+ <function>pg_last_wal_receive_lsn</function> but provides weaker
+ durability guarantees since the data may still be in OS buffers.
+ While streaming replication is in progress this will increase
+ monotonically. If recovery has completed then this will remain static
+ at the location of the last WAL record written during recovery. If
+ streaming replication is disabled, or if it has not yet started, the
+ function returns <literal>NULL</literal>.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e45fce43ed..2797b2bf158 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -347,6 +347,26 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
PG_RETURN_LSN(recptr);
}
+/*
+ * Report the last WAL write location (same format as pg_backup_start etc)
+ *
+ * This is useful for determining how much of WAL has been received and
+ * passed to the operating system by walreceiver. Unlike pg_last_wal_receive_lsn,
+ * this data may still be in OS buffers and not yet synced to durable storage.
+ */
+Datum
+pg_last_wal_write_lsn(PG_FUNCTION_ARGS)
+{
+ XLogRecPtr recptr;
+
+ recptr = GetWalRcvWriteRecPtr();
+
+ if (!XLogRecPtrIsValid(recptr))
+ PG_RETURN_NULL();
+
+ PG_RETURN_LSN(recptr);
+}
+
/*
* Report the last WAL replay location (same format as pg_backup_start etc)
*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66431940700..478e0a8139f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6782,6 +6782,10 @@
proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
prorettype => 'pg_lsn', proargtypes => '',
prosrc => 'pg_last_wal_receive_lsn' },
+{ oid => '6434', descr => 'last wal write location on standby',
+ proname => 'pg_last_wal_write_lsn', provolatile => 'v',
+ prorettype => 'pg_lsn', proargtypes => '',
+ prosrc => 'pg_last_wal_write_lsn' },
{ oid => '3821', descr => 'last wal replay location',
proname => 'pg_last_wal_replay_lsn', provolatile => 'v',
prorettype => 'pg_lsn', proargtypes => '',
--
2.51.0
v2-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch
From da210bfc2b62d9a38ea54b94037380144753663a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 17:58:28 +0800
Subject: [PATCH v2 1/5] Extend xlogwait infrastructure with write and flush
wait types
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.
Key changes:
- Add WAIT_LSN_TYPE_WRITE_STANDBY and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY → WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)
- Add GetCurrentLSNForWaitType() to retrieve current LSN
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
src/backend/access/transam/xlog.c | 2 +-
src/backend/access/transam/xlogrecovery.c | 4 +-
src/backend/access/transam/xlogwait.c | 84 ++++++++++++++-----
src/backend/commands/wait.c | 2 +-
.../utils/activity/wait_event_names.txt | 3 +-
src/include/access/xlogwait.h | 13 ++-
6 files changed, 81 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..4b145515269 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6243,7 +6243,7 @@ StartupXLOG(void)
* Wake up all waiters for replay LSN. They need to report an error that
* recovery was ended before reaching the target LSN.
*/
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..243c0b368a9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
*/
if (waitLSNState &&
(XLogRecoveryCtl->lastReplayedEndRecPtr >=
- pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, XLogRecoveryCtl->lastReplayedEndRecPtr);
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 98aa5f1e4a2..21823acee9c 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
* This file implements waiting for WAL operations to reach specific LSNs
* on both physical standby and primary servers. The core idea is simple:
* every process that wants to wait publishes the LSN it needs to the
- * shared memory, and the appropriate process (startup on standby, or
- * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ * shared memory, and the appropriate process (startup on standby,
+ * walreceiver on standby, or WAL writer/backend on primary) wakes it
+ * once that LSN has been reached.
*
* The shared memory used by this module comprises a procInfos
* per-backend array with the information of the awaited LSN for each
* of the backend processes. The elements of that array are organized
- * into a pairing heap waitersHeap, which allows for very fast finding
- * of the least awaited LSN.
+ * into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ * allows for very fast finding of the least awaited LSN for each type.
*
- * In addition, the least-awaited LSN is cached as minWaitedLSN. The
- * waiter process publishes information about itself to the shared
- * memory and waits on the latch until it is woken up by the appropriate
- * process, standby is promoted, or the postmaster dies. Then, it cleans
- * information about itself in the shared memory.
+ * In addition, the least-awaited LSN for each type is cached in the
+ * minWaitedLSN array. The waiter process publishes information about
+ * itself to the shared memory and waits on the latch until it is woken
+ * up by the appropriate process, standby is promoted, or the postmaster
+ * dies. Then, it cleans information about itself in the shared memory.
*
- * On standby servers: After replaying a WAL record, the startup process
- * first performs a fast path check minWaitedLSN > replayLSN. If this
- * check is negative, it checks waitersHeap and wakes up the backend
- * whose awaited LSNs are reached.
+ * On standby servers:
+ * - After replaying a WAL record, the startup process performs a fast
+ * path check minWaitedLSN[REPLAY] > replayLSN. If this check is
+ * negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ * whose awaited LSNs are reached.
+ * - After receiving WAL, the walreceiver process performs similar checks
+ * against the flush and write LSNs, waking up waiters in the FLUSH
+ * and WRITE heaps respectively.
*
* On primary servers: After flushing WAL, the WAL writer or backend
* process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -62,6 +68,48 @@ static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
struct WaitLSNState *waitLSNState = NULL;
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+ [WAIT_LSN_TYPE_REPLAY_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+ [WAIT_LSN_TYPE_WRITE_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+ [WAIT_LSN_TYPE_FLUSH_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+ [WAIT_LSN_TYPE_FLUSH_PRIMARY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+ "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+ switch (lsnType)
+ {
+ case WAIT_LSN_TYPE_REPLAY_STANDBY:
+ return GetXLogReplayRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_WRITE_STANDBY:
+ return GetWalRcvWriteRecPtr();
+
+ case WAIT_LSN_TYPE_FLUSH_STANDBY:
+ return GetWalRcvFlushRecPtr(NULL, NULL);
+
+ case WAIT_LSN_TYPE_FLUSH_PRIMARY:
+ return GetFlushRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_COUNT:
+ break;
+ }
+
+ elog(ERROR, "invalid LSN wait type: %d", lsnType);
+ pg_unreachable();
+}
+
/* Report the amount of shared memory space needed for WaitLSNState. */
Size
WaitLSNShmemSize(void)
@@ -341,13 +389,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
int rc;
long delay_ms = -1;
- if (lsnType == WAIT_LSN_TYPE_REPLAY)
- currentLSN = GetXLogReplayRecPtr(NULL);
- else
- currentLSN = GetFlushRecPtr(NULL);
+ /* Get current LSN for the wait type */
+ currentLSN = GetCurrentLSNForWaitType(lsnType);
/* Check that recovery is still in-progress */
- if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+ if (lsnType != WAIT_LSN_TYPE_FLUSH_PRIMARY && !RecoveryInProgress())
{
/*
* Recovery was ended, but check if target LSN was already
@@ -376,7 +422,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
CHECK_FOR_INTERRUPTS();
rc = WaitLatch(MyLatch, wake_events, delay_ms,
- (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+ WaitLSNWaitEvents[lsnType]);
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..43b37095afb 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..fbcdb92dcfb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary or standby."
WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE "Waiting for WAL write to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index e607441d618..9721a7a7195 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,9 +35,15 @@ typedef enum
*/
typedef enum WaitLSNType
{
- WAIT_LSN_TYPE_REPLAY = 0, /* Waiting for replay on standby */
- WAIT_LSN_TYPE_FLUSH = 1, /* Waiting for flush on primary */
- WAIT_LSN_TYPE_COUNT = 2
+ /* Standby wait types (walreceiver/startup wakes) */
+ WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
+ WAIT_LSN_TYPE_WRITE_STANDBY = 1,
+ WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
+
+ /* Primary wait types (WAL writer/backends wake) */
+ WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
+
+ WAIT_LSN_TYPE_COUNT = 4
} WaitLSNType;
/*
@@ -96,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
extern Size WaitLSNShmemSize(void);
extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
--
2.51.0
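The per-type fast path described in the xlogwait.c header comment above can be sketched in isolation. This is only an illustration under stated assumptions: every name below (SketchState, sketch_wakeup, and so on) is hypothetical, a linear scan over a fixed array stands in for the pairing heaps, and there are no latches, locks, or atomics. It also does not model the "wake everyone" case that the patch triggers by passing InvalidXLogRecPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names exist in PostgreSQL. */
typedef uint64_t SketchLSN;

typedef enum SketchWaitType
{
    SKETCH_REPLAY = 0,
    SKETCH_WRITE,
    SKETCH_FLUSH,
    SKETCH_TYPE_COUNT
} SketchWaitType;

#define SKETCH_MAX_WAITERS 8
#define SKETCH_NO_WAITER   UINT64_MAX   /* "nobody is waiting" marker */

typedef struct SketchWaiter
{
    bool            in_use;
    bool            woken;
    SketchWaitType  type;
    SketchLSN       target;
} SketchWaiter;

typedef struct SketchState
{
    /* Cached least-awaited LSN, one slot per wait type. */
    SketchLSN    min_waited[SKETCH_TYPE_COUNT];
    SketchWaiter waiters[SKETCH_MAX_WAITERS];
} SketchState;

static void
sketch_init(SketchState *st)
{
    for (int t = 0; t < SKETCH_TYPE_COUNT; t++)
        st->min_waited[t] = SKETCH_NO_WAITER;
    for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
        st->waiters[i] = (SketchWaiter) {false, false, SKETCH_REPLAY, 0};
}

/* Register a waiter; returns its slot index, or -1 if the array is full. */
static int
sketch_add_waiter(SketchState *st, SketchWaitType type, SketchLSN target)
{
    for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
    {
        if (!st->waiters[i].in_use)
        {
            st->waiters[i] = (SketchWaiter) {true, false, type, target};
            if (target < st->min_waited[type])
                st->min_waited[type] = target;
            return i;
        }
    }
    return -1;
}

/* Wake waiters of one type whose target is reached; returns how many. */
static int
sketch_wakeup(SketchState *st, SketchWaitType type, SketchLSN current)
{
    SketchLSN new_min = SKETCH_NO_WAITER;
    int       nwoken = 0;

    /* Fast path: the least awaited LSN of this type is still ahead. */
    if (st->min_waited[type] > current)
        return 0;

    for (int i = 0; i < SKETCH_MAX_WAITERS; i++)
    {
        SketchWaiter *w = &st->waiters[i];

        if (!w->in_use || w->woken || w->type != type)
            continue;
        if (w->target <= current)
        {
            w->woken = true;    /* the real code would set the latch */
            nwoken++;
        }
        else if (w->target < new_min)
            new_min = w->target;
    }

    /* Refresh the cache so the next call can take the fast path again. */
    st->min_waited[type] = new_min;
    return nwoken;
}
```

Keeping one cached minimum per type means that a WRITE or FLUSH advance never has to look at REPLAY waiters, which appears to be the motivation for splitting the single heap into one heap per WaitLSNType.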
Attachment: v2-0003-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch
From 1367c1f3322b93190fcd4ca70ab309efd8556c77 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:11:54 +0800
Subject: [PATCH v2] Add MODE parameter to WAIT FOR LSN command
Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN
The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by the LOCK command, where
the mode parameter is optional with a sensible default.
The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.
Also includes:
- Documentation updates for the new syntax and refactoring
of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
doc/src/sgml/ref/wait_for.sgml | 188 +++++++++++----
src/backend/access/transam/xlog.c | 6 +-
src/backend/commands/wait.c | 64 ++++-
src/backend/parser/gram.y | 21 +-
src/backend/replication/walreceiver.c | 19 ++
src/include/nodes/parsenodes.h | 11 +
src/include/parser/kwlist.h | 2 +
src/test/recovery/t/049_wait_for_lsn.pl | 299 ++++++++++++++++++++++--
8 files changed, 523 insertions(+), 87 deletions(-)
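As a reading aid for the wait.c changes in this patch, here is a minimal standalone sketch of the descriptor-table idea: a table indexed by mode supplies the noun and verb that are spliced into the timeout message. The names SketchMode, sketch_descs, and sketch_format_timeout are hypothetical and exist only for this illustration; the format string and the split of the 64-bit LSN into two 32-bit halves are modeled on the errmsg() call and LSN_FORMAT_ARGS usage shown in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the three standby wait modes. */
typedef enum SketchMode
{
    SKETCH_MODE_REPLAY = 0,
    SKETCH_MODE_WRITE,
    SKETCH_MODE_FLUSH,
    SKETCH_MODE_COUNT
} SketchMode;

typedef struct SketchModeDesc
{
    const char *noun;   /* "replay", "write", "flush" */
    const char *verb;   /* "replayed", "written", "flushed" */
} SketchModeDesc;

/* Designated initializers keep each entry tied to its enum value. */
static const SketchModeDesc sketch_descs[] = {
    [SKETCH_MODE_REPLAY] = {"replay", "replayed"},
    [SKETCH_MODE_WRITE] = {"write", "written"},
    [SKETCH_MODE_FLUSH] = {"flush", "flushed"},
};

/* Compile-time check that the table covers every mode. */
_Static_assert(sizeof(sketch_descs) / sizeof(sketch_descs[0]) == SKETCH_MODE_COUNT,
               "sketch_descs must match SketchMode");

/*
 * Build the timeout message for a mode.  The LSN is printed as
 * high-half/low-half in hex, with the low half zero-padded to 8 digits.
 * Returns snprintf()'s result.
 */
static int
sketch_format_timeout(char *buf, size_t buflen, SketchMode mode,
                      uint64_t target, uint64_t current)
{
    const SketchModeDesc *desc = &sketch_descs[mode];

    return snprintf(buf, buflen,
                    "timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
                    (unsigned int) (target >> 32), (unsigned int) target,
                    desc->verb,
                    desc->noun,
                    (unsigned int) (current >> 32), (unsigned int) current);
}
```

One possible trade-off of this design is that a message assembled from noun and verb fragments is harder to translate than one complete string per mode, so the table approach favors compactness over translatability.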
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..a5e7f6c6fe9 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
<refnamediv>
<refname>WAIT FOR</refname>
- <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
</refnamediv>
<refsynopsisdiv>
<synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+ [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Description</title>
<para>
- Waits until recovery replays <parameter>lsn</parameter>.
- If no <parameter>timeout</parameter> is specified or it is set to
- zero, this command waits indefinitely for the
- <parameter>lsn</parameter>.
- On timeout, or if the server is promoted before
- <parameter>lsn</parameter> is reached, an error is emitted,
- unless <literal>NO_THROW</literal> is specified in the WITH clause.
- If <parameter>NO_THROW</parameter> is specified, then the command
- doesn't throw errors.
+ Waits until the specified <parameter>lsn</parameter> is reached
+ according to the specified <parameter>mode</parameter>,
+ which determines whether to wait for WAL to be written, flushed, or replayed.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
</para>
<para>
- The possible return values are <literal>success</literal>,
- <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
</para>
</refsect1>
@@ -64,6 +67,57 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>MODE</literal></term>
+ <listitem>
+ <para>
+ Specifies the type of LSN processing to wait for. If not specified,
+ the default is <literal>REPLAY</literal>. The valid modes are:
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>REPLAY</literal></term>
+ <listitem>
+ <para>
+ Wait for the LSN to be replayed (applied to the database).
+ After successful completion, <function>pg_last_wal_replay_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>FLUSH</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and synced to durable storage via <function>fsync()</function>.
+ This provides a durability guarantee without waiting for the WAL
+ to be applied. After successful completion,
+ <function>pg_last_wal_receive_lsn()</function> will return a value
+ greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WRITE</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and passed to the operating system via <function>write()</function>.
+ This is faster than <literal>FLUSH</literal> but provides weaker
+ durability guarantees since the data may still be in OS buffers.
+ After successful completion, <function>pg_last_wal_write_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
<listitem>
@@ -135,9 +189,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<listitem>
<para>
This return value denotes that the database server is not in a recovery
- state. This might mean either the database server was not in recovery
- at the moment of receiving the command, or it was promoted before
- reaching the target <parameter>lsn</parameter>.
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command (i.e., executed on a primary),
+ or it was promoted before reaching the target <parameter>lsn</parameter>.
+ In the promotion case, this status indicates a timeline change occurred,
+ and the application should re-evaluate whether the target LSN is still
+ relevant.
</para>
</listitem>
</varlistentry>
@@ -148,25 +205,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Notes</title>
<para>
- <command>WAIT FOR</command> command waits till
- <parameter>lsn</parameter> to be replayed on standby.
- That is, after this command execution, the value returned by
- <function>pg_last_wal_replay_lsn</function> should be greater or equal
- to the <parameter>lsn</parameter> value. This is useful to achieve
- read-your-writes-consistency, while using async replica for reads and
- primary for writes. In that case, the <acronym>lsn</acronym> of the last
- modification should be stored on the client application side or the
- connection pooler side.
+ <command>WAIT FOR</command> waits until the specified
+ <parameter>lsn</parameter> is reached according to the specified
+ <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+ for the LSN to be replayed (applied to the database), which is useful
+ to achieve read-your-writes consistency while using an async replica
+ for reads and the primary for writes. The <literal>FLUSH</literal> mode
+ waits for the WAL to be flushed to durable storage on the replica,
+ providing a durability guarantee without waiting for replay. The
+ <literal>WRITE</literal> mode waits for the WAL to be written to the
+ operating system, which is faster than flush but provides weaker
+ durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or
+ the connection pooler side.
</para>
<para>
- <command>WAIT FOR</command> command should be called on standby.
- If a user runs <command>WAIT FOR</command> on primary, it
- will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
- However, if <command>WAIT FOR</command> is
- called on primary promoted from standby and <literal>lsn</literal>
- was already replayed, then the <command>WAIT FOR</command> command just
- exits immediately.
+ <command>WAIT FOR</command> should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified
+ in the WITH clause. However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already reached, then the <command>WAIT FOR</command> command
+ just exits immediately. If the replica is promoted while waiting,
+ the command will return <literal>not in recovery</literal> (or throw
+ an error if <literal>NO_THROW</literal> is not specified). Promotion
+ creates a new timeline, and the LSN being waited for may refer to
+ WAL from the old timeline.
</para>
</refsect1>
@@ -175,21 +240,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Examples</title>
<para>
- You can use <command>WAIT FOR</command> command to wait for
- the <type>pg_lsn</type> value. For example, an application could update
- the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
- changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
- on primary server to get the <acronym>lsn</acronym> given that
- <varname>synchronous_commit</varname> could be set to
- <literal>off</literal>.
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>lsn</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
<programlisting>
postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
UPDATE 100
postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
(1 row)
</programlisting>
@@ -198,9 +263,9 @@ pg_current_wal_insert_lsn
changes made on primary should be guaranteed to be visible on replica.
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
status
---------
+---------
success
(1 row)
postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +276,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
</para>
<para>
- If the target LSN is not reached before the timeout, the error is thrown.
+ Wait for flush (data durable on replica):
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
</programlisting>
</para>
<para>
The same example uses <command>WAIT FOR</command> with
- <parameter>NO_THROW</parameter> option.
+ the <parameter>NO_THROW</parameter> option:
+
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
status
---------
+---------
timeout
(1 row)
</programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b145515269..5b2a262ff8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6240,10 +6240,12 @@ StartupXLOG(void)
LWLockRelease(ControlFileLock);
/*
- * Wake up all waiters for replay LSN. They need to report an error that
- * recovery was ended before reaching the target LSN.
+ * Wake up all waiters. They need to report an error that recovery was
+ * ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 43b37095afb..05ad84fdb5b 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
*
* wait.c
* Implements WAIT FOR, which allows waiting for events such as
- * time passing or LSN having been replayed on replica.
+ * time passing or LSN having been replayed, flushed, or written.
*
* Portions Copyright (c) 2025, PostgreSQL Global Development Group
*
@@ -15,6 +15,7 @@
#include <math.h>
+#include "access/xlog.h"
#include "access/xlogrecovery.h"
#include "access/xlogwait.h"
#include "commands/defrem.h"
@@ -28,12 +29,28 @@
#include "utils/snapmgr.h"
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+ const char *noun; /* "replay", "flush", "write" */
+ const char *verb; /* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+ [WAIT_LSN_TYPE_REPLAY_STANDBY] = {"replay", "replayed"},
+ [WAIT_LSN_TYPE_WRITE_STANDBY] = {"write", "written"},
+ [WAIT_LSN_TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
+};
+
void
ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
{
XLogRecPtr lsn;
int64 timeout = 0;
WaitLSNResult waitLSNResult;
+ WaitLSNType lsnType;
bool throw = true;
TupleDesc tupdesc;
TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
CStringGetDatum(stmt->lsn_literal)));
+ /* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+ switch (stmt->mode)
+ {
+ case WAIT_LSN_MODE_REPLAY:
+ lsnType = WAIT_LSN_TYPE_REPLAY_STANDBY;
+ break;
+ case WAIT_LSN_MODE_WRITE:
+ lsnType = WAIT_LSN_TYPE_WRITE_STANDBY;
+ break;
+ case WAIT_LSN_MODE_FLUSH:
+ lsnType = WAIT_LSN_TYPE_FLUSH_STANDBY;
+ break;
+ default:
+ elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+ }
+
foreach_node(DefElem, defel, stmt->options)
{
if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
}
/*
- * We are going to wait for the LSN replay. We should first care that we
- * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * We are going to wait for the LSN. We should first make sure that we
+ * don't hold a snapshot and, correspondingly, that MyProc->xmin is invalid.
* Otherwise, our snapshot could prevent the replay of WAL records
* implying a kind of self-deadlock. This is the reason why WAIT FOR is a
* command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
+ waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_TIMEOUT:
if (throw)
+ {
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
- errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
+ }
else
result = "timeout";
break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errhint("Waiting for the replay LSN can only be executed during recovery."));
+ errhint("Waiting for the %s LSN can only be executed during recovery.",
+ desc->noun));
}
else
result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c3a0a354a9c..57b3e91893c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -640,6 +640,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <windef> window_definition over_clause window_specification
opt_frame_clause frame_extent frame_bound
%type <ival> null_treatment opt_window_exclusion_clause
+%type <ival> opt_wait_lsn_mode
%type <str> opt_existing_window_name
%type <boolean> opt_if_not_exists
%type <boolean> opt_unique_null_treatment
@@ -729,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
EXPRESSION EXTENSION EXTERNAL EXTRACT
- FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+ FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -770,7 +771,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -16489,15 +16490,23 @@ xml_passing_mech:
*****************************************************************************/
WaitStmt:
- WAIT FOR LSN_P Sconst opt_wait_with_clause
+ WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
{
WaitStmt *n = makeNode(WaitStmt);
n->lsn_literal = $4;
- n->options = $5;
+ n->mode = $5;
+ n->options = $6;
$$ = (Node *) n;
}
;
+opt_wait_lsn_mode:
+ MODE REPLAY { $$ = WAIT_LSN_MODE_REPLAY; }
+ | MODE FLUSH { $$ = WAIT_LSN_MODE_FLUSH; }
+ | MODE WRITE { $$ = WAIT_LSN_MODE_WRITE; }
+ | /*EMPTY*/ { $$ = WAIT_LSN_MODE_REPLAY; }
+ ;
+
opt_wait_with_clause:
WITH '(' utility_option_list ')' { $$ = $3; }
| /*EMPTY*/ { $$ = NIL; }
@@ -17937,6 +17946,7 @@ unreserved_keyword:
| FILTER
| FINALIZE
| FIRST_P
+ | FLUSH
| FOLLOWING
| FORCE
| FORMAT
@@ -18071,6 +18081,7 @@ unreserved_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESPECT_P
@@ -18524,6 +18535,7 @@ bare_label_keyword:
| FINALIZE
| FIRST_P
| FLOAT_P
+ | FLUSH
| FOLLOWING
| FORCE
| FOREIGN
@@ -18706,6 +18718,7 @@ bare_label_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..be2971408e7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "catalog/pg_authid.h"
#include "funcapi.h"
#include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
/* Update shared-memory status */
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ /*
+ * If we wrote an LSN that someone was waiting for, then walk over the
+ * shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Write >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_WRITE_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, LogstreamResult.Write);
+
/*
* Close the current segment if it's fully written up in the last cycle of
* the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
}
SpinLockRelease(&walrcv->mutex);
+ /*
+ * If we flushed an LSN that someone was waiting for, then walk over
+ * the shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Flush >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, LogstreamResult.Flush);
+
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d14294a4ece..bbaf3242ccb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,10 +4385,21 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+ WAIT_LSN_MODE_REPLAY, /* Wait for LSN replay on standby */
+ WAIT_LSN_MODE_WRITE, /* Wait for LSN write on standby */
+ WAIT_LSN_MODE_FLUSH /* Wait for LSN flush on standby */
+} WaitLSNMode;
+
typedef struct WaitStmt
{
NodeTag type;
char *lsn_literal; /* LSN string from grammar */
+ WaitLSNMode mode; /* Wait mode: REPLAY/FLUSH/WRITE */
List *options; /* List of DefElem nodes */
} WaitStmt;
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5d4fe27ef96..7ad8b11b725 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -378,6 +379,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..6c9a463775b 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
# the WAIT FOR command.
use strict;
use warnings FATAL => 'all';
@@ -62,7 +62,34 @@ $output = $node_standby->safe_psql(
ok((split("\n", $output))[-1] eq 30,
"standby reached the same LSN as primary");
-# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout. The
# unreachable LSN must be well in advance. So WAL records issued by
# the concurrent autovacuum could not affect that.
my $lsn3 =
@@ -88,7 +115,7 @@ $output = $node_standby->safe_psql(
WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
# within another function, or inside a transaction with an isolation level
# higher than READ COMMITTED.
@@ -125,7 +152,7 @@ ok( $stderr =~
/WAIT FOR must be only called without an active or registered snapshot/,
"get an error when running within another function");
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
my $test_lsn =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
@@ -208,7 +235,7 @@ $node_standby->psql(
ok( $stderr =~ /option "invalid_option" not recognized/,
"get error for invalid WITH clause option");
-# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters. We make 5 background
# psql sessions each waiting for a corresponding insertion. When waiting is
# finished, stored procedures logs if there are visible as many rows as
# should be.
@@ -239,7 +266,7 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->query_until(
qr/start/, qq[
\\echo start
- WAIT FOR LSN '${lsn}';
+ WAIT FOR LSN '${lsn}' MODE REPLAY;
SELECT log_count(${i});
]);
}
@@ -251,23 +278,239 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->quit;
}
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+my $orig_conninfo = $node_standby->safe_psql('postgres',
+ "SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+$node_standby->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '';
+ SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 3; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (100 + ${i});");
+ $write_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $write_sessions[$i] = $node_standby->background_psql('postgres');
+ $write_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 3 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
+ SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+ "SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 3; $i++)
+{
+ $node_standby->wait_for_log("write_done $i", $write_log_offset);
+ $write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '$write_lsns[2]'::pg_lsn);");
+
+ok($output >= 0,
+ "multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+$node_standby->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '';
+ SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 3; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (200 + ${i});");
+ $flush_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $flush_sessions[$i] = $node_standby->background_psql('postgres');
+ $flush_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 3 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
+ SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+ "SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 3; $i++)
+{
+ $node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+ $flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[2]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently. We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN. We stop the walreceiver and pause replay to
+# ensure all waiters block. Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+my $orig_conninfo_7d = $node_standby->safe_psql('postgres',
+ "SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+$node_standby->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '';
+ SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+ $mixed_sessions[$i] = $node_standby->background_psql('postgres');
+ $mixed_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+$node_standby->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '$orig_conninfo_7d';
+ SELECT pg_reload_conf();
+]);
+$node_standby->poll_query_until('postgres',
+ "SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+ $node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+ $mixed_sessions[$i]->quit;
+}
-# 7. Check that the standby promotion terminates the wait on LSN. Start
-# waiting for an unreachable LSN then promote. Check the log for the relevant
-# error message. Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+ "mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes. Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote. Check the log for the relevant error messages. Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
my $lsn4 =
$node_primary->safe_psql('postgres',
"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
my $lsn5 =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
- qr/start/, qq[
- \\echo start
- WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i] = $node_standby->background_psql('postgres');
+ $wait_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+ ]);
+}
# Make sure standby will be promoted at least at the primary insert LSN we
# have just observed. Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +520,24 @@ $node_primary->wait_for_catchup($node_standby);
$log_offset = -s $node_standby->logfile;
$node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+ $log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+ $log_offset);
+$node_standby->wait_for_log(
+ qr/Recovery ended before target LSN.*was replayed/, $log_offset);
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
ok(1, 'wait for already replayed LSN exits immediately even after promotion');
$output = $node_standby->safe_psql(
'postgres', qq[
- WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
ok($output eq "not in recovery",
"WAIT FOR returns correct status after standby promotion");
@@ -295,8 +545,11 @@ ok($output eq "not in recovery",
$node_standby->stop;
$node_primary->stop;
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
# already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i]->{run}->finish;
+}
done_testing();
--
2.51.0
Hi,
On Mon, Dec 1, 2025 at 12:33 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi hackers,
On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi!
At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.
I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for a certain transaction to be
flushed at the replica (without delaying the commit at the primary).
------
Regards,
Alexander Korotkov
Supabase
Makes sense. I'll play with it and try to prepare a follow-up patch.
--
Best,
Xuneng
In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.
Syntax Option A: Mode in the WITH Clause
WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');
With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.
Syntax Option B: Mode as Part of the Main Command
WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';
Or a more concise variant using keywords:
WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;
This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.
I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.
I've implemented a patch that adds MODE support to WAIT FOR LSN.
The new grammar looks like:
------
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
------
Two modes are added: flush and write.
Design decisions:
1. MODE as a separate keyword (not in WITH clause) - This follows the
pattern used by LOCK command. It also makes the common case more
concise.
2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.
3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`
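To make the three decisions above concrete, here is a minimal standalone sketch of how an optional MODE keyword could resolve to a mode value, with REPLAY as the default when the clause is omitted. This is only a model for discussion: in the patch the keywords are handled by the grammar, and every name below (DemoWaitMode, demo_keyword_equals, demo_resolve_wait_mode) is hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical enum for illustration; the patch defines its own type. */
typedef enum DemoWaitMode
{
	DEMO_WAIT_MODE_REPLAY,
	DEMO_WAIT_MODE_WRITE,
	DEMO_WAIT_MODE_FLUSH
} DemoWaitMode;

/* Case-insensitive equality of two NUL-terminated ASCII strings. */
static int
demo_keyword_equals(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0')
	{
		if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
			return 0;
		a++;
		b++;
	}
	return *a == '\0' && *b == '\0';
}

/*
 * Resolve an optional MODE keyword.  A NULL keyword models an omitted
 * MODE clause, which falls back to REPLAY.  Returns 0 on success and -1
 * if the keyword is not one of the three modes.
 */
static int
demo_resolve_wait_mode(const char *keyword, DemoWaitMode *mode)
{
	assert(mode != NULL);

	if (keyword == NULL || demo_keyword_equals(keyword, "replay"))
		*mode = DEMO_WAIT_MODE_REPLAY;
	else if (demo_keyword_equals(keyword, "write"))
		*mode = DEMO_WAIT_MODE_WRITE;
	else if (demo_keyword_equals(keyword, "flush"))
		*mode = DEMO_WAIT_MODE_FLUSH;
	else
		return -1;

	return 0;
}
```

With this model, a statement without MODE and one with MODE REPLAY resolve to the same value, which is why existing callers that omit the clause keep their current behavior.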
The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types
Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.
-------
0002 - Add pg_last_wal_write_lsn() SQL function
Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.
-------
0003 - Add MODE parameter to WAIT FOR LSN command
Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.
-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter
Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.
-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()
Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.
The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.
Feedback welcome.
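As a rough illustration of the idea behind GetCurrentLSNForWaitType() in 0001, the sketch below picks the standby position that matches a wait type and checks whether a waiter's target has been reached. It is a simplified standalone model, not the code in the patch: the types and names (DemoLsn, DemoStandbyProgress, demo_current_lsn_for_wait_type, demo_wait_is_satisfied) are hypothetical, and the real implementation reads shared state rather than a plain struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t DemoLsn;		/* stand-in for XLogRecPtr */

typedef enum DemoWaitType
{
	DEMO_WAIT_REPLAY,
	DEMO_WAIT_WRITE,
	DEMO_WAIT_FLUSH
} DemoWaitType;

/* Hypothetical snapshot of how far a standby has progressed. */
typedef struct DemoStandbyProgress
{
	DemoLsn		written;		/* received and passed to the OS */
	DemoLsn		flushed;		/* synced to durable storage */
	DemoLsn		replayed;		/* applied by recovery */
} DemoStandbyProgress;

/* Return the position that a waiter of the given type compares against. */
static DemoLsn
demo_current_lsn_for_wait_type(const DemoStandbyProgress *p, DemoWaitType type)
{
	assert(p != NULL);

	switch (type)
	{
		case DEMO_WAIT_REPLAY:
			return p->replayed;
		case DEMO_WAIT_WRITE:
			return p->written;
		case DEMO_WAIT_FLUSH:
			return p->flushed;
	}
	return 0;					/* not reached for valid wait types */
}

/* A wait is satisfied once the relevant position reaches the target. */
static bool
demo_wait_is_satisfied(const DemoStandbyProgress *p, DemoWaitType type,
					   DemoLsn target)
{
	return target <= demo_current_lsn_for_wait_type(p, type);
}
```

On a standby where the written position is ahead of the flushed position, which in turn is ahead of the replayed position, a WRITE waiter for a given target can be released while FLUSH and REPLAY waiters for the same target are still blocked. That is the situation the WRITE and FLUSH modes are meant to expose.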
Here is the updated v2 patch set. Most of the updates are in patch 3.
Changes from v1:
Patch 1 (Extend wait types in xlogwait infra):
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)
Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description
Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed the timing issue in TAP test 8 when waiting for all three
sessions to get their errors after promotion.
--
Best,
Xuneng
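For the v2 change in patch 3 that replaces the direct cast with an explicit switch, the sketch below shows why the switch is the more robust choice: it keeps working when the two enums do not share the same numeric layout. The enums here are hypothetical stand-ins (DemoLsnMode, DemoLsnType, including an invented extra member that shifts the numbering); they are not the definitions of WaitLSNMode and WaitLSNType in the patch.

```c
#include <assert.h>

/* Hypothetical parser-level mode, in the order the grammar lists them. */
typedef enum DemoLsnMode
{
	DEMO_MODE_REPLAY,
	DEMO_MODE_WRITE,
	DEMO_MODE_FLUSH
} DemoLsnMode;

/*
 * Hypothetical wait-infrastructure type.  The extra leading member is
 * invented here only to show the numbering drifting apart.
 */
typedef enum DemoLsnType
{
	DEMO_TYPE_OTHER,
	DEMO_TYPE_REPLAY_STANDBY,
	DEMO_TYPE_WRITE_STANDBY,
	DEMO_TYPE_FLUSH_STANDBY
} DemoLsnType;

/* Explicit mapping: correct regardless of the enums' numeric values. */
static DemoLsnType
demo_mode_to_type(DemoLsnMode mode)
{
	switch (mode)
	{
		case DEMO_MODE_REPLAY:
			return DEMO_TYPE_REPLAY_STANDBY;
		case DEMO_MODE_WRITE:
			return DEMO_TYPE_WRITE_STANDBY;
		case DEMO_MODE_FLUSH:
			return DEMO_TYPE_FLUSH_STANDBY;
	}

	/* Unknown mode: fail loudly in assert-enabled builds. */
	assert(0 && "unexpected mode value");
	return DEMO_TYPE_REPLAY_STANDBY;
}
```

In this model a plain cast such as (DemoLsnType) DEMO_MODE_WRITE would yield DEMO_TYPE_REPLAY_STANDBY, silently selecting the wrong wait type, whereas the switch makes the mapping visible and lets the compiler warn when a new enum member is not handled.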
Here is the updated v3. The changes are made to patch 3:
- Refactor duplicated TAP test code by extracting helper routines for
starting and stopping walreceiver.
- Increase the number of concurrent WRITE and FLUSH waiters in tests
7b and 7c from three to five, matching the number in test 7a.
--
Best,
Xuneng
Attachments:
v3-0005-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 48f072498a128eb47f616e8c7e2621eb1ff2d831 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:20:18 +0800
Subject: [PATCH v3 5/5] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.
The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..b2a4e2e2253 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+ chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
- chomp($isrecovery);
if ($isrecovery eq 't')
{
$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,35 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+ # This is more efficient than polling pg_stat_replication
+ if (($mode ne 'sent') && ($isrecovery eq 't'))
+ {
+ my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+ # Map mode names to WAIT FOR LSN MODE values (uppercase)
+ my $wait_mode = uc($mode);
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+ chomp($output);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+${output});
+ diag qq(Last pg_stat_replication contents:
+${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
+ # Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.51.0
v3-0002-Add-pg_last_wal_write_lsn-SQL-function.patch (application/octet-stream)
From 9d22e09d378e8f6c52aa95bc4a0e1650f4621a39 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:07:52 +0800
Subject: [PATCH v3 2/5] Add pg_last_wal_write_lsn() SQL function
Returns the current WAL write position on a standby server using
GetWalRcvWriteRecPtr(). This enables verification of WAIT FOR LSN MODE WRITE
and operational monitoring of standby WAL write progress.
---
doc/src/sgml/func/func-admin.sgml | 22 ++++++++++++++++++++++
src/backend/access/transam/xlogfuncs.c | 20 ++++++++++++++++++++
src/include/catalog/pg_proc.dat | 4 ++++
3 files changed, 46 insertions(+)
diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index 1b465bc8ba7..9ff196c4be4 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -688,6 +688,28 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
</para></entry>
</row>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ <indexterm>
+ <primary>pg_last_wal_write_lsn</primary>
+ </indexterm>
+ <function>pg_last_wal_write_lsn</function> ()
+ <returnvalue>pg_lsn</returnvalue>
+ </para>
+ <para>
+ Returns the last write-ahead log location that has been received and
+ passed to the operating system by streaming replication, but not
+ necessarily synced to durable storage. This is faster than
+ <function>pg_last_wal_receive_lsn</function> but provides weaker
+ durability guarantees since the data may still be in OS buffers.
+ While streaming replication is in progress this will increase
+ monotonically. If recovery has completed then this will remain static
+ at the location of the last WAL record written during recovery. If
+ streaming replication is disabled, or if it has not yet started, the
+ function returns <literal>NULL</literal>.
+ </para></entry>
+ </row>
+
<row>
<entry role="func_table_entry"><para role="func_signature">
<indexterm>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e45fce43ed..2797b2bf158 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -347,6 +347,26 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
PG_RETURN_LSN(recptr);
}
+/*
+ * Report the last WAL write location (same format as pg_backup_start etc)
+ *
+ * This is useful for determining how much of WAL has been received and
+ * passed to the operating system by walreceiver. Unlike pg_last_wal_receive_lsn,
+ * this data may still be in OS buffers and not yet synced to durable storage.
+ */
+Datum
+pg_last_wal_write_lsn(PG_FUNCTION_ARGS)
+{
+ XLogRecPtr recptr;
+
+ recptr = GetWalRcvWriteRecPtr();
+
+ if (!XLogRecPtrIsValid(recptr))
+ PG_RETURN_NULL();
+
+ PG_RETURN_LSN(recptr);
+}
+
/*
* Report the last WAL replay location (same format as pg_backup_start etc)
*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66431940700..478e0a8139f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6782,6 +6782,10 @@
proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
prorettype => 'pg_lsn', proargtypes => '',
prosrc => 'pg_last_wal_receive_lsn' },
+{ oid => '6434', descr => 'last wal write location on standby',
+ proname => 'pg_last_wal_write_lsn', provolatile => 'v',
+ prorettype => 'pg_lsn', proargtypes => '',
+ prosrc => 'pg_last_wal_write_lsn' },
{ oid => '3821', descr => 'last wal replay location',
proname => 'pg_last_wal_replay_lsn', provolatile => 'v',
prorettype => 'pg_lsn', proargtypes => '',
--
2.51.0
v3-0003-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From d51394bdfdf16e0d569a0e5843288c1a36b671a5 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:11:54 +0800
Subject: [PATCH v3 3/5] Add MODE parameter to WAIT FOR LSN command
Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN
The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by LOCK command where
the mode parameter is optional with a sensible default.
The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.
Also includes:
- Documentation updates for the new syntax and refactoring
of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
doc/src/sgml/ref/wait_for.sgml | 188 +++++++++++----
src/backend/access/transam/xlog.c | 6 +-
src/backend/commands/wait.c | 64 +++++-
src/backend/parser/gram.y | 21 +-
src/backend/replication/walreceiver.c | 19 ++
src/include/nodes/parsenodes.h | 11 +
src/include/parser/kwlist.h | 2 +
src/test/recovery/t/049_wait_for_lsn.pl | 294 ++++++++++++++++++++++--
8 files changed, 518 insertions(+), 87 deletions(-)
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..a5e7f6c6fe9 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
<refnamediv>
<refname>WAIT FOR</refname>
- <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
</refnamediv>
<refsynopsisdiv>
<synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+ [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Description</title>
<para>
- Waits until recovery replays <parameter>lsn</parameter>.
- If no <parameter>timeout</parameter> is specified or it is set to
- zero, this command waits indefinitely for the
- <parameter>lsn</parameter>.
- On timeout, or if the server is promoted before
- <parameter>lsn</parameter> is reached, an error is emitted,
- unless <literal>NO_THROW</literal> is specified in the WITH clause.
- If <parameter>NO_THROW</parameter> is specified, then the command
- doesn't throw errors.
+ Waits until the specified <parameter>lsn</parameter> is reached
+ according to the specified <parameter>mode</parameter>,
+ which determines whether to wait for WAL to be written, flushed, or replayed.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
</para>
<para>
- The possible return values are <literal>success</literal>,
- <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
</para>
</refsect1>
@@ -64,6 +67,57 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>MODE</literal></term>
+ <listitem>
+ <para>
+ Specifies the type of LSN processing to wait for. If not specified,
+ the default is <literal>REPLAY</literal>. The valid modes are:
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>REPLAY</literal></term>
+ <listitem>
+ <para>
+ Wait for the LSN to be replayed (applied to the database).
+ After successful completion, <function>pg_last_wal_replay_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>FLUSH</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and synced to durable storage via <function>fsync()</function>.
+ This provides a durability guarantee without waiting for the WAL
+ to be applied. After successful completion,
+ <function>pg_last_wal_receive_lsn()</function> will return a value
+ greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WRITE</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and passed to the operating system via <function>write()</function>.
+ This is faster than <literal>FLUSH</literal> but provides weaker
+ durability guarantees since the data may still be in OS buffers.
+ After successful completion, <function>pg_last_wal_write_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
<listitem>
@@ -135,9 +189,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<listitem>
<para>
This return value denotes that the database server is not in a recovery
- state. This might mean either the database server was not in recovery
- at the moment of receiving the command, or it was promoted before
- reaching the target <parameter>lsn</parameter>.
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command (i.e., executed on a primary),
+ or it was promoted before reaching the target <parameter>lsn</parameter>.
+ In the promotion case, this status indicates a timeline change occurred,
+ and the application should re-evaluate whether the target LSN is still
+ relevant.
</para>
</listitem>
</varlistentry>
@@ -148,25 +205,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Notes</title>
<para>
- <command>WAIT FOR</command> command waits till
- <parameter>lsn</parameter> to be replayed on standby.
- That is, after this command execution, the value returned by
- <function>pg_last_wal_replay_lsn</function> should be greater or equal
- to the <parameter>lsn</parameter> value. This is useful to achieve
- read-your-writes-consistency, while using async replica for reads and
- primary for writes. In that case, the <acronym>lsn</acronym> of the last
- modification should be stored on the client application side or the
- connection pooler side.
+ <command>WAIT FOR</command> waits until the specified
+ <parameter>lsn</parameter> is reached according to the specified
+ <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+ for the LSN to be replayed (applied to the database), which is useful
+ to achieve read-your-writes consistency while using an async replica
+ for reads and the primary for writes. The <literal>FLUSH</literal> mode
+ waits for the WAL to be flushed to durable storage on the replica,
+ providing a durability guarantee without waiting for replay. The
+ <literal>WRITE</literal> mode waits for the WAL to be written to the
+ operating system, which is faster than flush but provides weaker
+ durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or
+ the connection pooler side.
</para>
<para>
- <command>WAIT FOR</command> command should be called on standby.
- If a user runs <command>WAIT FOR</command> on primary, it
- will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
- However, if <command>WAIT FOR</command> is
- called on primary promoted from standby and <literal>lsn</literal>
- was already replayed, then the <command>WAIT FOR</command> command just
- exits immediately.
+ <command>WAIT FOR</command> should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified
+ in the WITH clause. However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already reached, then the <command>WAIT FOR</command> command
+ just exits immediately. If the replica is promoted while waiting,
+ the command throws an error, or returns <literal>not in recovery</literal>
+ if <literal>NO_THROW</literal> is specified. Promotion
+ creates a new timeline, and the LSN being waited for may refer to
+ WAL from the old timeline.
</para>
</refsect1>
@@ -175,21 +240,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Examples</title>
<para>
- You can use <command>WAIT FOR</command> command to wait for
- the <type>pg_lsn</type> value. For example, an application could update
- the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
- changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
- on primary server to get the <acronym>lsn</acronym> given that
- <varname>synchronous_commit</varname> could be set to
- <literal>off</literal>.
+ You can use <command>WAIT FOR</command> command to wait for
+ the <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
<programlisting>
postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
UPDATE 100
postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
(1 row)
</programlisting>
@@ -198,9 +263,9 @@ pg_current_wal_insert_lsn
changes made on primary should be guaranteed to be visible on replica.
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
status
---------
+---------
success
(1 row)
postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +276,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
</para>
<para>
- If the target LSN is not reached before the timeout, the error is thrown.
+ Wait for flush (data durable on replica):
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
</programlisting>
</para>
<para>
The same example uses <command>WAIT FOR</command> with
- <parameter>NO_THROW</parameter> option.
+ <parameter>NO_THROW</parameter> option:
+
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
status
---------
+---------
timeout
(1 row)
</programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b145515269..5b2a262ff8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6240,10 +6240,12 @@ StartupXLOG(void)
LWLockRelease(ControlFileLock);
/*
- * Wake up all waiters for replay LSN. They need to report an error that
- * recovery was ended before reaching the target LSN.
+ * Wake up all waiters. They need to report an error that recovery was
+ * ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 43b37095afb..05ad84fdb5b 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
*
* wait.c
* Implements WAIT FOR, which allows waiting for events such as
- * time passing or LSN having been replayed on replica.
+ * time passing or LSN having been replayed, flushed, or written.
*
* Portions Copyright (c) 2025, PostgreSQL Global Development Group
*
@@ -15,6 +15,7 @@
#include <math.h>
+#include "access/xlog.h"
#include "access/xlogrecovery.h"
#include "access/xlogwait.h"
#include "commands/defrem.h"
@@ -28,12 +29,28 @@
#include "utils/snapmgr.h"
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+ const char *noun; /* "replay", "flush", "write" */
+ const char *verb; /* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+ [WAIT_LSN_TYPE_REPLAY_STANDBY] = {"replay", "replayed"},
+ [WAIT_LSN_TYPE_WRITE_STANDBY] = {"write", "written"},
+ [WAIT_LSN_TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
+};
+
void
ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
{
XLogRecPtr lsn;
int64 timeout = 0;
WaitLSNResult waitLSNResult;
+ WaitLSNType lsnType;
bool throw = true;
TupleDesc tupdesc;
TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
CStringGetDatum(stmt->lsn_literal)));
+ /* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+ switch (stmt->mode)
+ {
+ case WAIT_LSN_MODE_REPLAY:
+ lsnType = WAIT_LSN_TYPE_REPLAY_STANDBY;
+ break;
+ case WAIT_LSN_MODE_WRITE:
+ lsnType = WAIT_LSN_TYPE_WRITE_STANDBY;
+ break;
+ case WAIT_LSN_MODE_FLUSH:
+ lsnType = WAIT_LSN_TYPE_FLUSH_STANDBY;
+ break;
+ default:
+ elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+ }
+
foreach_node(DefElem, defel, stmt->options)
{
if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
}
/*
- * We are going to wait for the LSN replay. We should first care that we
- * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * We are going to wait for the LSN. We should first care that we don't
+ * hold a snapshot and correspondingly our MyProc->xmin is invalid.
* Otherwise, our snapshot could prevent the replay of WAL records
* implying a kind of self-deadlock. This is the reason why WAIT FOR is a
* command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
+ waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_TIMEOUT:
if (throw)
+ {
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
- errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
+ }
else
result = "timeout";
break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errhint("Waiting for the replay LSN can only be executed during recovery."));
+ errhint("Waiting for the %s LSN can only be executed during recovery.",
+ desc->noun));
}
else
result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c3a0a354a9c..57b3e91893c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -640,6 +640,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <windef> window_definition over_clause window_specification
opt_frame_clause frame_extent frame_bound
%type <ival> null_treatment opt_window_exclusion_clause
+%type <ival> opt_wait_lsn_mode
%type <str> opt_existing_window_name
%type <boolean> opt_if_not_exists
%type <boolean> opt_unique_null_treatment
@@ -729,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
EXPRESSION EXTENSION EXTERNAL EXTRACT
- FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+ FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -770,7 +771,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -16489,15 +16490,23 @@ xml_passing_mech:
*****************************************************************************/
WaitStmt:
- WAIT FOR LSN_P Sconst opt_wait_with_clause
+ WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
{
WaitStmt *n = makeNode(WaitStmt);
n->lsn_literal = $4;
- n->options = $5;
+ n->mode = $5;
+ n->options = $6;
$$ = (Node *) n;
}
;
+opt_wait_lsn_mode:
+ MODE REPLAY { $$ = WAIT_LSN_MODE_REPLAY; }
+ | MODE FLUSH { $$ = WAIT_LSN_MODE_FLUSH; }
+ | MODE WRITE { $$ = WAIT_LSN_MODE_WRITE; }
+ | /*EMPTY*/ { $$ = WAIT_LSN_MODE_REPLAY; }
+ ;
+
opt_wait_with_clause:
WITH '(' utility_option_list ')' { $$ = $3; }
| /*EMPTY*/ { $$ = NIL; }
@@ -17937,6 +17946,7 @@ unreserved_keyword:
| FILTER
| FINALIZE
| FIRST_P
+ | FLUSH
| FOLLOWING
| FORCE
| FORMAT
@@ -18071,6 +18081,7 @@ unreserved_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESPECT_P
@@ -18524,6 +18535,7 @@ bare_label_keyword:
| FINALIZE
| FIRST_P
| FLOAT_P
+ | FLUSH
| FOLLOWING
| FORCE
| FOREIGN
@@ -18706,6 +18718,7 @@ bare_label_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..be2971408e7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "catalog/pg_authid.h"
#include "funcapi.h"
#include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
/* Update shared-memory status */
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ /*
+ * If we wrote an LSN that someone was waiting for then walk over the
+ * shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Write >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_WRITE_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, LogstreamResult.Write);
+
/*
* Close the current segment if it's fully written up in the last cycle of
* the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
}
SpinLockRelease(&walrcv->mutex);
+ /*
+ * If we flushed an LSN that someone was waiting for then walk over
+ * the shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Flush >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, LogstreamResult.Flush);
+
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d14294a4ece..bbaf3242ccb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,10 +4385,21 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+ WAIT_LSN_MODE_REPLAY, /* Wait for LSN replay on standby */
+ WAIT_LSN_MODE_WRITE, /* Wait for LSN write on standby */
+ WAIT_LSN_MODE_FLUSH /* Wait for LSN flush on standby */
+} WaitLSNMode;
+
typedef struct WaitStmt
{
NodeTag type;
char *lsn_literal; /* LSN string from grammar */
+ WaitLSNMode mode; /* Wait mode: REPLAY/FLUSH/WRITE */
List *options; /* List of DefElem nodes */
} WaitStmt;
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5d4fe27ef96..7ad8b11b725 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -378,6 +379,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..ee3f2bf30d6 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
# the WAIT FOR command.
use strict;
use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+ my ($node) = @_;
+ $saved_primary_conninfo = $node->safe_psql('postgres',
+ "SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+ my ($node) = @_;
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
# Initialize primary node
my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
ok((split("\n", $output))[-1] eq 30,
"standby reached the same LSN as primary");
-# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout. The
# unreachable LSN must be well in advance. So WAL records issued by
# the concurrent autovacuum could not affect that.
my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
# within another function, or inside a transaction with an isolation level
# higher than READ COMMITTED.
@@ -125,7 +184,7 @@ ok( $stderr =~
/WAIT FOR must be only called without an active or registered snapshot/,
"get an error when running within another function");
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
my $test_lsn =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
@@ -208,7 +267,7 @@ $node_standby->psql(
ok( $stderr =~ /option "invalid_option" not recognized/,
"get error for invalid WITH clause option");
-# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters. We make 5 background
# psql sessions each waiting for a corresponding insertion. When waiting is
# finished, stored procedures logs if there are visible as many rows as
# should be.
@@ -226,7 +285,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
\$\$
LANGUAGE plpgsql;
]);
+
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
my @psql_sessions;
for (my $i = 0; $i < 5; $i++)
{
@@ -239,10 +300,11 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->query_until(
qr/start/, qq[
\\echo start
- WAIT FOR LSN '${lsn}';
+ WAIT FOR LSN '${lsn}' MODE REPLAY;
SELECT log_count(${i});
]);
}
+
my $log_offset = -s $node_standby->logfile;
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
for (my $i = 0; $i < 5; $i++)
@@ -251,23 +313,199 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->quit;
}
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (100 + ${i});");
+ $write_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $write_sessions[$i] = $node_standby->background_psql('postgres');
+ $write_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("write_done $i", $write_log_offset);
+ $write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '$write_lsns[4]'::pg_lsn);");
+
+ok($output >= 0,
+ "multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (200 + ${i});");
+ $flush_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $flush_sessions[$i] = $node_standby->background_psql('postgres');
+ $flush_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+ $flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently. We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN. We stop the walreceiver and pause replay to
+# ensure all waiters block. Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+ $mixed_sessions[$i] = $node_standby->background_psql('postgres');
+ $mixed_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
-# 7. Check that the standby promotion terminates the wait on LSN. Start
-# waiting for an unreachable LSN then promote. Check the log for the relevant
-# error message. Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+ $node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+ $mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ SELECT pg_lsn_cmp(pg_last_wal_write_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+ "mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes. Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote. Check the log for the relevant error messages. Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
my $lsn4 =
$node_primary->safe_psql('postgres',
"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
my $lsn5 =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
- qr/start/, qq[
- \\echo start
- WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i] = $node_standby->background_psql('postgres');
+ $wait_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+ ]);
+}
# Make sure standby will be promoted at least at the primary insert LSN we
# have just observed. Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +515,24 @@ $node_primary->wait_for_catchup($node_standby);
$log_offset = -s $node_standby->logfile;
$node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has a distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+ $log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+ $log_offset);
+$node_standby->wait_for_log(
+ qr/Recovery ended before target LSN.*was replayed/, $log_offset);
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
ok(1, 'wait for already replayed LSN exits immediately even after promotion');
$output = $node_standby->safe_psql(
'postgres', qq[
- WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
ok($output eq "not in recovery",
"WAIT FOR returns correct status after standby promotion");
@@ -295,8 +540,11 @@ ok($output eq "not in recovery",
$node_standby->stop;
$node_primary->stop;
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
# already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i]->{run}->finish;
+}
done_testing();
--
2.51.0
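
For anyone reading the wait.c hunk above, the table-driven message can be tried outside the server. What follows is only a minimal standalone sketch, not PostgreSQL code: DemoWaitType, demo_descs and format_timeout_message are hypothetical stand-ins invented for illustration, and the LSN is a plain uint64_t. Only the noun/verb pairs and the message format string are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the patch's WaitLSNType; names and values are assumed. */
typedef enum
{
	DEMO_WAIT_REPLAY,
	DEMO_WAIT_WRITE,
	DEMO_WAIT_FLUSH
} DemoWaitType;

/* Same shape as WaitLSNTypeDesc in the patch: one noun and one verb. */
typedef struct
{
	const char *noun;			/* "replay", "flush", "write" */
	const char *verb;			/* "replayed", "flushed", "written" */
} DemoWaitDesc;

static const DemoWaitDesc demo_descs[] = {
	[DEMO_WAIT_REPLAY] = {"replay", "replayed"},
	[DEMO_WAIT_WRITE] = {"write", "written"},
	[DEMO_WAIT_FLUSH] = {"flush", "flushed"},
};

/*
 * Build the timeout message for one wait type.  The two halves of each
 * LSN are printed as %X/%08X, as in the patched errmsg() call.
 * Returns what snprintf() returns.
 */
static int
format_timeout_message(char *buf, size_t len, DemoWaitType type,
					   uint64_t target, uint64_t current)
{
	const DemoWaitDesc *desc;

	assert((size_t) type < sizeof(demo_descs) / sizeof(demo_descs[0]));
	desc = &demo_descs[type];

	return snprintf(buf, len,
					"timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
					(unsigned int) (target >> 32), (unsigned int) target,
					desc->verb, desc->noun,
					(unsigned int) (current >> 32), (unsigned int) current);
}
```

Calling format_timeout_message() with DEMO_WAIT_FLUSH selects the "flushed"/"flush" pair, so one format string covers all three modes. Note that %08X zero-pads the low half of the LSN, so 0x306EE20 is printed as 0/0306EE20.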
Attachment: v3-0004-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From 51624191461fe702522c315d9da7a68da48a4b13 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:18:06 +0800
Subject: [PATCH v3 4/5] Add tab completion for WAIT FOR LSN MODE parameter
Update psql tab completion to support the optional MODE parameter in
the WAIT FOR LSN command. After an LSN value, completion now offers
both MODE and WITH, because MODE is optional and defaults to REPLAY.
---
src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 11 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 20d7a65c614..fcb9f19faef 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5313,10 +5313,11 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
/*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
* where option can be:
* TIMEOUT '<timeout>'
* NO_THROW
+ * MODE defaults to REPLAY if not specified.
*/
else if (Matches("WAIT"))
COMPLETE_WITH("FOR");
@@ -5325,25 +5326,41 @@ match_previous_words(int pattern_id,
else if (Matches("WAIT", "FOR", "LSN"))
/* No completion for LSN value - user must provide manually */
;
+
+ /*
+ * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+ * REPLAY
+ */
else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+ COMPLETE_WITH("MODE", "WITH");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+ COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
COMPLETE_WITH("WITH");
+ /* WITH directly after LSN (using default REPLAY mode) */
else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
COMPLETE_WITH("(");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+ COMPLETE_WITH("(");
+
+ /*
+ * Handle parenthesized option list (both with and without explicit MODE).
+ * This fires when we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as one
+ * word, so the above test is correct. timeout takes a string value,
+ * no_throw takes no value. We don't offer completions for these values.
+ */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
- /*
- * This fires if we're in an unfinished parenthesized option list.
- * get_previous_words treats a completed parenthesized option list as
- * one word, so the above test is correct.
- */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("timeout", "no_throw");
-
- /*
- * timeout takes a string value, no_throw takes no value. We don't
- * offer completions for these values.
- */
+ }
+ else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+ !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+ {
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("timeout", "no_throw");
}
/* WITH [RECURSIVE] */
--
2.51.0
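The completion rules in this patch amount to a small decision table over the words typed so far. Below is a minimal, self-contained sketch of that table in plain C. It is not psql's code: `wait_for_lsn_completions` is a hypothetical helper, it compares case-sensitively (psql's `Matches` does not), it assumes `LSN` is offered after `WAIT FOR` (that line is outside the hunk shown), and it leaves out the parenthesized option list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: given the words typed so far, return the candidates
 * as one space-separated string, or "" when nothing is offered.
 */
const char *
wait_for_lsn_completions(const char *const *words, size_t nwords)
{
    static const char *const prefix[] = {"WAIT", "FOR", "LSN"};
    size_t      i;

    assert(words != NULL || nwords == 0);

    /* Every typed word of the fixed prefix must match. */
    for (i = 0; i < nwords && i < 3; i++)
        if (strcmp(words[i], prefix[i]) != 0)
            return "";

    switch (nwords)
    {
        case 0:
            return "";
        case 1:
            return "FOR";
        case 2:
            return "LSN";
        case 3:
            return "";          /* the LSN value is typed by hand */
        case 4:
            return "MODE WITH"; /* words[3] is the LSN value */
        default:
            break;
    }

    if (strcmp(words[4], "MODE") == 0)
    {
        if (nwords == 5)
            return "REPLAY FLUSH WRITE";
        if (nwords == 6)
            return "WITH";      /* words[5] is the chosen mode */
        if (nwords == 7 && strcmp(words[6], "WITH") == 0)
            return "(";
        return "";
    }

    /* WITH directly after the LSN value (default REPLAY mode). */
    if (nwords == 5 && strcmp(words[4], "WITH") == 0)
        return "(";
    return "";
}
```

In this model both paths, with and without an explicit MODE, end at the same `(` state, which is why the patch needs two otherwise identical `HeadMatches` branches for the option list.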
v3-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From da210bfc2b62d9a38ea54b94037380144753663a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 17:58:28 +0800
Subject: [PATCH v3 1/5] Extend xlogwait infrastructure with write and flush
wait types
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.
Key changes:
- Add WAIT_LSN_TYPE_WRITE_STANDBY and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
src/backend/access/transam/xlog.c | 2 +-
src/backend/access/transam/xlogrecovery.c | 4 +-
src/backend/access/transam/xlogwait.c | 84 ++++++++++++++-----
src/backend/commands/wait.c | 2 +-
.../utils/activity/wait_event_names.txt | 3 +-
src/include/access/xlogwait.h | 13 ++-
6 files changed, 81 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..4b145515269 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6243,7 +6243,7 @@ StartupXLOG(void)
* Wake up all waiters for replay LSN. They need to report an error that
* recovery was ended before reaching the target LSN.
*/
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..243c0b368a9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
*/
if (waitLSNState &&
(XLogRecoveryCtl->lastReplayedEndRecPtr >=
- pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, XLogRecoveryCtl->lastReplayedEndRecPtr);
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 98aa5f1e4a2..21823acee9c 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
* This file implements waiting for WAL operations to reach specific LSNs
* on both physical standby and primary servers. The core idea is simple:
* every process that wants to wait publishes the LSN it needs to the
- * shared memory, and the appropriate process (startup on standby, or
- * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ * shared memory, and the appropriate process (startup on standby,
+ * walreceiver on standby, or WAL writer/backend on primary) wakes it
+ * once that LSN has been reached.
*
* The shared memory used by this module comprises a procInfos
* per-backend array with the information of the awaited LSN for each
* of the backend processes. The elements of that array are organized
- * into a pairing heap waitersHeap, which allows for very fast finding
- * of the least awaited LSN.
+ * into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ * allows for very fast finding of the least awaited LSN for each type.
*
- * In addition, the least-awaited LSN is cached as minWaitedLSN. The
- * waiter process publishes information about itself to the shared
- * memory and waits on the latch until it is woken up by the appropriate
- * process, standby is promoted, or the postmaster dies. Then, it cleans
- * information about itself in the shared memory.
+ * In addition, the least-awaited LSN for each type is cached in the
+ * minWaitedLSN array. The waiter process publishes information about
+ * itself to the shared memory and waits on the latch until it is woken
+ * up by the appropriate process, standby is promoted, or the postmaster
+ * dies. Then, it cleans information about itself in the shared memory.
*
- * On standby servers: After replaying a WAL record, the startup process
- * first performs a fast path check minWaitedLSN > replayLSN. If this
- * check is negative, it checks waitersHeap and wakes up the backend
- * whose awaited LSNs are reached.
+ * On standby servers:
+ * - After replaying a WAL record, the startup process performs a fast
+ * path check minWaitedLSN[REPLAY] > replayLSN. If this check is
+ * negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ * whose awaited LSNs are reached.
+ * - After receiving WAL, the walreceiver process performs similar checks
+ * against the flush and write LSNs, waking up waiters in the FLUSH
+ * and WRITE heaps respectively.
*
* On primary servers: After flushing WAL, the WAL writer or backend
* process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -62,6 +68,48 @@ static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
struct WaitLSNState *waitLSNState = NULL;
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+ [WAIT_LSN_TYPE_REPLAY_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+ [WAIT_LSN_TYPE_WRITE_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+ [WAIT_LSN_TYPE_FLUSH_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+ [WAIT_LSN_TYPE_FLUSH_PRIMARY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+ "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+ switch (lsnType)
+ {
+ case WAIT_LSN_TYPE_REPLAY_STANDBY:
+ return GetXLogReplayRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_WRITE_STANDBY:
+ return GetWalRcvWriteRecPtr();
+
+ case WAIT_LSN_TYPE_FLUSH_STANDBY:
+ return GetWalRcvFlushRecPtr(NULL, NULL);
+
+ case WAIT_LSN_TYPE_FLUSH_PRIMARY:
+ return GetFlushRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_COUNT:
+ break;
+ }
+
+ elog(ERROR, "invalid LSN wait type: %d", lsnType);
+ pg_unreachable();
+}
+
/* Report the amount of shared memory space needed for WaitLSNState. */
Size
WaitLSNShmemSize(void)
@@ -341,13 +389,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
int rc;
long delay_ms = -1;
- if (lsnType == WAIT_LSN_TYPE_REPLAY)
- currentLSN = GetXLogReplayRecPtr(NULL);
- else
- currentLSN = GetFlushRecPtr(NULL);
+ /* Get current LSN for the wait type */
+ currentLSN = GetCurrentLSNForWaitType(lsnType);
/* Check that recovery is still in-progress */
- if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+ if (lsnType != WAIT_LSN_TYPE_FLUSH_PRIMARY && !RecoveryInProgress())
{
/*
* Recovery was ended, but check if target LSN was already
@@ -376,7 +422,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
CHECK_FOR_INTERRUPTS();
rc = WaitLatch(MyLatch, wake_events, delay_ms,
- (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+ WaitLSNWaitEvents[lsnType]);
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..43b37095afb 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..fbcdb92dcfb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary or standby."
WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE "Waiting for WAL write to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index e607441d618..9721a7a7195 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,9 +35,15 @@ typedef enum
*/
typedef enum WaitLSNType
{
- WAIT_LSN_TYPE_REPLAY = 0, /* Waiting for replay on standby */
- WAIT_LSN_TYPE_FLUSH = 1, /* Waiting for flush on primary */
- WAIT_LSN_TYPE_COUNT = 2
+ /* Standby wait types (walreceiver/startup wakes) */
+ WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
+ WAIT_LSN_TYPE_WRITE_STANDBY = 1,
+ WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
+
+ /* Primary wait types (WAL writer/backends wake) */
+ WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
+
+ WAIT_LSN_TYPE_COUNT = 4
} WaitLSNType;
/*
@@ -96,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
extern Size WaitLSNShmemSize(void);
extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
--
2.51.0
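The patch above replaces a two-way conditional with a table indexed by WaitLSNType and guards the table with a compile-time length check. The following standalone sketch shows that pattern outside the PostgreSQL tree. The enum values are copied from the patch; everything else is an assumption made for illustration: strings stand in for the uint32 wait-event codes, and `wait_event_name_for`, `wait_types_share_event` and `wait_type_requires_recovery` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Enum values as introduced by the patch. */
typedef enum WaitLSNType
{
    WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
    WAIT_LSN_TYPE_WRITE_STANDBY = 1,
    WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
    WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
    WAIT_LSN_TYPE_COUNT = 4
} WaitLSNType;

#define lengthof(a) (sizeof(a) / sizeof((a)[0]))

/* Stand-in for the wait-event codes: plain strings for readability. */
static const char *const wait_event_names[] = {
    [WAIT_LSN_TYPE_REPLAY_STANDBY] = "WAIT_FOR_WAL_REPLAY",
    [WAIT_LSN_TYPE_WRITE_STANDBY] = "WAIT_FOR_WAL_WRITE",
    [WAIT_LSN_TYPE_FLUSH_STANDBY] = "WAIT_FOR_WAL_FLUSH",
    [WAIT_LSN_TYPE_FLUSH_PRIMARY] = "WAIT_FOR_WAL_FLUSH",
};

/* Fails to compile if the enum grows and the table is not extended. */
static_assert(lengthof(wait_event_names) == WAIT_LSN_TYPE_COUNT,
              "wait_event_names must match WaitLSNType enum");

/* Hypothetical helper: table lookup instead of a conditional chain. */
const char *
wait_event_name_for(WaitLSNType t)
{
    assert((int) t >= 0 && (int) t < WAIT_LSN_TYPE_COUNT);
    return wait_event_names[t];
}

/* Hypothetical helper: do two wait types report the same event? */
int
wait_types_share_event(WaitLSNType a, WaitLSNType b)
{
    return strcmp(wait_event_name_for(a), wait_event_name_for(b)) == 0;
}

/* Mirrors the recovery check in WaitForLSN(): only primary flush is exempt. */
int
wait_type_requires_recovery(WaitLSNType t)
{
    return t != WAIT_LSN_TYPE_FLUSH_PRIMARY;
}
```

One limit of the length check is worth keeping in mind: with designated initializers, leaving out an entry in the middle still produces an array of the right length, with a zero (here NULL) slot, so the assertion only catches a missing trailing entry.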
Hi,
On Tue, Dec 2, 2025 at 11:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Mon, Dec 1, 2025 at 12:33 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi hackers,
On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi!
At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of
overkill.
I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution could use it to wait for a certain transaction to be
flushed at the replica (without delaying the commit at the primary).
------
Regards,
Alexander Korotkov
Supabase
Makes sense. I'll play with it and try to prepare a follow-up patch.
--
Best,
Xuneng
In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.
Syntax Option A: Mode in the WITH Clause
WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');
With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.
Syntax Option B: Mode as Part of the Main Command
WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';
Or a more concise variant using keywords:
WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;
This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.
I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.
I've implemented a patch that adds MODE support to WAIT FOR LSN.
The new grammar looks like:
------
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
------
Two modes added: flush and write
Design decisions:
1. MODE as a separate keyword (not in WITH clause) - This follows the
pattern used by LOCK command. It also makes the common case more
concise.
2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.
3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`
The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types
Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.
-------
0002 - Add pg_last_wal_write_lsn() SQL function
Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.
-------
0003 - Add MODE parameter to WAIT FOR LSN command
Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.
-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter
Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.
-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()
Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.
The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.
Feedback welcome.
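The wakeup side described for 0001 (a cached minimum per wait type, checked before the waiter structure is touched) can be pictured with a small single-threaded model. This is only a sketch under simplifying assumptions: a flat array stands in for the pairing heap, there is no shared memory, locking, atomics or latches, and `WaitQueue`, `wait_queue_add` and `wait_queue_wakeup` are names invented here, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WAITERS 8
#define NO_WAITER   UINT64_MAX

/* One queue per wait type; a flat array stands in for the pairing heap. */
typedef struct WaitQueue
{
    uint64_t    target[MAX_WAITERS];   /* awaited LSN per slot */
    int         in_use[MAX_WAITERS];
    uint64_t    min_waited;            /* cached minimum, NO_WAITER if empty */
} WaitQueue;

void
wait_queue_init(WaitQueue *q)
{
    assert(q != NULL);
    for (size_t i = 0; i < MAX_WAITERS; i++)
    {
        q->target[i] = 0;
        q->in_use[i] = 0;
    }
    q->min_waited = NO_WAITER;
}

/* Register a waiter; returns its slot, or -1 if the queue is full. */
int
wait_queue_add(WaitQueue *q, uint64_t lsn)
{
    for (int i = 0; i < MAX_WAITERS; i++)
    {
        if (!q->in_use[i])
        {
            q->in_use[i] = 1;
            q->target[i] = lsn;
            if (lsn < q->min_waited)
                q->min_waited = lsn;
            return i;
        }
    }
    return -1;
}

/*
 * Called by whoever advanced the LSN (replay, write or flush).
 * Returns the number of waiters released.
 */
int
wait_queue_wakeup(WaitQueue *q, uint64_t current_lsn)
{
    int         woken = 0;
    uint64_t    new_min = NO_WAITER;

    /* Fast path: nobody is waiting for an LSN this low. */
    if (q->min_waited > current_lsn)
        return 0;

    for (int i = 0; i < MAX_WAITERS; i++)
    {
        if (!q->in_use[i])
            continue;
        if (q->target[i] <= current_lsn)
        {
            q->in_use[i] = 0;   /* the real code would set the waiter's latch */
            woken++;
        }
        else if (q->target[i] < new_min)
            new_min = q->target[i];
    }
    q->min_waited = new_min;
    return woken;
}
```

With one such queue per mode, a replay advance never scans write or flush waiters, which matches the per-type heaps and per-type minimum described in the patch.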
Here is the updated v2 patch set. Most of the updates are in patch 3.
Changes from v1:
Patch 1 (Extend wait types in xlogwait infra):
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)
Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description
Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed the timing issue in TAP test 8 when waiting for all three
sessions to get their errors after promotion.
--
Best,
Xuneng
Here is the updated v3. The changes are made to patch 3:
- Refactor duplicated TAP test code by extracting helper routines for
starting and stopping walreceiver.
- Increase the number of concurrent WRITE and FLUSH waiters in tests
7b and 7c from three to five, matching the number in test 7a.
--
Best,
Xuneng
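The v2 change list mentions replacing a direct cast with an explicit switch for the mode-to-type conversion. The sketch below shows why that matters when two enums list the same concepts in a different order. The WaitLSNType values are the ones from patch 0001; `SketchWaitMode` and its ordering are assumptions made for this example (the real WaitLSNMode definition is not shown in this part of the thread), as is the helper name `wait_type_for_mode`.

```c
#include <assert.h>

/* Values from patch 0001. */
typedef enum WaitLSNType
{
    WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
    WAIT_LSN_TYPE_WRITE_STANDBY = 1,
    WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
    WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
    WAIT_LSN_TYPE_COUNT = 4
} WaitLSNType;

/* Hypothetical parser-side enum; the ordering is an assumption. */
typedef enum SketchWaitMode
{
    SKETCH_MODE_REPLAY = 0,
    SKETCH_MODE_FLUSH = 1,
    SKETCH_MODE_WRITE = 2
} SketchWaitMode;

/* With this ordering, casting FLUSH would silently select the WRITE type. */
static_assert((int) SKETCH_MODE_FLUSH == (int) WAIT_LSN_TYPE_WRITE_STANDBY,
              "a direct cast would map FLUSH to the WRITE wait type");

/* Explicit mapping; returns -1 for a value outside the enum. */
int
wait_type_for_mode(SketchWaitMode mode)
{
    switch (mode)
    {
        case SKETCH_MODE_REPLAY:
            return WAIT_LSN_TYPE_REPLAY_STANDBY;
        case SKETCH_MODE_FLUSH:
            return WAIT_LSN_TYPE_FLUSH_STANDBY;
        case SKETCH_MODE_WRITE:
            return WAIT_LSN_TYPE_WRITE_STANDBY;
    }
    return -1;
}
```

A switch without a default branch also lets the compiler warn when a new mode is added to the enum but not to the mapping, which a cast cannot do.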
Just realized that patch 2 in prior emails could be dropped for
simplicity. Since the write LSN can be retrieved directly from
pg_stat_wal_receiver, the TAP test in patch 3 does not require a
separate SQL function for this purpose alone.
--
Best,
Xuneng
Attachments:
v4-0004-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From 56044afa03fe5732460c8de28039915133137602 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:20:18 +0800
Subject: [PATCH v4 4/4] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.
The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..b2a4e2e2253 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3328,6 +3328,9 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+ chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3340,9 +3343,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
- chomp($isrecovery);
if ($isrecovery eq 't')
{
$target_lsn = $self->lsn('replay');
@@ -3360,6 +3360,35 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+ # This is more efficient than polling pg_stat_replication
+ if (($mode ne 'sent') && ($isrecovery eq 't'))
+ {
+ my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+ # Map mode names to WAIT FOR LSN MODE values (uppercase)
+ my $wait_mode = uc($mode);
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+ chomp($output);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+${output});
+ diag qq(Last pg_stat_replication contents:
+${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
+ # Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.51.0
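The branch added to wait_for_catchup() makes two decisions: whether the WAIT FOR LSN path applies at all, and what statement to send. The sketch below restates that logic in C as a rough model; it is not a translation of the Perl. `build_wait_for_lsn_query` is a hypothetical function, the timeout is passed in as whole seconds, and no quoting or validation of the LSN string is attempted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 and fills buf with a WAIT FOR LSN statement when the node is in
 * recovery and the mode is replay, write or flush. Returns 0 when the caller
 * should fall back to polling pg_stat_replication ('sent' mode, not in
 * recovery, or the statement does not fit in buf).
 */
int
build_wait_for_lsn_query(char *buf, size_t buflen, const char *target_lsn,
                         const char *mode, int in_recovery, int timeout_secs)
{
    char        upper[16];
    size_t      len;
    int         n;

    assert(buf != NULL && target_lsn != NULL && mode != NULL);

    if (!in_recovery || strcmp(mode, "sent") == 0)
        return 0;

    /* The MODE keyword is the upper-cased catchup mode name. */
    len = strlen(mode);
    if (len >= sizeof(upper))
        return 0;
    for (size_t i = 0; i < len; i++)
        upper[i] = (char) toupper((unsigned char) mode[i]);
    upper[len] = '\0';

    n = snprintf(buf, buflen,
                 "WAIT FOR LSN '%s' MODE %s WITH (timeout '%ds', no_throw);",
                 target_lsn, upper, timeout_secs);
    return n > 0 && (size_t) n < buflen;
}
```

Because the statement carries no_throw, the caller is expected to compare the returned status text with success, as the patch does, instead of relying on an error being raised.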
v4-0002-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From c0748a75838fe9281a15f56976f3059596943fd3 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:11:54 +0800
Subject: [PATCH v4 2/4] Add MODE parameter to WAIT FOR LSN command
Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN
The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by LOCK command where
the mode parameter is optional with a sensible default.
The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.
Also includes:
- Documentation updates for the new syntax and refactoring
of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
doc/src/sgml/ref/wait_for.sgml | 192 ++++++++++++----
src/backend/access/transam/xlog.c | 6 +-
src/backend/commands/wait.c | 64 +++++-
src/backend/parser/gram.y | 21 +-
src/backend/replication/walreceiver.c | 19 ++
src/include/nodes/parsenodes.h | 11 +
src/include/parser/kwlist.h | 2 +
src/test/recovery/t/049_wait_for_lsn.pl | 294 ++++++++++++++++++++++--
8 files changed, 522 insertions(+), 87 deletions(-)
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..28c68678315 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
<refnamediv>
<refname>WAIT FOR</refname>
- <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
</refnamediv>
<refsynopsisdiv>
<synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+ [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Description</title>
<para>
- Waits until recovery replays <parameter>lsn</parameter>.
- If no <parameter>timeout</parameter> is specified or it is set to
- zero, this command waits indefinitely for the
- <parameter>lsn</parameter>.
- On timeout, or if the server is promoted before
- <parameter>lsn</parameter> is reached, an error is emitted,
- unless <literal>NO_THROW</literal> is specified in the WITH clause.
- If <parameter>NO_THROW</parameter> is specified, then the command
- doesn't throw errors.
+ Waits until the specified <parameter>lsn</parameter> is reached
+ according to the specified <parameter>mode</parameter>,
+ which determines whether to wait for WAL to be written, flushed, or replayed.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
</para>
<para>
- The possible return values are <literal>success</literal>,
- <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
</para>
</refsect1>
@@ -64,6 +67,61 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>MODE</literal></term>
+ <listitem>
+ <para>
+ Specifies the type of LSN processing to wait for. If not specified,
+ the default is <literal>REPLAY</literal>. The valid modes are:
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>REPLAY</literal></term>
+ <listitem>
+ <para>
+ Wait for the LSN to be replayed (applied to the database).
+ After successful completion, <function>pg_last_wal_replay_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>FLUSH</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and flushed to disk. This provides a durability guarantee
+ without waiting for the WAL to be applied. After successful
+ completion, <function>pg_last_wal_receive_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ This value is also available as the <structfield>flushed_lsn</structfield>
+ column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WRITE</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and written to disk, but not yet flushed. This is faster
+ than <literal>FLUSH</literal> but provides weaker durability
+ guarantees since the data may still be in operating system buffers.
+ After successful completion, the <structfield>written_lsn</structfield>
+ column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link> will show
+ a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
<listitem>
@@ -135,9 +193,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<listitem>
<para>
This return value denotes that the database server is not in a recovery
- state. This might mean either the database server was not in recovery
- at the moment of receiving the command, or it was promoted before
- reaching the target <parameter>lsn</parameter>.
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command (i.e., executed on a primary),
+ or it was promoted before reaching the target <parameter>lsn</parameter>.
+ In the promotion case, this status indicates a timeline change occurred,
+ and the application should re-evaluate whether the target LSN is still
+ relevant.
</para>
</listitem>
</varlistentry>
@@ -148,25 +209,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Notes</title>
<para>
- <command>WAIT FOR</command> command waits till
- <parameter>lsn</parameter> to be replayed on standby.
- That is, after this command execution, the value returned by
- <function>pg_last_wal_replay_lsn</function> should be greater or equal
- to the <parameter>lsn</parameter> value. This is useful to achieve
- read-your-writes-consistency, while using async replica for reads and
- primary for writes. In that case, the <acronym>lsn</acronym> of the last
- modification should be stored on the client application side or the
- connection pooler side.
+ <command>WAIT FOR</command> waits until the specified
+ <parameter>lsn</parameter> is reached according to the specified
+ <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+ for the LSN to be replayed (applied to the database), which is useful
+ to achieve read-your-writes consistency while using an async replica
+ for reads and the primary for writes. The <literal>FLUSH</literal> mode
+ waits for the WAL to be flushed to durable storage on the replica,
+ providing a durability guarantee without waiting for replay. The
+ <literal>WRITE</literal> mode waits for the WAL to be written to the
+ operating system, which is faster than flush but provides weaker
+ durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or
+ the connection pooler side.
</para>
<para>
- <command>WAIT FOR</command> command should be called on standby.
- If a user runs <command>WAIT FOR</command> on primary, it
- will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
- However, if <command>WAIT FOR</command> is
- called on primary promoted from standby and <literal>lsn</literal>
- was already replayed, then the <command>WAIT FOR</command> command just
- exits immediately.
+ <command>WAIT FOR</command> should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified
+ in the WITH clause. However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already reached, then the <command>WAIT FOR</command> command
+ just exits immediately. If the replica is promoted while waiting,
+ the command will return <literal>not in recovery</literal> (or throw
+ an error if <literal>NO_THROW</literal> is not specified). Promotion
+ creates a new timeline, and the LSN being waited for may refer to
+ WAL from the old timeline.
</para>
</refsect1>
@@ -175,21 +244,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Examples</title>
<para>
- You can use <command>WAIT FOR</command> command to wait for
- the <type>pg_lsn</type> value. For example, an application could update
- the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
- changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
- on primary server to get the <acronym>lsn</acronym> given that
- <varname>synchronous_commit</varname> could be set to
- <literal>off</literal>.
+ You can use <command>WAIT FOR</command> command to wait for
+ the <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
<programlisting>
postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
UPDATE 100
postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
(1 row)
</programlisting>
@@ -198,9 +267,9 @@ pg_current_wal_insert_lsn
changes made on primary should be guaranteed to be visible on replica.
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
status
---------
+---------
success
(1 row)
postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +280,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
</para>
<para>
- If the target LSN is not reached before the timeout, the error is thrown.
+ Wait for flush (data durable on replica):
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
</programlisting>
</para>
<para>
The same example uses <command>WAIT FOR</command> with
- <parameter>NO_THROW</parameter> option.
+ <parameter>NO_THROW</parameter> option:
+
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
status
---------
+---------
timeout
(1 row)
</programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b145515269..5b2a262ff8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6240,10 +6240,12 @@ StartupXLOG(void)
LWLockRelease(ControlFileLock);
/*
- * Wake up all waiters for replay LSN. They need to report an error that
- * recovery was ended before reaching the target LSN.
+ * Wake up all waiters. They need to report an error that recovery was
+ * ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index 43b37095afb..05ad84fdb5b 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
*
* wait.c
* Implements WAIT FOR, which allows waiting for events such as
- * time passing or LSN having been replayed on replica.
+ * time passing or LSN having been replayed, flushed, or written.
*
* Portions Copyright (c) 2025, PostgreSQL Global Development Group
*
@@ -15,6 +15,7 @@
#include <math.h>
+#include "access/xlog.h"
#include "access/xlogrecovery.h"
#include "access/xlogwait.h"
#include "commands/defrem.h"
@@ -28,12 +29,28 @@
#include "utils/snapmgr.h"
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+ const char *noun; /* "replay", "flush", "write" */
+ const char *verb; /* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+ [WAIT_LSN_TYPE_REPLAY_STANDBY] = {"replay", "replayed"},
+ [WAIT_LSN_TYPE_WRITE_STANDBY] = {"write", "written"},
+ [WAIT_LSN_TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
+};
+
void
ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
{
XLogRecPtr lsn;
int64 timeout = 0;
WaitLSNResult waitLSNResult;
+ WaitLSNType lsnType;
bool throw = true;
TupleDesc tupdesc;
TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
CStringGetDatum(stmt->lsn_literal)));
+ /* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+ switch (stmt->mode)
+ {
+ case WAIT_LSN_MODE_REPLAY:
+ lsnType = WAIT_LSN_TYPE_REPLAY_STANDBY;
+ break;
+ case WAIT_LSN_MODE_WRITE:
+ lsnType = WAIT_LSN_TYPE_WRITE_STANDBY;
+ break;
+ case WAIT_LSN_MODE_FLUSH:
+ lsnType = WAIT_LSN_TYPE_FLUSH_STANDBY;
+ break;
+ default:
+ elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+ }
+
foreach_node(DefElem, defel, stmt->options)
{
if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
}
/*
- * We are going to wait for the LSN replay. We should first care that we
- * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * We are going to wait for the LSN. We should first care that we don't
+ * hold a snapshot and correspondingly our MyProc->xmin is invalid.
* Otherwise, our snapshot could prevent the replay of WAL records
* implying a kind of self-deadlock. This is the reason why WAIT FOR is a
* command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
+ waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_TIMEOUT:
if (throw)
+ {
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
- errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
+ }
else
result = "timeout";
break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errhint("Waiting for the replay LSN can only be executed during recovery."));
+ errhint("Waiting for the %s LSN can only be executed during recovery.",
+ desc->noun));
}
else
result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c3a0a354a9c..57b3e91893c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -640,6 +640,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <windef> window_definition over_clause window_specification
opt_frame_clause frame_extent frame_bound
%type <ival> null_treatment opt_window_exclusion_clause
+%type <ival> opt_wait_lsn_mode
%type <str> opt_existing_window_name
%type <boolean> opt_if_not_exists
%type <boolean> opt_unique_null_treatment
@@ -729,7 +730,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
EXPRESSION EXTENSION EXTERNAL EXTRACT
- FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+ FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -770,7 +771,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -16489,15 +16490,23 @@ xml_passing_mech:
*****************************************************************************/
WaitStmt:
- WAIT FOR LSN_P Sconst opt_wait_with_clause
+ WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
{
WaitStmt *n = makeNode(WaitStmt);
n->lsn_literal = $4;
- n->options = $5;
+ n->mode = $5;
+ n->options = $6;
$$ = (Node *) n;
}
;
+opt_wait_lsn_mode:
+ MODE REPLAY { $$ = WAIT_LSN_MODE_REPLAY; }
+ | MODE FLUSH { $$ = WAIT_LSN_MODE_FLUSH; }
+ | MODE WRITE { $$ = WAIT_LSN_MODE_WRITE; }
+ | /*EMPTY*/ { $$ = WAIT_LSN_MODE_REPLAY; }
+ ;
+
opt_wait_with_clause:
WITH '(' utility_option_list ')' { $$ = $3; }
| /*EMPTY*/ { $$ = NIL; }
@@ -17937,6 +17946,7 @@ unreserved_keyword:
| FILTER
| FINALIZE
| FIRST_P
+ | FLUSH
| FOLLOWING
| FORCE
| FORMAT
@@ -18071,6 +18081,7 @@ unreserved_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESPECT_P
@@ -18524,6 +18535,7 @@ bare_label_keyword:
| FINALIZE
| FIRST_P
| FLOAT_P
+ | FLUSH
| FOLLOWING
| FORCE
| FOREIGN
@@ -18706,6 +18718,7 @@ bare_label_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 4217fc54e2e..be2971408e7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "catalog/pg_authid.h"
#include "funcapi.h"
#include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
/* Update shared-memory status */
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ /*
+ * If we wrote an LSN that someone was waiting for then walk over the
+ * shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Write >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_WRITE_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_WRITE_STANDBY, LogstreamResult.Write);
+
/*
* Close the current segment if it's fully written up in the last cycle of
* the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
}
SpinLockRelease(&walrcv->mutex);
+ /*
+ * If we flushed an LSN that someone was waiting for then walk over
+ * the shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Flush >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_FLUSH_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_FLUSH_STANDBY, LogstreamResult.Flush);
+
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index d14294a4ece..bbaf3242ccb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4385,10 +4385,21 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+ WAIT_LSN_MODE_REPLAY, /* Wait for LSN replay on standby */
+ WAIT_LSN_MODE_WRITE, /* Wait for LSN write on standby */
+ WAIT_LSN_MODE_FLUSH /* Wait for LSN flush on standby */
+} WaitLSNMode;
+
typedef struct WaitStmt
{
NodeTag type;
char *lsn_literal; /* LSN string from grammar */
+ WaitLSNMode mode; /* Wait mode: REPLAY/FLUSH/WRITE */
List *options; /* List of DefElem nodes */
} WaitStmt;
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5d4fe27ef96..7ad8b11b725 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -378,6 +379,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..df7b563cfbb 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
# the WAIT FOR command.
use strict;
use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+ my ($node) = @_;
+ $saved_primary_conninfo = $node->safe_psql('postgres',
+ "SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+ my ($node) = @_;
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
# Initialize primary node
my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
ok((split("\n", $output))[-1] eq 30,
"standby reached the same LSN as primary");
-# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+ SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout. The
# unreachable LSN must be well in advance. So WAL records issued by
# the concurrent autovacuum could not affect that.
my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
# within another function, or inside a transaction with an isolation level
# higher than READ COMMITTED.
@@ -125,7 +184,7 @@ ok( $stderr =~
/WAIT FOR must be only called without an active or registered snapshot/,
"get an error when running within another function");
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
my $test_lsn =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
@@ -208,7 +267,7 @@ $node_standby->psql(
ok( $stderr =~ /option "invalid_option" not recognized/,
"get error for invalid WITH clause option");
-# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters. We make 5 background
# psql sessions each waiting for a corresponding insertion. When waiting is
# finished, stored procedures logs if there are visible as many rows as
# should be.
@@ -226,7 +285,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
\$\$
LANGUAGE plpgsql;
]);
+
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
my @psql_sessions;
for (my $i = 0; $i < 5; $i++)
{
@@ -239,10 +300,11 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->query_until(
qr/start/, qq[
\\echo start
- WAIT FOR LSN '${lsn}';
+ WAIT FOR LSN '${lsn}' MODE REPLAY;
SELECT log_count(${i});
]);
}
+
my $log_offset = -s $node_standby->logfile;
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
for (my $i = 0; $i < 5; $i++)
@@ -251,23 +313,199 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->quit;
}
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (100 + ${i});");
+ $write_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $write_sessions[$i] = $node_standby->background_psql('postgres');
+ $write_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("write_done $i", $write_log_offset);
+ $write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);");
+
+ok($output >= 0,
+ "multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (200 + ${i});");
+ $flush_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $flush_sessions[$i] = $node_standby->background_psql('postgres');
+ $flush_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+ $flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently. We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN. We stop the walreceiver and pause replay to
+# ensure all waiters block. Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+ $mixed_sessions[$i] = $node_standby->background_psql('postgres');
+ $mixed_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
-# 7. Check that the standby promotion terminates the wait on LSN. Start
-# waiting for an unreachable LSN then promote. Check the log for the relevant
-# error message. Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+ $node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+ $mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+ "mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes. Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote. Check the log for the relevant error messages. Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
my $lsn4 =
$node_primary->safe_psql('postgres',
"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
my $lsn5 =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
- qr/start/, qq[
- \\echo start
- WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i] = $node_standby->background_psql('postgres');
+ $wait_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+ ]);
+}
# Make sure standby will be promoted at least at the primary insert LSN we
# have just observed. Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +515,24 @@ $node_primary->wait_for_catchup($node_standby);
$log_offset = -s $node_standby->logfile;
$node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has a distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+ $log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+ $log_offset);
+$node_standby->wait_for_log(
+ qr/Recovery ended before target LSN.*was replayed/, $log_offset);
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
ok(1, 'wait for already replayed LSN exits immediately even after promotion');
$output = $node_standby->safe_psql(
'postgres', qq[
- WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
ok($output eq "not in recovery",
"WAIT FOR returns correct status after standby promotion");
@@ -295,8 +540,11 @@ ok($output eq "not in recovery",
$node_standby->stop;
$node_primary->stop;
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
# already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i]->{run}->finish;
+}
done_testing();
--
2.51.0
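
For readers skimming the patch above: the two ideas in the ExecWaitStmt() changes are the conversion of the parse-time mode to a runtime wait type, and the table-driven wording of the error message. The following is a minimal standalone sketch of both, not the patch's code. The Sketch* type names, sketch_mode_to_type() and sketch_format_timeout() are hypothetical names introduced here for illustration, and a plain uint64_t stands in for XLogRecPtr. Only the descriptor words and the "%X/%08X" format string are taken from the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Parse-time mode; stands in for WaitLSNMode from the patch. */
typedef enum
{
	MODE_REPLAY,
	MODE_WRITE,
	MODE_FLUSH
} SketchWaitMode;

/* Runtime wait type; stands in for the *_STANDBY members of WaitLSNType. */
typedef enum
{
	TYPE_REPLAY_STANDBY,
	TYPE_WRITE_STANDBY,
	TYPE_FLUSH_STANDBY,
	TYPE_COUNT
} SketchWaitType;

/* Words used to build messages, indexed by wait type. */
typedef struct
{
	const char *noun;			/* "replay", "write", "flush" */
	const char *verb;			/* "replayed", "written", "flushed" */
} SketchTypeDesc;

static const SketchTypeDesc sketch_descs[TYPE_COUNT] = {
	[TYPE_REPLAY_STANDBY] = {"replay", "replayed"},
	[TYPE_WRITE_STANDBY] = {"write", "written"},
	[TYPE_FLUSH_STANDBY] = {"flush", "flushed"},
};

/* Map a mode to a wait type; returns 0 on success, -1 for an unknown mode. */
static int
sketch_mode_to_type(SketchWaitMode mode, SketchWaitType *type)
{
	switch (mode)
	{
		case MODE_REPLAY:
			*type = TYPE_REPLAY_STANDBY;
			return 0;
		case MODE_WRITE:
			*type = TYPE_WRITE_STANDBY;
			return 0;
		case MODE_FLUSH:
			*type = TYPE_FLUSH_STANDBY;
			return 0;
	}
	return -1;
}

/*
 * Hypothetical helper: format the timeout message for the given wait type.
 * The LSN is printed as high/low 32-bit halves, as in the patch's format.
 */
static int
sketch_format_timeout(char *buf, size_t buflen, SketchWaitType type,
					  uint64_t target, uint64_t current)
{
	const SketchTypeDesc *desc;

	assert(type < TYPE_COUNT);
	desc = &sketch_descs[type];

	return snprintf(buf, buflen,
					"timed out while waiting for target LSN %X/%08X to be %s; "
					"current %s LSN %X/%08X",
					(unsigned int) (target >> 32), (unsigned int) target,
					desc->verb, desc->noun,
					(unsigned int) (current >> 32), (unsigned int) current);
}
```

Keeping the noun and verb in one table means a new wait type needs only one extra table row rather than another copy of each message, which seems to be the motivation for WaitLSNTypeDescs in the patch.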
Attachment: v4-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From af5b59d0e065ecb2f7b68c0eec8e55b892a5a435 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 19:18:06 +0800
Subject: [PATCH v4 3/4] Add tab completion for WAIT FOR LSN MODE parameter
Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 11 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 20d7a65c614..fcb9f19faef 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5313,10 +5313,11 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
/*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
* where option can be:
* TIMEOUT '<timeout>'
* NO_THROW
+ * MODE defaults to REPLAY if not specified.
*/
else if (Matches("WAIT"))
COMPLETE_WITH("FOR");
@@ -5325,25 +5326,41 @@ match_previous_words(int pattern_id,
else if (Matches("WAIT", "FOR", "LSN"))
/* No completion for LSN value - user must provide manually */
;
+
+ /*
+ * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+ * REPLAY
+ */
else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+ COMPLETE_WITH("MODE", "WITH");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+ COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
COMPLETE_WITH("WITH");
+ /* WITH directly after LSN (using default REPLAY mode) */
else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
COMPLETE_WITH("(");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+ COMPLETE_WITH("(");
+
+ /*
+ * Handle parenthesized option list (both with and without explicit MODE).
+ * This fires when we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as one
+ * word, so the above test is correct. timeout takes a string value,
+ * no_throw takes no value. We don't offer completions for these values.
+ */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
- /*
- * This fires if we're in an unfinished parenthesized option list.
- * get_previous_words treats a completed parenthesized option list as
- * one word, so the above test is correct.
- */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("timeout", "no_throw");
-
- /*
- * timeout takes a string value, no_throw takes no value. We don't
- * offer completions for these values.
- */
+ }
+ else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+ !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+ {
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("timeout", "no_throw");
}
/* WITH [RECURSIVE] */
--
2.51.0
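
To make the completion flow in the patch above easier to follow, here is a rough standalone sketch of the same decision sequence. It is only an illustration under simplifying assumptions: sketch_wait_for_lsn_next() is a hypothetical function, it takes the already-split, upper-cased words typed after "WAIT FOR LSN", and it returns the candidates as a space-separated string. It does not use psql's Matches()/HeadMatches() macros and does not reproduce how get_previous_words groups a parenthesized list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: words[0] is the LSN literal, followed by whatever
 * the user has typed so far.  Returns the completion candidates, or ""
 * when nothing is offered.
 */
static const char *
sketch_wait_for_lsn_next(const char **words, int nwords)
{
	int			with_pos = 1;	/* where WITH is expected without MODE */
	const char *last;
	size_t		len;

	assert(words != NULL || nwords == 0);

	if (nwords < 1)
		return "";				/* the LSN value must be typed manually */
	if (nwords == 1)
		return "MODE WITH";		/* MODE is optional, so offer both */

	if (strcmp(words[1], "MODE") == 0)
	{
		if (nwords == 2)
			return "REPLAY FLUSH WRITE";
		if (nwords == 3)
			return "WITH";
		with_pos = 3;			/* skip over "MODE <mode>" */
	}

	if (strcmp(words[with_pos], "WITH") != 0)
		return "";
	if (nwords == with_pos + 1)
		return "(";

	/* Inside the option list: offer names only after "(" or ",". */
	last = words[nwords - 1];
	len = strlen(last);
	if (len > 0 && (last[len - 1] == '(' || last[len - 1] == ','))
		return "timeout no_throw";
	return "";
}
```

The sketch skips over an optional "MODE <mode>" pair once and then shares the WITH handling, whereas the patch spells out both variants as separate Matches()/HeadMatches() branches. Whether that consolidation is practical with the real macros is not something this sketch demonstrates.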
Attachment: v4-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From da210bfc2b62d9a38ea54b94037380144753663a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 25 Nov 2025 17:58:28 +0800
Subject: [PATCH v4] Extend xlogwait infrastructure with write and flush wait
types
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.
Key changes:
- Add WAIT_LSN_TYPE_WRITE_STANDBY and WAIT_LSN_TYPE_FLUSH_STANDBY to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
src/backend/access/transam/xlog.c | 2 +-
src/backend/access/transam/xlogrecovery.c | 4 +-
src/backend/access/transam/xlogwait.c | 84 ++++++++++++++-----
src/backend/commands/wait.c | 2 +-
.../utils/activity/wait_event_names.txt | 3 +-
src/include/access/xlogwait.h | 13 ++-
6 files changed, 81 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 22d0a2e8c3a..4b145515269 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6243,7 +6243,7 @@ StartupXLOG(void)
* Wake up all waiters for replay LSN. They need to report an error that
* recovery was ended before reaching the target LSN.
*/
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 21b8f179ba0..243c0b368a9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
*/
if (waitLSNState &&
(XLogRecoveryCtl->lastReplayedEndRecPtr >=
- pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY_STANDBY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY_STANDBY, XLogRecoveryCtl->lastReplayedEndRecPtr);
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 98aa5f1e4a2..21823acee9c 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
* This file implements waiting for WAL operations to reach specific LSNs
* on both physical standby and primary servers. The core idea is simple:
* every process that wants to wait publishes the LSN it needs to the
- * shared memory, and the appropriate process (startup on standby, or
- * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ * shared memory, and the appropriate process (startup on standby,
+ * walreceiver on standby, or WAL writer/backend on primary) wakes it
+ * once that LSN has been reached.
*
* The shared memory used by this module comprises a procInfos
* per-backend array with the information of the awaited LSN for each
* of the backend processes. The elements of that array are organized
- * into a pairing heap waitersHeap, which allows for very fast finding
- * of the least awaited LSN.
+ * into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ * allows for very fast finding of the least awaited LSN for each type.
*
- * In addition, the least-awaited LSN is cached as minWaitedLSN. The
- * waiter process publishes information about itself to the shared
- * memory and waits on the latch until it is woken up by the appropriate
- * process, standby is promoted, or the postmaster dies. Then, it cleans
- * information about itself in the shared memory.
+ * In addition, the least-awaited LSN for each type is cached in the
+ * minWaitedLSN array. The waiter process publishes information about
+ * itself to the shared memory and waits on the latch until it is woken
+ * up by the appropriate process, standby is promoted, or the postmaster
+ * dies. Then, it cleans information about itself in the shared memory.
*
- * On standby servers: After replaying a WAL record, the startup process
- * first performs a fast path check minWaitedLSN > replayLSN. If this
- * check is negative, it checks waitersHeap and wakes up the backend
- * whose awaited LSNs are reached.
+ * On standby servers:
+ * - After replaying a WAL record, the startup process performs a fast
+ * path check minWaitedLSN[REPLAY] > replayLSN. If this check is
+ * negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ * whose awaited LSNs are reached.
+ * - After receiving WAL, the walreceiver process performs similar checks
+ * against the flush and write LSNs, waking up waiters in the FLUSH
+ * and WRITE heaps respectively.
*
* On primary servers: After flushing WAL, the WAL writer or backend
* process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -62,6 +68,48 @@ static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
struct WaitLSNState *waitLSNState = NULL;
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+ [WAIT_LSN_TYPE_REPLAY_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+ [WAIT_LSN_TYPE_WRITE_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+ [WAIT_LSN_TYPE_FLUSH_STANDBY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+ [WAIT_LSN_TYPE_FLUSH_PRIMARY] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+ "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+ switch (lsnType)
+ {
+ case WAIT_LSN_TYPE_REPLAY_STANDBY:
+ return GetXLogReplayRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_WRITE_STANDBY:
+ return GetWalRcvWriteRecPtr();
+
+ case WAIT_LSN_TYPE_FLUSH_STANDBY:
+ return GetWalRcvFlushRecPtr(NULL, NULL);
+
+ case WAIT_LSN_TYPE_FLUSH_PRIMARY:
+ return GetFlushRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_COUNT:
+ break;
+ }
+
+ elog(ERROR, "invalid LSN wait type: %d", lsnType);
+ pg_unreachable();
+}
+
/* Report the amount of shared memory space needed for WaitLSNState. */
Size
WaitLSNShmemSize(void)
@@ -341,13 +389,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
int rc;
long delay_ms = -1;
- if (lsnType == WAIT_LSN_TYPE_REPLAY)
- currentLSN = GetXLogReplayRecPtr(NULL);
- else
- currentLSN = GetFlushRecPtr(NULL);
+ /* Get current LSN for the wait type */
+ currentLSN = GetCurrentLSNForWaitType(lsnType);
/* Check that recovery is still in-progress */
- if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+ if (lsnType != WAIT_LSN_TYPE_FLUSH_PRIMARY && !RecoveryInProgress())
{
/*
* Recovery was ended, but check if target LSN was already
@@ -376,7 +422,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
CHECK_FOR_INTERRUPTS();
rc = WaitLatch(MyLatch, wake_events, delay_ms,
- (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+ WaitLSNWaitEvents[lsnType]);
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..43b37095afb 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY_STANDBY, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..fbcdb92dcfb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary or standby."
WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE "Waiting for WAL write to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index e607441d618..9721a7a7195 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,9 +35,15 @@ typedef enum
*/
typedef enum WaitLSNType
{
- WAIT_LSN_TYPE_REPLAY = 0, /* Waiting for replay on standby */
- WAIT_LSN_TYPE_FLUSH = 1, /* Waiting for flush on primary */
- WAIT_LSN_TYPE_COUNT = 2
+ /* Standby wait types (walreceiver/startup wakes) */
+ WAIT_LSN_TYPE_REPLAY_STANDBY = 0,
+ WAIT_LSN_TYPE_WRITE_STANDBY = 1,
+ WAIT_LSN_TYPE_FLUSH_STANDBY = 2,
+
+ /* Primary wait types (WAL writer/backends wake) */
+ WAIT_LSN_TYPE_FLUSH_PRIMARY = 3,
+
+ WAIT_LSN_TYPE_COUNT = 4
} WaitLSNType;
/*
@@ -96,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
extern Size WaitLSNShmemSize(void);
extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
--
2.51.0
Hi,
On Tue, Dec 2, 2025 at 6:10 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Tue, Dec 2, 2025 at 11:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Mon, Dec 1, 2025 at 12:33 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi hackers,
On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi!
At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.
I would say that adding mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for certain transaction to be
flushed at the replica (without delaying the commit at the primary).
------
Regards,
Alexander Korotkov
Supabase
Makes sense. I'll play with it and try to prepare a follow-up patch.
--
Best,
Xuneng
In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.
Syntax Option A: Mode in the WITH Clause
WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');
With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.
Syntax Option B: Mode as Part of the Main Command
WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';
Or a more concise variant using keywords:
WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;
This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.
I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.
I've implemented a patch that adds MODE support to WAIT FOR LSN.
The new grammar looks like:
-------
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
-------
Two modes added: flush and write.
Design decisions:
1. MODE as a separate keyword (not in WITH clause) - This follows the
pattern used by LOCK command. It also makes the common case more
concise.
2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.
3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`
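
Decisions 2 and 3 above can be pictured with a small standalone C sketch. It is not code from the patch: the enum names and both helper functions are hypothetical stand-ins, and passing NULL for an omitted MODE clause is an assumption made only for illustration. The two enums are deliberately ordered differently to show why mapping through an explicit switch (which a later revision in this thread mentions adopting) is safer than a direct cast.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parser-side mode: a keyword becomes an enum, not a string. */
typedef enum
{
	SKETCH_MODE_REPLAY,
	SKETCH_MODE_FLUSH,
	SKETCH_MODE_WRITE
} SketchWaitMode;

/* Hypothetical executor-side wait type, loosely modeled on WaitLSNType. */
typedef enum
{
	SKETCH_TYPE_REPLAY,
	SKETCH_TYPE_WRITE,
	SKETCH_TYPE_FLUSH
} SketchWaitType;

/* Decision 2: an omitted MODE clause (NULL in this sketch) means REPLAY. */
SketchWaitMode
sketch_effective_mode(const SketchWaitMode *mode_clause)
{
	return (mode_clause != NULL) ? *mode_clause : SKETCH_MODE_REPLAY;
}

/*
 * Map a mode to a wait type with an explicit switch, so the result does
 * not depend on the two enums sharing numeric values.  Returns false for
 * a value outside the enum.
 */
bool
sketch_mode_to_type(SketchWaitMode mode, SketchWaitType *type)
{
	assert(type != NULL);

	switch (mode)
	{
		case SKETCH_MODE_REPLAY:
			*type = SKETCH_TYPE_REPLAY;
			return true;
		case SKETCH_MODE_WRITE:
			*type = SKETCH_TYPE_WRITE;
			return true;
		case SKETCH_MODE_FLUSH:
			*type = SKETCH_TYPE_FLUSH;
			return true;
	}
	return false;
}
```

With a cast instead of the switch, SKETCH_MODE_FLUSH (value 1 here) would silently become SKETCH_TYPE_WRITE; the switch keeps the mapping correct even if either enum is reordered later.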
The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types
Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.
-------
0002 - Add pg_last_wal_write_lsn() SQL function
Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.
-------
0003 - Add MODE parameter to WAIT FOR LSN command
Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.
-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter
Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.
-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()
Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.
The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.
Feedback welcome.
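
To make the per-type bookkeeping in patch 0001 more concrete, here is a minimal standalone C sketch of the fast-path check that the xlogwait.c header comment describes: one cached minimum per wait type, compared against the LSN the waking process has just reached. It is a single-process model with hypothetical names throughout. The real module keeps this state in shared memory with atomics and pairing heaps, and the UINT64_MAX "no waiter" sentinel is an assumption of this sketch rather than something shown in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for XLogRecPtr. */
typedef uint64_t SketchLsn;

/* Hypothetical wait types, one slot each in the cache below. */
typedef enum
{
	SKETCH_STANDBY_REPLAY,
	SKETCH_STANDBY_WRITE,
	SKETCH_STANDBY_FLUSH,
	SKETCH_PRIMARY_FLUSH,
	SKETCH_TYPE_COUNT
} SketchLsnType;

/* Assumed sentinel: larger than any real LSN, so the fast path never fires. */
#define SKETCH_NO_WAITER UINT64_MAX

/* Cached least-awaited LSN per wait type. */
static SketchLsn sketch_min_waited[] = {
	[SKETCH_STANDBY_REPLAY] = SKETCH_NO_WAITER,
	[SKETCH_STANDBY_WRITE] = SKETCH_NO_WAITER,
	[SKETCH_STANDBY_FLUSH] = SKETCH_NO_WAITER,
	[SKETCH_PRIMARY_FLUSH] = SKETCH_NO_WAITER,
};

/* Same idea as the patch's StaticAssertDecl: table and enum must agree. */
static_assert(sizeof(sketch_min_waited) / sizeof(sketch_min_waited[0]) ==
			  SKETCH_TYPE_COUNT,
			  "sketch_min_waited must match SketchLsnType");

/* A waiter publishes its target; the cached minimum only ever goes down here. */
void
sketch_add_waiter(SketchLsnType type, SketchLsn target)
{
	if (target < sketch_min_waited[type])
		sketch_min_waited[type] = target;
}

/* Fast path for the waking process: is any waiter of this type satisfied? */
bool
sketch_needs_wakeup(SketchLsnType type, SketchLsn current)
{
	return current >= sketch_min_waited[type];
}
```

Because each type has its own slot, a FLUSH waiter does not cause extra work in the replay path: the startup process compares only against the REPLAY slot, and walreceiver only against the WRITE and FLUSH slots.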
Here is the updated v2 patch set. Most of the updates are in patch 3.
Changes from v1:
Patch 1 (Extend wait types in xlogwait infra):
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)
Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description
Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fix the timing issue in TAP test 8 when waiting for all three sessions
to get their errors after promotion.
--
Best,
Xuneng
Here is the updated v3. The changes are made to patch 3:
- Refactor duplicated TAP test code by extracting helper routines for
starting and stopping walreceiver.
- Increase the number of concurrent WRITE and FLUSH waiters in tests
7b and 7c from three to five, matching the number in test 7a.
--
Best,
Xuneng
Just realized that patch 2 in prior emails could be dropped for
simplicity. Since the write LSN can be retrieved directly from
pg_stat_wal_receiver, the TAP test in patch 3 does not require a
separate SQL function for this purpose alone.
Just a rebase with minor changes to the wait-LSN types.
--
Best,
Xuneng
Attachments:
v5-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From 9b5e818ed2807a7c2eb3ac743cbf4dfe8103ea6d Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v5 3/4] Add tab completion for WAIT FOR LSN MODE parameter
Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 11 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b1ff6f6cd94..8f269b5cb13 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5327,10 +5327,11 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
/*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
* where option can be:
* TIMEOUT '<timeout>'
* NO_THROW
+ * MODE defaults to REPLAY if not specified.
*/
else if (Matches("WAIT"))
COMPLETE_WITH("FOR");
@@ -5339,25 +5340,41 @@ match_previous_words(int pattern_id,
else if (Matches("WAIT", "FOR", "LSN"))
/* No completion for LSN value - user must provide manually */
;
+
+ /*
+ * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+ * REPLAY
+ */
else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+ COMPLETE_WITH("MODE", "WITH");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+ COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
COMPLETE_WITH("WITH");
+ /* WITH directly after LSN (using default REPLAY mode) */
else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
COMPLETE_WITH("(");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+ COMPLETE_WITH("(");
+
+ /*
+ * Handle parenthesized option list (both with and without explicit MODE).
+ * This fires when we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as one
+ * word, so the above test is correct. timeout takes a string value,
+ * no_throw takes no value. We don't offer completions for these values.
+ */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
- /*
- * This fires if we're in an unfinished parenthesized option list.
- * get_previous_words treats a completed parenthesized option list as
- * one word, so the above test is correct.
- */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("timeout", "no_throw");
-
- /*
- * timeout takes a string value, no_throw takes no value. We don't
- * offer completions for these values.
- */
+ }
+ else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+ !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+ {
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("timeout", "no_throw");
}
/* WITH [RECURSIVE] */
--
2.51.0
v5-0004-Use-WAIT-FOR-LSN-in.patch (application/octet-stream)
From dd82542b2a4961fd050eab70ea66a1c152edefdc Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v5 4/4] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.
The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..eec8233b515 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3335,6 +3335,9 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+ chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
- chomp($isrecovery);
if ($isrecovery eq 't')
{
$target_lsn = $self->lsn('replay');
@@ -3367,6 +3367,35 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+ # This is more efficient than polling pg_stat_replication
+ if (($mode ne 'sent') && ($isrecovery eq 't'))
+ {
+ my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+ # Map mode names to WAIT FOR LSN MODE values (uppercase)
+ my $wait_mode = uc($mode);
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+ chomp($output);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+${output});
+ diag qq(Last pg_stat_replication contents:
+${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
+ # Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.51.0
v5-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch (application/octet-stream)
From 66c509e07bcbaa4580b32266326e34487a16d683 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v5 1/4] Extend xlogwait infrastructure with write and flush
wait types
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.
Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
src/backend/access/transam/xlog.c | 2 +-
src/backend/access/transam/xlogrecovery.c | 4 +-
src/backend/access/transam/xlogwait.c | 84 ++++++++++++++-----
src/backend/commands/wait.c | 2 +-
.../utils/activity/wait_event_names.txt | 3 +-
src/include/access/xlogwait.h | 12 ++-
6 files changed, 80 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a5640df51a..a6e348f2109 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6241,7 +6241,7 @@ StartupXLOG(void)
* Wake up all waiters for replay LSN. They need to report an error that
* recovery was ended before reaching the target LSN.
*/
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae2398d6975..01ffe30ffee 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
*/
if (waitLSNState &&
(XLogRecoveryCtl->lastReplayedEndRecPtr >=
- pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..726a4a14084 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
* This file implements waiting for WAL operations to reach specific LSNs
* on both physical standby and primary servers. The core idea is simple:
* every process that wants to wait publishes the LSN it needs to the
- * shared memory, and the appropriate process (startup on standby, or
- * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ * shared memory, and the appropriate process (startup on standby,
+ * walreceiver on standby, or WAL writer/backend on primary) wakes it
+ * once that LSN has been reached.
*
* The shared memory used by this module comprises a procInfos
* per-backend array with the information of the awaited LSN for each
* of the backend processes. The elements of that array are organized
- * into a pairing heap waitersHeap, which allows for very fast finding
- * of the least awaited LSN.
+ * into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ * allows for very fast finding of the least awaited LSN for each type.
*
- * In addition, the least-awaited LSN is cached as minWaitedLSN. The
- * waiter process publishes information about itself to the shared
- * memory and waits on the latch until it is woken up by the appropriate
- * process, standby is promoted, or the postmaster dies. Then, it cleans
- * information about itself in the shared memory.
+ * In addition, the least-awaited LSN for each type is cached in the
+ * minWaitedLSN array. The waiter process publishes information about
+ * itself to the shared memory and waits on the latch until it is woken
+ * up by the appropriate process, standby is promoted, or the postmaster
+ * dies. Then, it cleans information about itself in the shared memory.
*
- * On standby servers: After replaying a WAL record, the startup process
- * first performs a fast path check minWaitedLSN > replayLSN. If this
- * check is negative, it checks waitersHeap and wakes up the backend
- * whose awaited LSNs are reached.
+ * On standby servers:
+ * - After replaying a WAL record, the startup process performs a fast
+ * path check minWaitedLSN[REPLAY] > replayLSN. If this check is
+ * negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ * whose awaited LSNs are reached.
+ * - After receiving WAL, the walreceiver process performs similar checks
+ * against the flush and write LSNs, waking up waiters in the FLUSH
+ * and WRITE heaps respectively.
*
* On primary servers: After flushing WAL, the WAL writer or backend
* process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -62,6 +68,48 @@ static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
struct WaitLSNState *waitLSNState = NULL;
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+ [WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+ [WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+ [WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+ [WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+ "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+ switch (lsnType)
+ {
+ case WAIT_LSN_TYPE_STANDBY_REPLAY:
+ return GetXLogReplayRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_STANDBY_WRITE:
+ return GetWalRcvWriteRecPtr();
+
+ case WAIT_LSN_TYPE_STANDBY_FLUSH:
+ return GetWalRcvFlushRecPtr(NULL, NULL);
+
+ case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+ return GetFlushRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_COUNT:
+ break;
+ }
+
+ elog(ERROR, "invalid LSN wait type: %d", lsnType);
+ pg_unreachable();
+}
+
/* Report the amount of shared memory space needed for WaitLSNState. */
Size
WaitLSNShmemSize(void)
@@ -341,13 +389,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
int rc;
long delay_ms = -1;
- if (lsnType == WAIT_LSN_TYPE_REPLAY)
- currentLSN = GetXLogReplayRecPtr(NULL);
- else
- currentLSN = GetFlushRecPtr(NULL);
+ /* Get current LSN for the wait type */
+ currentLSN = GetCurrentLSNForWaitType(lsnType);
/* Check that recovery is still in-progress */
- if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+ if (lsnType != WAIT_LSN_TYPE_PRIMARY_FLUSH && !RecoveryInProgress())
{
/*
* Recovery was ended, but check if target LSN was already
@@ -376,7 +422,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
CHECK_FOR_INTERRUPTS();
rc = WaitLatch(MyLatch, wake_events, delay_ms,
- (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+ WaitLSNWaitEvents[lsnType]);
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c0632bf901a..05bd4376c67 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary or standby."
WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE "Waiting for WAL write to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..3b2f34b8698 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,11 +35,16 @@ typedef enum
*/
typedef enum WaitLSNType
{
- WAIT_LSN_TYPE_REPLAY, /* Waiting for replay on standby */
- WAIT_LSN_TYPE_FLUSH, /* Waiting for flush on primary */
+ /* Standby wait types (walreceiver/startup wakes) */
+ WAIT_LSN_TYPE_STANDBY_REPLAY,
+ WAIT_LSN_TYPE_STANDBY_WRITE,
+ WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+ /* Primary wait types (WAL writer/backends wake) */
+ WAIT_LSN_TYPE_PRIMARY_FLUSH,
} WaitLSNType;
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
/*
* WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
extern Size WaitLSNShmemSize(void);
extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
--
2.51.0
v5-0002-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch (application/octet-stream)
From ac01547201b1098c31e9bb46594896b677207bd8 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v5 2/4] Add MODE parameter to WAIT FOR LSN command
Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
- REPLAY (default): Wait for WAL to be replayed up to the specified LSN
- WRITE: Wait for WAL to be written (received) up to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk up to the specified LSN
The default mode is REPLAY, so the command behaves as before when MODE
is not specified. This follows the pattern of the LOCK command, where
the mode parameter is optional and has a sensible default.
The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.
Also includes:
- Documentation updates for the new syntax and refactoring
of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
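For anyone trying the new syntax from a client, here is a minimal sketch
(not part of the patch) of building the statement on the application side.
The helper name build_wait_for_lsn and its arguments are hypothetical; only
the statement shape, WAIT FOR LSN '<lsn>' [MODE ...] [WITH (...)], comes from
the commit message above, and the TIMEOUT and NO_THROW spellings follow the
documentation examples in this patch. The LSN check is a loose client-side
sanity test; the server still does the real parsing.

```python
import re

# Hypothetical client-side helper, not part of the patch.
# A pg_lsn literal is two hexadecimal numbers separated by a slash.
_LSN_RE = re.compile(r"[0-9A-Fa-f]{1,8}/[0-9A-Fa-f]{1,8}")
_MODES = ("REPLAY", "WRITE", "FLUSH")


def build_wait_for_lsn(lsn, mode=None, timeout=None, no_throw=False):
    """Return a WAIT FOR LSN statement in the syntax proposed by this patch."""
    if not _LSN_RE.fullmatch(lsn):
        raise ValueError(f"not a valid pg_lsn literal: {lsn!r}")

    parts = [f"WAIT FOR LSN '{lsn}'"]

    # MODE is optional; when omitted the server defaults to REPLAY.
    if mode is not None:
        mode = mode.upper()
        if mode not in _MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        parts.append(f"MODE {mode}")

    options = []
    if timeout is not None:
        # Reject quotes so the value cannot escape the string literal.
        if "'" in timeout:
            raise ValueError("timeout must not contain quotes")
        options.append(f"TIMEOUT '{timeout}'")
    if no_throw:
        options.append("NO_THROW")
    if options:
        parts.append("WITH (" + ", ".join(options) + ")")

    return " ".join(parts) + ";"
```

The resulting string would be sent to the standby after the application has
recorded the LSN on the primary, for example from pg_current_wal_insert_lsn().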
doc/src/sgml/ref/wait_for.sgml | 192 +++++++++++----
src/backend/access/transam/xlog.c | 6 +-
src/backend/commands/wait.c | 64 ++++-
src/backend/parser/gram.y | 21 +-
src/backend/replication/walreceiver.c | 19 ++
src/include/nodes/parsenodes.h | 11 +
src/include/parser/kwlist.h | 2 +
src/test/recovery/t/049_wait_for_lsn.pl | 295 ++++++++++++++++++++++--
8 files changed, 523 insertions(+), 87 deletions(-)
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..28c68678315 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
<refnamediv>
<refname>WAIT FOR</refname>
- <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
</refnamediv>
<refsynopsisdiv>
<synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+ [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Description</title>
<para>
- Waits until recovery replays <parameter>lsn</parameter>.
- If no <parameter>timeout</parameter> is specified or it is set to
- zero, this command waits indefinitely for the
- <parameter>lsn</parameter>.
- On timeout, or if the server is promoted before
- <parameter>lsn</parameter> is reached, an error is emitted,
- unless <literal>NO_THROW</literal> is specified in the WITH clause.
- If <parameter>NO_THROW</parameter> is specified, then the command
- doesn't throw errors.
+ Waits until the specified <parameter>lsn</parameter> is reached
+ according to the specified <parameter>mode</parameter>,
+ which determines whether to wait for WAL to be written, flushed, or replayed.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <literal>NO_THROW</literal> is specified, the command reports
+ these conditions through its return value instead of an error.
</para>
<para>
- The possible return values are <literal>success</literal>,
- <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
</para>
</refsect1>
@@ -64,6 +67,61 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>MODE</literal></term>
+ <listitem>
+ <para>
+ Specifies the type of LSN processing to wait for. If not specified,
+ the default is <literal>REPLAY</literal>. The valid modes are:
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>REPLAY</literal></term>
+ <listitem>
+ <para>
+ Wait for the LSN to be replayed (applied to the database).
+ After successful completion, <function>pg_last_wal_replay_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>FLUSH</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and flushed to disk. This provides a durability guarantee
+ without waiting for the WAL to be applied. After successful
+ completion, <function>pg_last_wal_receive_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ This value is also available as the <structfield>flushed_lsn</structfield>
+ column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WRITE</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and written to disk, without waiting for a flush. This is faster
+ than <literal>FLUSH</literal> but provides weaker durability
+ guarantees since the data may still be in operating system buffers.
+ After successful completion, the <structfield>written_lsn</structfield>
+ column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link> will show
+ a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
<listitem>
@@ -135,9 +193,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<listitem>
<para>
This return value denotes that the database server is not in a recovery
- state. This might mean either the database server was not in recovery
- at the moment of receiving the command, or it was promoted before
- reaching the target <parameter>lsn</parameter>.
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command (i.e., executed on a primary),
+ or it was promoted before reaching the target <parameter>lsn</parameter>.
+ In the promotion case, this status indicates a timeline change occurred,
+ and the application should re-evaluate whether the target LSN is still
+ relevant.
</para>
</listitem>
</varlistentry>
@@ -148,25 +209,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Notes</title>
<para>
- <command>WAIT FOR</command> command waits till
- <parameter>lsn</parameter> to be replayed on standby.
- That is, after this command execution, the value returned by
- <function>pg_last_wal_replay_lsn</function> should be greater or equal
- to the <parameter>lsn</parameter> value. This is useful to achieve
- read-your-writes-consistency, while using async replica for reads and
- primary for writes. In that case, the <acronym>lsn</acronym> of the last
- modification should be stored on the client application side or the
- connection pooler side.
+ <command>WAIT FOR</command> waits until the specified
+ <parameter>lsn</parameter> is reached according to the specified
+ <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+ for the LSN to be replayed (applied to the database), which is useful
+ to achieve read-your-writes consistency while using an async replica
+ for reads and the primary for writes. The <literal>FLUSH</literal> mode
+ waits for the WAL to be flushed to durable storage on the replica,
+ providing a durability guarantee without waiting for replay. The
+ <literal>WRITE</literal> mode waits for the WAL to be written to the
+ operating system, which is faster than flush but provides weaker
+ durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or
+ the connection pooler side.
</para>
<para>
- <command>WAIT FOR</command> command should be called on standby.
- If a user runs <command>WAIT FOR</command> on primary, it
- will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
- However, if <command>WAIT FOR</command> is
- called on primary promoted from standby and <literal>lsn</literal>
- was already replayed, then the <command>WAIT FOR</command> command just
- exits immediately.
+ <command>WAIT FOR</command> should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified
+ in the WITH clause. However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already reached, then the <command>WAIT FOR</command> command
+ just exits immediately. If the replica is promoted while waiting,
+ the command will return <literal>not in recovery</literal> (or throw
+ an error if <literal>NO_THROW</literal> is not specified). Promotion
+ creates a new timeline, and the LSN being waited for may refer to
+ WAL from the old timeline.
</para>
</refsect1>
@@ -175,21 +244,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Examples</title>
<para>
- You can use <command>WAIT FOR</command> command to wait for
- the <type>pg_lsn</type> value. For example, an application could update
- the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
- changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
- on primary server to get the <acronym>lsn</acronym> given that
- <varname>synchronous_commit</varname> could be set to
- <literal>off</literal>.
+ You can use the <command>WAIT FOR</command> command to wait for
+ a <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and then get the <acronym>lsn</acronym> after
+ the changes it just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
<programlisting>
postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
UPDATE 100
postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
(1 row)
</programlisting>
@@ -198,9 +267,9 @@ pg_current_wal_insert_lsn
changes made on primary should be guaranteed to be visible on replica.
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
status
---------
+---------
success
(1 row)
postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +280,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
</para>
<para>
- If the target LSN is not reached before the timeout, the error is thrown.
+ Wait for flush (data durable on replica):
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
</programlisting>
</para>
<para>
The same example uses <command>WAIT FOR</command> with
- <parameter>NO_THROW</parameter> option.
+ <parameter>NO_THROW</parameter> option:
+
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
status
---------
+---------
timeout
(1 row)
</programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e348f2109..5c6f9feeccc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6238,10 +6238,12 @@ StartupXLOG(void)
LWLockRelease(ControlFileLock);
/*
- * Wake up all waiters for replay LSN. They need to report an error that
- * recovery was ended before reaching the target LSN.
+ * Wake up all waiters. They need to report an error that recovery was
+ * ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..60cf3ee1c9a 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
*
* wait.c
* Implements WAIT FOR, which allows waiting for events such as
- * time passing or LSN having been replayed on replica.
+ * time passing or LSN having been replayed, flushed, or written.
*
* Portions Copyright (c) 2025, PostgreSQL Global Development Group
*
@@ -15,6 +15,7 @@
#include <math.h>
+#include "access/xlog.h"
#include "access/xlogrecovery.h"
#include "access/xlogwait.h"
#include "commands/defrem.h"
@@ -28,12 +29,28 @@
#include "utils/snapmgr.h"
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+ const char *noun; /* "replay", "flush", "write" */
+ const char *verb; /* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+ [WAIT_LSN_TYPE_STANDBY_REPLAY] = {"replay", "replayed"},
+ [WAIT_LSN_TYPE_STANDBY_WRITE] = {"write", "written"},
+ [WAIT_LSN_TYPE_STANDBY_FLUSH] = {"flush", "flushed"},
+};
+
void
ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
{
XLogRecPtr lsn;
int64 timeout = 0;
WaitLSNResult waitLSNResult;
+ WaitLSNType lsnType;
bool throw = true;
TupleDesc tupdesc;
TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
CStringGetDatum(stmt->lsn_literal)));
+ /* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+ switch (stmt->mode)
+ {
+ case WAIT_LSN_MODE_REPLAY:
+ lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+ break;
+ case WAIT_LSN_MODE_WRITE:
+ lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+ break;
+ case WAIT_LSN_MODE_FLUSH:
+ lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+ break;
+ default:
+ elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+ }
+
foreach_node(DefElem, defel, stmt->options)
{
if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
}
/*
- * We are going to wait for the LSN replay. We should first care that we
- * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * We are going to wait for the LSN. First, make sure that we don't hold
+ * a snapshot and, correspondingly, that our MyProc->xmin is invalid.
* Otherwise, our snapshot could prevent the replay of WAL records
* implying a kind of self-deadlock. This is the reason why WAIT FOR is a
* command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_TIMEOUT:
if (throw)
+ {
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
- errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
+ }
else
result = "timeout";
break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errhint("Waiting for the replay LSN can only be executed during recovery."));
+ errhint("Waiting for the %s LSN can only be executed during recovery.",
+ desc->noun));
}
else
result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 28f4e11e30f..94a9e874699 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -641,6 +641,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <windef> window_definition over_clause window_specification
opt_frame_clause frame_extent frame_bound
%type <ival> null_treatment opt_window_exclusion_clause
+%type <ival> opt_wait_lsn_mode
%type <str> opt_existing_window_name
%type <boolean> opt_if_not_exists
%type <boolean> opt_unique_null_treatment
@@ -732,7 +733,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
EXPRESSION EXTENSION EXTERNAL EXTRACT
- FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+ FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -773,7 +774,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -16541,15 +16542,23 @@ xml_passing_mech:
*****************************************************************************/
WaitStmt:
- WAIT FOR LSN_P Sconst opt_wait_with_clause
+ WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
{
WaitStmt *n = makeNode(WaitStmt);
n->lsn_literal = $4;
- n->options = $5;
+ n->mode = $5;
+ n->options = $6;
$$ = (Node *) n;
}
;
+opt_wait_lsn_mode:
+ MODE REPLAY { $$ = WAIT_LSN_MODE_REPLAY; }
+ | MODE FLUSH { $$ = WAIT_LSN_MODE_FLUSH; }
+ | MODE WRITE { $$ = WAIT_LSN_MODE_WRITE; }
+ | /*EMPTY*/ { $$ = WAIT_LSN_MODE_REPLAY; }
+ ;
+
opt_wait_with_clause:
WITH '(' utility_option_list ')' { $$ = $3; }
| /*EMPTY*/ { $$ = NIL; }
@@ -17989,6 +17998,7 @@ unreserved_keyword:
| FILTER
| FINALIZE
| FIRST_P
+ | FLUSH
| FOLLOWING
| FORCE
| FORMAT
@@ -18124,6 +18134,7 @@ unreserved_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESPECT_P
@@ -18578,6 +18589,7 @@ bare_label_keyword:
| FINALIZE
| FIRST_P
| FLOAT_P
+ | FLUSH
| FOLLOWING
| FORCE
| FOREIGN
@@ -18761,6 +18773,7 @@ bare_label_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..e15c5645b9c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "catalog/pg_authid.h"
#include "funcapi.h"
#include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
/* Update shared-memory status */
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ /*
+ * If we wrote an LSN that someone was waiting for then walk over the
+ * shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Write >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
/*
* Close the current segment if it's fully written up in the last cycle of
* the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
}
SpinLockRelease(&walrcv->mutex);
+ /*
+ * If we flushed an LSN that someone was waiting for then walk over
+ * the shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Flush >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index bc7adba4a0f..c4d9f03a6a5 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4413,10 +4413,21 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+ WAIT_LSN_MODE_REPLAY, /* Wait for LSN replay on standby */
+ WAIT_LSN_MODE_WRITE, /* Wait for LSN write on standby */
+ WAIT_LSN_MODE_FLUSH /* Wait for LSN flush on standby */
+} WaitLSNMode;
+
typedef struct WaitStmt
{
NodeTag type;
char *lsn_literal; /* LSN string from grammar */
+ WaitLSNMode mode; /* Wait mode: REPLAY/FLUSH/WRITE */
List *options; /* List of DefElem nodes */
} WaitStmt;
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 9fde58f541c..04008805e46 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -379,6 +380,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..98060a5c79f 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
# the WAIT FOR command.
use strict;
use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+ my ($node) = @_;
+ $saved_primary_conninfo = $node->safe_psql('postgres',
+ "SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+ my ($node) = @_;
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
# Initialize primary node
my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
ok((split("\n", $output))[-1] eq 30,
"standby reached the same LSN as primary");
-# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+ SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout. The
# unreachable LSN must be well in advance. So WAL records issued by
# the concurrent autovacuum could not affect that.
my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
# within another function, or inside a transaction with an isolation level
# higher than READ COMMITTED.
@@ -125,7 +184,7 @@ ok( $stderr =~
/WAIT FOR must be only called without an active or registered snapshot/,
"get an error when running within another function");
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
my $test_lsn =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
@@ -208,7 +267,7 @@ $node_standby->psql(
ok( $stderr =~ /option "invalid_option" not recognized/,
"get error for invalid WITH clause option");
-# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters. We make 5 background
# psql sessions each waiting for a corresponding insertion. When waiting is
# finished, stored procedures logs if there are visible as many rows as
# should be.
@@ -226,7 +285,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
\$\$
LANGUAGE plpgsql;
]);
+
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
my @psql_sessions;
for (my $i = 0; $i < 5; $i++)
{
@@ -239,10 +300,11 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->query_until(
qr/start/, qq[
\\echo start
- WAIT FOR LSN '${lsn}';
+ WAIT FOR LSN '${lsn}' MODE REPLAY;
SELECT log_count(${i});
]);
}
+
my $log_offset = -s $node_standby->logfile;
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
for (my $i = 0; $i < 5; $i++)
@@ -251,23 +313,200 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->quit;
}
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
-# 7. Check that the standby promotion terminates the wait on LSN. Start
-# waiting for an unreachable LSN then promote. Check the log for the relevant
-# error message. Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (100 + ${i});");
+ $write_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $write_sessions[$i] = $node_standby->background_psql('postgres');
+ $write_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("write_done $i", $write_log_offset);
+ $write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (200 + ${i});");
+ $flush_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $flush_sessions[$i] = $node_standby->background_psql('postgres');
+ $flush_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+ $flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently. We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN. We stop the walreceiver and pause replay to
+# ensure all waiters block. Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+ $mixed_sessions[$i] = $node_standby->background_psql('postgres');
+ $mixed_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+ $node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+ $mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+ "mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes. Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote. Check the log for the relevant error messages. Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
my $lsn4 =
$node_primary->safe_psql('postgres',
"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
my $lsn5 =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
- qr/start/, qq[
- \\echo start
- WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i] = $node_standby->background_psql('postgres');
+ $wait_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+ ]);
+}
# Make sure standby will be promoted at least at the primary insert LSN we
# have just observed. Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +516,24 @@ $node_primary->wait_for_catchup($node_standby);
$log_offset = -s $node_standby->logfile;
$node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+ $log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+ $log_offset);
+$node_standby->wait_for_log(
+ qr/Recovery ended before target LSN.*was replayed/, $log_offset);
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
ok(1, 'wait for already replayed LSN exits immediately even after promotion');
$output = $node_standby->safe_psql(
'postgres', qq[
- WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
ok($output eq "not in recovery",
"WAIT FOR returns correct status after standby promotion");
@@ -295,8 +541,11 @@ ok($output eq "not in recovery",
$node_standby->stop;
$node_primary->stop;
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
# already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i]->{run}->finish;
+}
done_testing();
--
2.51.0
Hi,
On Tue, Dec 16, 2025 at 11:28 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Tue, Dec 2, 2025 at 6:10 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Tue, Dec 2, 2025 at 11:08 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Mon, Dec 1, 2025 at 12:33 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi hackers,
On Tue, Nov 25, 2025 at 7:51 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi!
At the moment, the WAIT FOR LSN command supports only the replay mode.
If we intend to extend its functionality more broadly, one option is
to add a mode option or something similar. Are users expected to wait
for flush (or others) completion in such cases? If not, and the TAP
test is the only intended use, this approach might be a bit of an
overkill.
I would say that adding a mode parameter seems to be a pretty natural
extension of what we have at the moment. I can imagine some
clustering solution can use it to wait for a certain transaction to be
flushed at the replica (without delaying the commit at the primary).
------
Regards,
Alexander Korotkov
Supabase
Makes sense. I'll play with it and try to prepare a follow-up patch.
--
Best,
Xuneng
In terms of extending the functionality of the command, I see two
possible approaches here. One is to keep mode as a mandatory keyword,
and the other is to introduce it as an option in the WITH clause.
Syntax Option A: Mode in the WITH Clause
WAIT FOR LSN '0/12345' WITH (mode = 'replay');
WAIT FOR LSN '0/12345' WITH (mode = 'flush');
WAIT FOR LSN '0/12345' WITH (mode = 'write');
With this option, we can keep "replay" as the default mode. That means
existing TAP tests won’t need to be refactored unless they explicitly
want a different mode.
Syntax Option B: Mode as Part of the Main Command
WAIT FOR LSN '0/12345' MODE 'replay';
WAIT FOR LSN '0/12345' MODE 'flush';
WAIT FOR LSN '0/12345' MODE 'write';
Or a more concise variant using keywords:
WAIT FOR LSN '0/12345' REPLAY;
WAIT FOR LSN '0/12345' FLUSH;
WAIT FOR LSN '0/12345' WRITE;
This option produces a cleaner syntax if the intent is simply to wait
for a particular LSN type, without specifying additional options like
timeout or no_throw.
I don’t have a clear preference among them. I’d be interested to hear
what you or others think is the better direction.
I've implemented a patch that adds MODE support to WAIT FOR LSN.
The new grammar looks like:
-------
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
-------
Two modes added: flush and write
Design decisions:
1. MODE as a separate keyword (not in WITH clause) - This follows the
pattern used by LOCK command. It also makes the common case more
concise.
2. REPLAY as the default - When MODE is not specified, it defaults to REPLAY.
3. Keywords rather than strings - Using `MODE WRITE` rather than `MODE 'write'`
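To make the design decisions above concrete, here is a minimal, self-contained
sketch of how a keyword-based mode could be mapped onto an internal wait type
with an explicit switch. All names (DemoWaitMode, DemoWaitType,
demo_mode_to_type) are hypothetical stand-ins, not the identifiers used in the
patch, and the two enums are deliberately declared in different orders to show
why a direct cast between them would be fragile.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical grammar-level modes; REPLAY is what an omitted MODE means. */
typedef enum DemoWaitMode
{
    DEMO_MODE_REPLAY,
    DEMO_MODE_FLUSH,
    DEMO_MODE_WRITE
} DemoWaitMode;

/* Hypothetical internal wait types, ordered differently on purpose. */
typedef enum DemoWaitType
{
    DEMO_TYPE_STANDBY_REPLAY,
    DEMO_TYPE_STANDBY_WRITE,
    DEMO_TYPE_STANDBY_FLUSH
} DemoWaitType;

/*
 * Map a parsed mode to a wait type. With no default label, a compiler
 * that warns on unhandled enum values will flag a newly added mode.
 */
static DemoWaitType
demo_mode_to_type(DemoWaitMode mode)
{
    switch (mode)
    {
        case DEMO_MODE_REPLAY:
            return DEMO_TYPE_STANDBY_REPLAY;
        case DEMO_MODE_WRITE:
            return DEMO_TYPE_STANDBY_WRITE;
        case DEMO_MODE_FLUSH:
            return DEMO_TYPE_STANDBY_FLUSH;
    }

    /* Only reached if an out-of-range value is passed in. */
    assert(!"unrecognized wait mode");
    abort();
}
```

A caller would pass DEMO_MODE_REPLAY whenever the MODE clause is absent, which
keeps the default in one place instead of spreading it across the executor.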
The patch set includes:
-------
0001 - Extend xlogwait infrastructure with write and flush wait types
Adds WAIT_LSN_TYPE_WRITE and WAIT_LSN_TYPE_FLUSH to WaitLSNType enum,
along with corresponding wait events and pairing heaps. Introduces
GetCurrentLSNForWaitType() to retrieve the appropriate LSN based on
wait type, and adds wakeup calls in walreceiver for write/flush
events.
-------
0002 - Add pg_last_wal_write_lsn() SQL function
Adds a new SQL function that returns the current WAL write position on
a standby using GetWalRcvWriteRecPtr(). This complements existing
pg_last_wal_receive_lsn() (flush) and pg_last_wal_replay_lsn()
functions, enabling verification of WAIT FOR LSN MODE WRITE in TAP
tests.
-------
0003 - Add MODE parameter to WAIT FOR LSN command
Extends the parser and executor to support the optional MODE
parameter. Updates documentation with new syntax and mode
descriptions. Adds TAP tests covering all three modes including
mixed-mode concurrent waiters.
-------
0004 - Add tab completion for WAIT FOR LSN MODE parameter
Adds psql tab completion support: completes MODE after LSN value,
completes REPLAY/WRITE/FLUSH after MODE keyword, and completes WITH
after mode selection.
-------
0005 - Use WAIT FOR LSN in PostgreSQL::Test::Cluster::wait_for_catchup()
Replaces polling-based wait_for_catchup() with WAIT FOR LSN when the
target is a standby in recovery, improving test efficiency by avoiding
repeated queries.
The WRITE and FLUSH modes enable scenarios where applications need to
ensure WAL has been received or persisted on the standby without
waiting for replay to complete.
Feedback welcome.
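For readers who want a picture of the per-type wakeup that patch 0001
describes, the following is a rough, self-contained sketch. It assumes a
cached minimum awaited LSN per wait type and uses a plain array scan where the
patch uses pairing heaps. Every name here (DemoWaitState, demo_add_waiter,
demo_wakeup) is hypothetical, and nothing below is taken from xlogwait.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TYPE_COUNT  3          /* e.g. replay, write, flush */
#define DEMO_MAX_WAITERS 8
#define DEMO_NO_WAITERS  UINT64_MAX /* sentinel: nobody is waiting */

typedef struct DemoWaiter
{
    uint64_t target_lsn;
    bool     waiting;               /* cleared once the waiter is woken */
} DemoWaiter;

typedef struct DemoWaitState
{
    uint64_t   min_waited[DEMO_TYPE_COUNT]; /* cached minimum per type */
    DemoWaiter waiters[DEMO_TYPE_COUNT][DEMO_MAX_WAITERS];
} DemoWaitState;

static void
demo_init(DemoWaitState *st)
{
    for (int t = 0; t < DEMO_TYPE_COUNT; t++)
    {
        st->min_waited[t] = DEMO_NO_WAITERS;
        for (int i = 0; i < DEMO_MAX_WAITERS; i++)
            st->waiters[t][i].waiting = false;
    }
}

/* Register a waiter; returns false if the array for this type is full. */
static bool
demo_add_waiter(DemoWaitState *st, int type, uint64_t target_lsn)
{
    assert(type >= 0 && type < DEMO_TYPE_COUNT);
    for (int i = 0; i < DEMO_MAX_WAITERS; i++)
    {
        if (!st->waiters[type][i].waiting)
        {
            st->waiters[type][i].target_lsn = target_lsn;
            st->waiters[type][i].waiting = true;
            if (target_lsn < st->min_waited[type])
                st->min_waited[type] = target_lsn;
            return true;
        }
    }
    return false;
}

/* Wake every waiter of this type whose target has been reached. */
static int
demo_wakeup(DemoWaitState *st, int type, uint64_t current_lsn)
{
    int      woken = 0;
    uint64_t new_min = DEMO_NO_WAITERS;

    assert(type >= 0 && type < DEMO_TYPE_COUNT);

    /* Fast path: even the least awaited LSN is still ahead of us. */
    if (st->min_waited[type] > current_lsn)
        return 0;

    for (int i = 0; i < DEMO_MAX_WAITERS; i++)
    {
        DemoWaiter *w = &st->waiters[type][i];

        if (!w->waiting)
            continue;
        if (w->target_lsn <= current_lsn)
        {
            w->waiting = false;     /* stands in for setting a latch */
            woken++;
        }
        else if (w->target_lsn < new_min)
            new_min = w->target_lsn;
    }
    st->min_waited[type] = new_min;
    return woken;
}
```

The point of the cached minimum is that the process advancing the LSN pays
only one comparison per step when no waiter can be satisfied, and each wait
type is tracked independently so a write-position update never touches replay
waiters. The sketch ignores locking and atomics entirely.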
Here is the updated v2 patch set. Most of the updates are in patch 3.
Changes from v1:
Patch 1 (Extend wait types in xlogwait infra):
- Renamed enum values for consistency (WAIT_LSN_TYPE_REPLAY →
WAIT_LSN_TYPE_REPLAY_STANDBY, etc.)
Patch 2 (pg_last_wal_write_lsn):
- Clarified documentation and comment
- Improved pg_proc.dat description
Patch 3 (MODE parameter):
- Replaced direct cast with explicit switch statement for WaitLSNMode
→ WaitLSNType conversion
- Improved FLUSH/WRITE mode documentation with verification function references
- TAP tests (7b, 7c, 7d): Added walreceiver control for concurrency,
explicit blocking verification via poll_query_until, and log-based
completion verification via wait_for_log
- Fixed the timing issue in TAP test 8 when waiting for all three
sessions to get their errors after promotion.
--
Best,
Xuneng
Here is the updated v3. The changes are made to patch 3:
- Refactor duplicated TAP test code by extracting helper routines for
starting and stopping walreceiver.
- Increase the number of concurrent WRITE and FLUSH waiters in tests
7b and 7c from three to five, matching the number in test 7a.
--
Best,
Xuneng
Just realized that patch 2 in prior emails could be dropped for
simplicity. Since the write LSN can be retrieved directly from
pg_stat_wal_receiver, the TAP test in patch 3 does not require a
separate SQL function for this purpose alone.
Just rebased, with minor changes to the wait-lsn types.
Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.
--
Best,
Xuneng
Attachments:
v6-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch
From 7292901a0119dca75c349cd6f5a460f5cb0e4139 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v6 1/4] Extend xlogwait infrastructure with write and flush
wait types
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.
Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
src/backend/access/transam/xlog.c | 2 +-
src/backend/access/transam/xlogrecovery.c | 4 +-
src/backend/access/transam/xlogwait.c | 81 ++++++++++++++-----
src/backend/commands/wait.c | 2 +-
.../utils/activity/wait_event_names.txt | 3 +-
src/include/access/xlogwait.h | 12 ++-
6 files changed, 77 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a5640df51a..a6e348f2109 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6241,7 +6241,7 @@ StartupXLOG(void)
* Wake up all waiters for replay LSN. They need to report an error that
* recovery was ended before reaching the target LSN.
*/
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae2398d6975..01ffe30ffee 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
*/
if (waitLSNState &&
(XLogRecoveryCtl->lastReplayedEndRecPtr >=
- pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..d54b2fd7ae4 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
* This file implements waiting for WAL operations to reach specific LSNs
* on both physical standby and primary servers. The core idea is simple:
* every process that wants to wait publishes the LSN it needs to the
- * shared memory, and the appropriate process (startup on standby, or
- * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ * shared memory, and the appropriate process (startup on standby,
+ * walreceiver on standby, or WAL writer/backend on primary) wakes it
+ * once that LSN has been reached.
*
* The shared memory used by this module comprises a procInfos
* per-backend array with the information of the awaited LSN for each
* of the backend processes. The elements of that array are organized
- * into a pairing heap waitersHeap, which allows for very fast finding
- * of the least awaited LSN.
+ * into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ * allows for very fast finding of the least awaited LSN for each type.
*
- * In addition, the least-awaited LSN is cached as minWaitedLSN. The
- * waiter process publishes information about itself to the shared
- * memory and waits on the latch until it is woken up by the appropriate
- * process, standby is promoted, or the postmaster dies. Then, it cleans
- * information about itself in the shared memory.
+ * In addition, the least-awaited LSN for each type is cached in the
+ * minWaitedLSN array. The waiter process publishes information about
+ * itself to the shared memory and waits on the latch until it is woken
+ * up by the appropriate process, standby is promoted, or the postmaster
+ * dies. Then, it cleans information about itself in the shared memory.
*
- * On standby servers: After replaying a WAL record, the startup process
- * first performs a fast path check minWaitedLSN > replayLSN. If this
- * check is negative, it checks waitersHeap and wakes up the backend
- * whose awaited LSNs are reached.
+ * On standby servers:
+ * - After replaying a WAL record, the startup process performs a fast
+ * path check minWaitedLSN[REPLAY] > replayLSN. If this check is
+ * negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ * whose awaited LSNs are reached.
+ * - After receiving WAL, the walreceiver process performs similar checks
+ * against the flush and write LSNs, waking up waiters in the FLUSH
+ * and WRITE heaps respectively.
*
* On primary servers: After flushing WAL, the WAL writer or backend
* process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -62,6 +68,45 @@ static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
struct WaitLSNState *waitLSNState = NULL;
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+ [WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+ [WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+ [WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+ [WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+ "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+ switch (lsnType)
+ {
+ case WAIT_LSN_TYPE_STANDBY_REPLAY:
+ return GetXLogReplayRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_STANDBY_WRITE:
+ return GetWalRcvWriteRecPtr();
+
+ case WAIT_LSN_TYPE_STANDBY_FLUSH:
+ return GetWalRcvFlushRecPtr(NULL, NULL);
+
+ case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+ return GetFlushRecPtr(NULL);
+ }
+
+ elog(ERROR, "invalid LSN wait type: %d", lsnType);
+ pg_unreachable();
+}
+
/* Report the amount of shared memory space needed for WaitLSNState. */
Size
WaitLSNShmemSize(void)
@@ -341,13 +386,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
int rc;
long delay_ms = -1;
- if (lsnType == WAIT_LSN_TYPE_REPLAY)
- currentLSN = GetXLogReplayRecPtr(NULL);
- else
- currentLSN = GetFlushRecPtr(NULL);
+ /* Get current LSN for the wait type */
+ currentLSN = GetCurrentLSNForWaitType(lsnType);
/* Check that recovery is still in-progress */
- if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+ if (lsnType != WAIT_LSN_TYPE_PRIMARY_FLUSH && !RecoveryInProgress())
{
/*
* Recovery was ended, but check if target LSN was already
@@ -376,7 +419,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
CHECK_FOR_INTERRUPTS();
rc = WaitLatch(MyLatch, wake_events, delay_ms,
- (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+ WaitLSNWaitEvents[lsnType]);
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c0632bf901a..05bd4376c67 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary or standby."
WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE "Waiting for WAL write to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..3b2f34b8698 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -35,11 +35,16 @@ typedef enum
*/
typedef enum WaitLSNType
{
- WAIT_LSN_TYPE_REPLAY, /* Waiting for replay on standby */
- WAIT_LSN_TYPE_FLUSH, /* Waiting for flush on primary */
+ /* Standby wait types (walreceiver/startup wakes) */
+ WAIT_LSN_TYPE_STANDBY_REPLAY,
+ WAIT_LSN_TYPE_STANDBY_WRITE,
+ WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+ /* Primary wait types (WAL writer/backends wake) */
+ WAIT_LSN_TYPE_PRIMARY_FLUSH,
} WaitLSNType;
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
/*
* WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
extern Size WaitLSNShmemSize(void);
extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
--
2.51.0
v6-0002-Add-MODE-parameter-to-WAIT-FOR-LSN-command.patch
From 0df07ec61ec10096782262d7fcb996e879cf2367 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v6 2/4] Add MODE parameter to WAIT FOR LSN command
Extend the WAIT FOR LSN command with an optional MODE parameter that
specifies which LSN type to wait for:
WAIT FOR LSN '<lsn>' [MODE { REPLAY | WRITE | FLUSH }] [WITH (...)]
- REPLAY (default): Wait for WAL to be replayed to the specified LSN
- WRITE: Wait for WAL to be written (received) to the specified LSN
- FLUSH: Wait for WAL to be flushed to disk at the specified LSN
The default mode is REPLAY, matching the original behavior when MODE
is not specified. This follows the pattern used by LOCK command where
the mode parameter is optional with a sensible default.
The WRITE and FLUSH modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.
Also includes:
- Documentation updates for the new syntax and refactoring
of existing WAIT FOR command documentation
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for WRITE/FLUSH waiters
---
doc/src/sgml/ref/wait_for.sgml | 192 +++++++++++----
src/backend/access/transam/xlog.c | 6 +-
src/backend/commands/wait.c | 64 ++++-
src/backend/parser/gram.y | 21 +-
src/backend/replication/walreceiver.c | 19 ++
src/include/nodes/parsenodes.h | 11 +
src/include/parser/kwlist.h | 2 +
src/test/recovery/t/049_wait_for_lsn.pl | 295 ++++++++++++++++++++++--
8 files changed, 523 insertions(+), 87 deletions(-)
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..28c68678315 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,12 +16,13 @@ PostgreSQL documentation
<refnamediv>
<refname>WAIT FOR</refname>
- <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
</refnamediv>
<refsynopsisdiv>
<synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ MODE { REPLAY | FLUSH | WRITE } ]
+ [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
@@ -34,20 +35,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Description</title>
<para>
- Waits until recovery replays <parameter>lsn</parameter>.
- If no <parameter>timeout</parameter> is specified or it is set to
- zero, this command waits indefinitely for the
- <parameter>lsn</parameter>.
- On timeout, or if the server is promoted before
- <parameter>lsn</parameter> is reached, an error is emitted,
- unless <literal>NO_THROW</literal> is specified in the WITH clause.
- If <parameter>NO_THROW</parameter> is specified, then the command
- doesn't throw errors.
+ Waits until the specified <parameter>lsn</parameter> is reached
+ according to the specified <parameter>mode</parameter>,
+ which determines whether to wait for WAL to be written, flushed, or replayed.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
</para>
<para>
- The possible return values are <literal>success</literal>,
- <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
</para>
</refsect1>
@@ -64,6 +67,61 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>MODE</literal></term>
+ <listitem>
+ <para>
+ Specifies the type of LSN processing to wait for. If not specified,
+ the default is <literal>REPLAY</literal>. The valid modes are:
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>REPLAY</literal></term>
+ <listitem>
+ <para>
+ Wait for the LSN to be replayed (applied to the database).
+ After successful completion, <function>pg_last_wal_replay_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>FLUSH</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and flushed to disk. This provides a durability guarantee
+ without waiting for the WAL to be applied. After successful
+ completion, <function>pg_last_wal_receive_lsn()</function>
+ will return a value greater than or equal to the target LSN.
+ This value is also available as the <structfield>flushed_lsn</structfield>
+ column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>WRITE</literal></term>
+ <listitem>
+ <para>
+ Wait for the WAL containing the LSN to be received from the
+ primary and written to disk, but not yet flushed. This is faster
+ than <literal>FLUSH</literal> but provides weaker durability
+ guarantees since the data may still be in operating system buffers.
+ After successful completion, the <structfield>written_lsn</structfield>
+ column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link> will show
+ a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>WITH ( <replaceable class="parameter">option</replaceable> [, ...] )</literal></term>
<listitem>
@@ -135,9 +193,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<listitem>
<para>
This return value denotes that the database server is not in a recovery
- state. This might mean either the database server was not in recovery
- at the moment of receiving the command, or it was promoted before
- reaching the target <parameter>lsn</parameter>.
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command (i.e., executed on a primary),
+ or it was promoted before reaching the target <parameter>lsn</parameter>.
+ In the promotion case, this status indicates a timeline change occurred,
+ and the application should re-evaluate whether the target LSN is still
+ relevant.
</para>
</listitem>
</varlistentry>
@@ -148,25 +209,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Notes</title>
<para>
- <command>WAIT FOR</command> command waits till
- <parameter>lsn</parameter> to be replayed on standby.
- That is, after this command execution, the value returned by
- <function>pg_last_wal_replay_lsn</function> should be greater or equal
- to the <parameter>lsn</parameter> value. This is useful to achieve
- read-your-writes-consistency, while using async replica for reads and
- primary for writes. In that case, the <acronym>lsn</acronym> of the last
- modification should be stored on the client application side or the
- connection pooler side.
+ <command>WAIT FOR</command> waits until the specified
+ <parameter>lsn</parameter> is reached according to the specified
+ <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+ for the LSN to be replayed (applied to the database), which is useful
+ to achieve read-your-writes consistency while using an async replica
+ for reads and the primary for writes. The <literal>FLUSH</literal> mode
+ waits for the WAL to be flushed to durable storage on the replica,
+ providing a durability guarantee without waiting for replay. The
+ <literal>WRITE</literal> mode waits for the WAL to be written to the
+ operating system, which is faster than flush but provides weaker
+ durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or
+ the connection pooler side.
</para>
<para>
- <command>WAIT FOR</command> command should be called on standby.
- If a user runs <command>WAIT FOR</command> on primary, it
- will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
- However, if <command>WAIT FOR</command> is
- called on primary promoted from standby and <literal>lsn</literal>
- was already replayed, then the <command>WAIT FOR</command> command just
- exits immediately.
+ <command>WAIT FOR</command> should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified
+ in the WITH clause. However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already reached, then the <command>WAIT FOR</command> command
+ just exits immediately. If the replica is promoted while waiting,
+ the command will throw an error (or return
+ <literal>not in recovery</literal> if <literal>NO_THROW</literal> is
+ specified). Promotion creates a new timeline, and the LSN being waited
+ for may refer to WAL from the old timeline.
</para>
</refsect1>
@@ -175,21 +244,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Examples</title>
<para>
- You can use <command>WAIT FOR</command> command to wait for
- the <type>pg_lsn</type> value. For example, an application could update
- the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
- changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
- on primary server to get the <acronym>lsn</acronym> given that
- <varname>synchronous_commit</varname> could be set to
- <literal>off</literal>.
+ You can use the <command>WAIT FOR</command> command to wait for
+ a given <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>LSN</acronym> of
+ the changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on the primary server to get the <acronym>LSN</acronym>, given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
<programlisting>
postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
UPDATE 100
postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
(1 row)
</programlisting>
@@ -198,9 +267,9 @@ pg_current_wal_insert_lsn
changes made on primary should be guaranteed to be visible on replica.
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20';
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY;
status
---------
+---------
success
(1 row)
postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,21 +280,46 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
</para>
<para>
- If the target LSN is not reached before the timeout, the error is thrown.
+ Wait for flush (data durable on replica):
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
+postgres=# WAIT FOR LSN '0/306EE20' MODE FLUSH;
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE WRITE WITH (TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '0.1s');
ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
</programlisting>
</para>
<para>
The same example uses <command>WAIT FOR</command> with
- <parameter>NO_THROW</parameter> option.
+ <parameter>NO_THROW</parameter> option:
+
<programlisting>
-postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
+postgres=# WAIT FOR LSN '0/306EE20' MODE REPLAY WITH (TIMEOUT '100ms', NO_THROW);
status
---------
+---------
timeout
(1 row)
</programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e348f2109..5c6f9feeccc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6238,10 +6238,12 @@ StartupXLOG(void)
LWLockRelease(ControlFileLock);
/*
- * Wake up all waiters for replay LSN. They need to report an error that
- * recovery was ended before reaching the target LSN.
+ * Wake up all waiters. They need to report an error that recovery was
+ * ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..60cf3ee1c9a 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
*
* wait.c
* Implements WAIT FOR, which allows waiting for events such as
- * time passing or LSN having been replayed on replica.
+ * time passing or LSN having been replayed, flushed, or written.
*
* Portions Copyright (c) 2025, PostgreSQL Global Development Group
*
@@ -15,6 +15,7 @@
#include <math.h>
+#include "access/xlog.h"
#include "access/xlogrecovery.h"
#include "access/xlogwait.h"
#include "commands/defrem.h"
@@ -28,12 +29,28 @@
#include "utils/snapmgr.h"
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+ const char *noun; /* "replay", "flush", "write" */
+ const char *verb; /* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+ [WAIT_LSN_TYPE_STANDBY_REPLAY] = {"replay", "replayed"},
+ [WAIT_LSN_TYPE_STANDBY_WRITE] = {"write", "written"},
+ [WAIT_LSN_TYPE_STANDBY_FLUSH] = {"flush", "flushed"},
+};
+
void
ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
{
XLogRecPtr lsn;
int64 timeout = 0;
WaitLSNResult waitLSNResult;
+ WaitLSNType lsnType;
bool throw = true;
TupleDesc tupdesc;
TupOutputState *tstate;
@@ -45,6 +62,22 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
CStringGetDatum(stmt->lsn_literal)));
+ /* Convert parse-time WaitLSNMode to runtime WaitLSNType */
+ switch (stmt->mode)
+ {
+ case WAIT_LSN_MODE_REPLAY:
+ lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+ break;
+ case WAIT_LSN_MODE_WRITE:
+ lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+ break;
+ case WAIT_LSN_MODE_FLUSH:
+ lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+ break;
+ default:
+ elog(ERROR, "unrecognized wait mode: %d", stmt->mode);
+ }
+
foreach_node(DefElem, defel, stmt->options)
{
if (strcmp(defel->defname, "timeout") == 0)
@@ -107,8 +140,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
}
/*
- * We are going to wait for the LSN replay. We should first care that we
- * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * We are going to wait for the LSN. We should first make sure that we
+ * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
* Otherwise, our snapshot could prevent the replay of WAL records
* implying a kind of self-deadlock. This is the reason why WAIT FOR is a
* command, not a procedure or function.
@@ -140,7 +173,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
@@ -154,11 +187,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_TIMEOUT:
if (throw)
+ {
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
- errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
+ }
else
result = "timeout";
break;
@@ -166,20 +206,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errhint("Waiting for the replay LSN can only be executed during recovery."));
+ errhint("Waiting for the %s LSN can only be executed during recovery.",
+ desc->noun));
}
else
result = "not in recovery";
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 28f4e11e30f..94a9e874699 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -641,6 +641,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <windef> window_definition over_clause window_specification
opt_frame_clause frame_extent frame_bound
%type <ival> null_treatment opt_window_exclusion_clause
+%type <ival> opt_wait_lsn_mode
%type <str> opt_existing_window_name
%type <boolean> opt_if_not_exists
%type <boolean> opt_unique_null_treatment
@@ -732,7 +733,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
ESCAPE EVENT EXCEPT EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
EXPRESSION EXTENSION EXTERNAL EXTRACT
- FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FOLLOWING FOR
+ FALSE_P FAMILY FETCH FILTER FINALIZE FIRST_P FLOAT_P FLUSH FOLLOWING FOR
FORCE FOREIGN FORMAT FORWARD FREEZE FROM FULL FUNCTION FUNCTIONS
GENERATED GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING GROUPS
@@ -773,7 +774,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE QUOTES
RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
- REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+ REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLAY REPLICA
RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
ROUTINE ROUTINES ROW ROWS RULE
@@ -16541,15 +16542,23 @@ xml_passing_mech:
*****************************************************************************/
WaitStmt:
- WAIT FOR LSN_P Sconst opt_wait_with_clause
+ WAIT FOR LSN_P Sconst opt_wait_lsn_mode opt_wait_with_clause
{
WaitStmt *n = makeNode(WaitStmt);
n->lsn_literal = $4;
- n->options = $5;
+ n->mode = $5;
+ n->options = $6;
$$ = (Node *) n;
}
;
+opt_wait_lsn_mode:
+ MODE REPLAY { $$ = WAIT_LSN_MODE_REPLAY; }
+ | MODE FLUSH { $$ = WAIT_LSN_MODE_FLUSH; }
+ | MODE WRITE { $$ = WAIT_LSN_MODE_WRITE; }
+ | /*EMPTY*/ { $$ = WAIT_LSN_MODE_REPLAY; }
+ ;
+
opt_wait_with_clause:
WITH '(' utility_option_list ')' { $$ = $3; }
| /*EMPTY*/ { $$ = NIL; }
@@ -17989,6 +17998,7 @@ unreserved_keyword:
| FILTER
| FINALIZE
| FIRST_P
+ | FLUSH
| FOLLOWING
| FORCE
| FORMAT
@@ -18124,6 +18134,7 @@ unreserved_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESPECT_P
@@ -18578,6 +18589,7 @@ bare_label_keyword:
| FINALIZE
| FIRST_P
| FLOAT_P
+ | FLUSH
| FOLLOWING
| FORCE
| FOREIGN
@@ -18761,6 +18773,7 @@ bare_label_keyword:
| RENAME
| REPEATABLE
| REPLACE
+ | REPLAY
| REPLICA
| RESET
| RESTART
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..e15c5645b9c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "catalog/pg_authid.h"
#include "funcapi.h"
#include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
/* Update shared-memory status */
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ /*
+ * If we wrote an LSN that someone was waiting for then walk over the
+ * shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Write >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
/*
* Close the current segment if it's fully written up in the last cycle of
* the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
}
SpinLockRelease(&walrcv->mutex);
+ /*
+ * If we flushed an LSN that someone was waiting for then walk over
+ * the shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Flush >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index bc7adba4a0f..c4d9f03a6a5 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -4413,10 +4413,21 @@ typedef struct DropSubscriptionStmt
DropBehavior behavior; /* RESTRICT or CASCADE behavior */
} DropSubscriptionStmt;
+/*
+ * WaitLSNMode - MODE parameter for WAIT FOR command
+ */
+typedef enum WaitLSNMode
+{
+ WAIT_LSN_MODE_REPLAY, /* Wait for LSN replay on standby */
+ WAIT_LSN_MODE_WRITE, /* Wait for LSN write on standby */
+ WAIT_LSN_MODE_FLUSH /* Wait for LSN flush on standby */
+} WaitLSNMode;
+
typedef struct WaitStmt
{
NodeTag type;
char *lsn_literal; /* LSN string from grammar */
+ WaitLSNMode mode; /* Wait mode: REPLAY/FLUSH/WRITE */
List *options; /* List of DefElem nodes */
} WaitStmt;
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 9fde58f541c..04008805e46 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -176,6 +176,7 @@ PG_KEYWORD("filter", FILTER, UNRESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("finalize", FINALIZE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("first", FIRST_P, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("float", FLOAT_P, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("flush", FLUSH, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("following", FOLLOWING, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("for", FOR, RESERVED_KEYWORD, AS_LABEL)
PG_KEYWORD("force", FORCE, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -379,6 +380,7 @@ PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("replay", REPLAY, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("reset", RESET, UNRESERVED_KEYWORD, BARE_LABEL)
PG_KEYWORD("respect", RESPECT_P, UNRESERVED_KEYWORD, AS_LABEL)
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..98060a5c79f 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
# the WAIT FOR command.
use strict;
use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+ my ($node) = @_;
+ $saved_primary_conninfo = $node->safe_psql('postgres',
+ "SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+ my ($node) = @_;
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
# Initialize primary node
my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
ok((split("\n", $output))[-1] eq 30,
"standby reached the same LSN as primary");
-# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_write}' MODE WRITE WITH (timeout '1d');
+ SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby wrote WAL up to target LSN after WAIT FOR MODE WRITE");
+
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_flush}' MODE FLUSH WITH (timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby flushed WAL up to target LSN after WAIT FOR MODE FLUSH");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout. The
# unreachable LSN must be well in advance. So WAL records issued by
# the concurrent autovacuum could not affect that.
my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
# within another function, or inside a transaction with an isolation level
# higher than READ COMMITTED.
@@ -125,7 +184,7 @@ ok( $stderr =~
/WAIT FOR must be only called without an active or registered snapshot/,
"get an error when running within another function");
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
my $test_lsn =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
@@ -208,7 +267,7 @@ $node_standby->psql(
ok( $stderr =~ /option "invalid_option" not recognized/,
"get error for invalid WITH clause option");
-# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# 7a. Check the scenario of multiple REPLAY waiters. We make 5 background
# psql sessions each waiting for a corresponding insertion. When waiting is
# finished, stored procedures logs if there are visible as many rows as
# should be.
@@ -226,7 +285,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
\$\$
LANGUAGE plpgsql;
]);
+
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
my @psql_sessions;
for (my $i = 0; $i < 5; $i++)
{
@@ -239,10 +300,11 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->query_until(
qr/start/, qq[
\\echo start
- WAIT FOR LSN '${lsn}';
+ WAIT FOR LSN '${lsn}' MODE REPLAY;
SELECT log_count(${i});
]);
}
+
my $log_offset = -s $node_standby->logfile;
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
for (my $i = 0; $i < 5; $i++)
@@ -251,23 +313,200 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->quit;
}
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
-# 7. Check that the standby promotion terminates the wait on LSN. Start
-# waiting for an unreachable LSN then promote. Check the log for the relevant
-# error message. Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (100 + ${i});");
+ $write_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $write_sessions[$i] = $node_standby->background_psql('postgres');
+ $write_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$write_lsns[$i]' MODE WRITE WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("write_done $i", $write_log_offset);
+ $write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (200 + ${i});");
+ $flush_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $flush_sessions[$i] = $node_standby->background_psql('postgres');
+ $flush_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$flush_lsns[$i]' MODE FLUSH WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+ $flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently. We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN. We stop the walreceiver and pause replay to
+# ensure all waiters block. Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
+
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('REPLAY', 'WRITE', 'FLUSH');
+for (my $i = 0; $i < 6; $i++)
+{
+ $mixed_sessions[$i] = $node_standby->background_psql('postgres');
+ $mixed_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${mixed_target_lsn}' MODE $mixed_modes[$i % 3] WITH (timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+ $node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+ $mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+ "mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes. Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote. Check the log for the relevant error messages. Also, check that
+# waiting for an already replayed LSN doesn't cause an error even after promotion.
my $lsn4 =
$node_primary->safe_psql('postgres',
"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
my $lsn5 =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
- qr/start/, qq[
- \\echo start
- WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('REPLAY', 'WRITE', 'FLUSH');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i] = $node_standby->background_psql('postgres');
+ $wait_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}' MODE $wait_modes[$i];
+ ]);
+}
# Make sure standby will be promoted at least at the primary insert LSN we
# have just observed. Use pg_switch_wal() to force the insert LSN to be
@@ -277,17 +516,24 @@ $node_primary->wait_for_catchup($node_standby);
$log_offset = -s $node_standby->logfile;
$node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has a distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+ $log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+ $log_offset);
+$node_standby->wait_for_log(
+ qr/Recovery ended before target LSN.*was replayed/, $log_offset);
-$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
+ok(1, 'promotion interrupted all wait modes');
+
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}' MODE REPLAY;");
ok(1, 'wait for already replayed LSN exits immediately even after promotion');
$output = $node_standby->safe_psql(
'postgres', qq[
- WAIT FOR LSN '${lsn4}' WITH (timeout '10ms', no_throw);]);
+ WAIT FOR LSN '${lsn4}' MODE REPLAY WITH (timeout '10ms', no_throw);]);
ok($output eq "not in recovery",
"WAIT FOR returns correct status after standby promotion");
@@ -295,8 +541,11 @@ ok($output eq "not in recovery",
$node_standby->stop;
$node_primary->stop;
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
# already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i]->{run}->finish;
+}
done_testing();
--
2.51.0
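
The error-message change in wait.c above looks up a noun/verb pair per wait type and substitutes both into one shared format string. Below is a minimal standalone sketch of that idea, not the patch's code: the `Demo*` names and `demo_format_timeout` are hypothetical stand-ins, and the LSN is split by hand into the high and low 32 bits that `%X/%08X` expects.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's wait-type enum and descriptor. */
typedef enum
{
	DEMO_WAIT_REPLAY,
	DEMO_WAIT_WRITE,
	DEMO_WAIT_FLUSH,
	DEMO_WAIT_COUNT
} DemoWaitType;

typedef struct
{
	const char *noun;			/* "replay", "write", "flush" */
	const char *verb;			/* "replayed", "written", "flushed" */
} DemoWaitDesc;

/* Designated initializers keep each entry tied to its enum value. */
static const DemoWaitDesc demo_descs[] = {
	[DEMO_WAIT_REPLAY] = {"replay", "replayed"},
	[DEMO_WAIT_WRITE] = {"write", "written"},
	[DEMO_WAIT_FLUSH] = {"flush", "flushed"},
};

/*
 * Build the timeout message for one wait type.  Returns what snprintf
 * returns: the length the full message needs, excluding the terminator.
 */
static int
demo_format_timeout(char *buf, size_t len, DemoWaitType type,
					uint64_t target, uint64_t current)
{
	const DemoWaitDesc *desc;

	assert((int) type >= 0 && type < DEMO_WAIT_COUNT);
	desc = &demo_descs[type];

	return snprintf(buf, len,
					"timed out while waiting for target LSN %X/%08X to be %s; "
					"current %s LSN %X/%08X",
					(unsigned int) (target >> 32), (unsigned int) target,
					desc->verb, desc->noun,
					(unsigned int) (current >> 32), (unsigned int) current);
}
```

Called with `DEMO_WAIT_FLUSH`, the sketch produces the "to be flushed; current flush LSN" variant of the message, so adding a wait type only needs one new table row.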
Attachment: v6-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-paramete.patch (application/octet-stream)
From 94d36b07298fa2a46d26623c08a269cc6db6461a Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v6 3/4] Add tab completion for WAIT FOR LSN MODE parameter
Update psql tab completion to support the optional MODE parameter in
WAIT FOR LSN command. After specifying an LSN value, completion now
offers both MODE and WITH keywords since MODE defaults to REPLAY.
---
src/bin/psql/tab-complete.in.c | 39 ++++++++++++++++++++++++----------
1 file changed, 28 insertions(+), 11 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b1ff6f6cd94..8f269b5cb13 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5327,10 +5327,11 @@ match_previous_words(int pattern_id,
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_vacuumables);
/*
- * WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
+ * WAIT FOR LSN '<lsn>' [ MODE { REPLAY | FLUSH | WRITE } ] [ WITH ( option [, ...] ) ]
* where option can be:
* TIMEOUT '<timeout>'
* NO_THROW
+ * MODE defaults to REPLAY if not specified.
*/
else if (Matches("WAIT"))
COMPLETE_WITH("FOR");
@@ -5339,25 +5340,41 @@ match_previous_words(int pattern_id,
else if (Matches("WAIT", "FOR", "LSN"))
/* No completion for LSN value - user must provide manually */
;
+
+ /*
+ * After LSN value, offer MODE (optional) or WITH, since MODE defaults to
+ * REPLAY
+ */
else if (Matches("WAIT", "FOR", "LSN", MatchAny))
+ COMPLETE_WITH("MODE", "WITH");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE"))
+ COMPLETE_WITH("REPLAY", "FLUSH", "WRITE");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny))
COMPLETE_WITH("WITH");
+ /* WITH directly after LSN (using default REPLAY mode) */
else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
COMPLETE_WITH("(");
+ else if (Matches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH"))
+ COMPLETE_WITH("(");
+
+ /*
+ * Handle parenthesized option list (both with and without explicit MODE).
+ * This fires when we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as one
+ * word, so the tests below are correct. timeout takes a string value,
+ * no_throw takes no value. We don't offer completions for these values.
+ */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
- /*
- * This fires if we're in an unfinished parenthesized option list.
- * get_previous_words treats a completed parenthesized option list as
- * one word, so the above test is correct.
- */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("timeout", "no_throw");
-
- /*
- * timeout takes a string value, no_throw takes no value. We don't
- * offer completions for these values.
- */
+ }
+ else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*") &&
+ !HeadMatches("WAIT", "FOR", "LSN", MatchAny, "MODE", MatchAny, "WITH", "(*)"))
+ {
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("timeout", "no_throw");
}
/* WITH [RECURSIVE] */
--
2.51.0
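For readers following the v6 syntax, the keyword chain that the completion rules in this patch walk through can be modelled outside psql. The sketch below is a standalone C model, not psql code: after_lsn() is a hypothetical helper that takes the words typed after WAIT FOR LSN and returns the candidates those rules would offer as one space-separated string. It compares case-sensitively for brevity, whereas psql's Matches() is case-insensitive.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the v6 completion chain.  w[] holds the words
 * typed after "WAIT FOR LSN"; the result lists the offered candidates,
 * or "" when nothing is offered.
 */
const char *
after_lsn(const char *const *w, size_t n)
{
    if (n == 0)
        return "";                      /* the LSN value is typed by hand */
    if (n == 1)
        return "MODE WITH";             /* w[0] is the LSN value */

    /* WITH directly after the LSN value: default REPLAY mode */
    if (strcmp(w[1], "WITH") == 0)
        return (n == 2) ? "(" : "";

    if (strcmp(w[1], "MODE") != 0)
        return "";
    if (n == 2)
        return "REPLAY FLUSH WRITE";
    if (n == 3)
        return "WITH";                  /* w[2] is the mode value */
    if (n == 4 && strcmp(w[3], "WITH") == 0)
        return "(";
    return "";
}
```

The duplicated "WITH" and "(" branches are what the separate MODE keyword costs in the completion code, since every rule has to exist once with and once without the MODE clause.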
v6-0004-Use-WAIT-FOR-LSN-in.patch
From 7265330a02c5d966ef42cce3f9c15f4acae37ff4 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v6 4/4] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.
The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 35 ++++++++++++++++++++++--
1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..eec8233b515 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3335,6 +3335,9 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+ chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
- chomp($isrecovery);
if ($isrecovery eq 't')
{
$target_lsn = $self->lsn('replay');
@@ -3367,6 +3367,35 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+ # This is more efficient than polling pg_stat_replication
+ if (($mode ne 'sent') && ($isrecovery eq 't'))
+ {
+ my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+ # Map mode names to WAIT FOR LSN MODE values (uppercase)
+ my $wait_mode = uc($mode);
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' MODE ${wait_mode} WITH (timeout '${timeout}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+ chomp($output);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+${output});
+ diag qq(Last pg_stat_replication contents:
+${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
+ # Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.51.0
On Oct 4, 2025, at 09:35, Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Sun, Sep 28, 2025 at 5:02 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi,
On Fri, Sep 26, 2025 at 7:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Álvaro,
Thanks for your review.
On Tue, Sep 16, 2025 at 4:24 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Sep-15, Alexander Korotkov wrote:
It's LGTM. The same pattern is observed in VACUUM, EXPLAIN, and CREATE
PUBLICATION - all use minimal grammar rules that produce generic
option lists, with the actual interpretation done in their respective
implementation files. The moderate complexity in wait.c seems
acceptable.
Actually I find the code in ExecWaitStmt pretty unusual. We tend to use
lists of DefElem (a name optionally followed by a value) instead of
individual scattered elements that must later be matched up. Why not
use utility_option_list instead and then loop on the list of DefElems?
It'd be a lot simpler.
I took a look at commands like VACUUM and EXPLAIN and they do follow
this pattern. v11 will make use of utility_option_list.
Also, we've found that failing to surround the options by parens leads
to pain down the road, so maybe add that. Given that the LSN seems to
be mandatory, maybe make it something like
WAIT FOR LSN 'xy/zzy' [ WITH ( utility_option_list ) ]
This requires that you make LSN a keyword, albeit unreserved. Or you
could make it
WAIT FOR Ident [the rest]
and then ensure in C that the identifier matches the word LSN, such as
we do for "permissive" and "restrictive" in
RowSecurityDefaultPermissive.
Shall make LSN an unreserved keyword as well.
Here's the updated v11. Many thanks Jian for off-list discussions and review.
v12 removed unused +WaitStmt +WaitStmtParam in pgindent/typedefs.list.
Best,
Xuneng
<v12-0001-Implement-WAIT-FOR-command.patch>
I just tried to review v12 but failed to “git am”. Can you please rebase the change?
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Hi,
On Tue, Dec 16, 2025 at 1:49 PM Chao Li <li.evan.chao@gmail.com> wrote:
<v12-0001-Implement-WAIT-FOR-command.patch>
I just tried to review v12 but failed to “git am”. Can you please rebase the change?
Thanks for looking into this.
That series of patches implementing the WAIT FOR REPLAY command was
applied last month (8af3ae0d, 447aae13, 3b4e53a0, a1f7f91b) in its
version 20. The current v6 patch set [1][2] primarily extends the
WAIT FOR functionality to support waiting for flush and write LSNs on
a replica by adding a MODE parameter [3]. This made me wonder whether
it would be more appropriate to start a new thread for the extension,
though it is still part of the same WAIT FOR command.
[1]: https://commitfest.postgresql.org/patch/6265/
[2]: /messages/by-id/CABPTF7XKti620ZAOXPGuhSZxCKyaV_9stq7ruhnuxvshUxCeRQ@mail.gmail.com
[3]: /messages/by-id/CAPpHfdt4b0wBC4+Oopp_eFQnNjDvxwQLrQ1r4GMJfCY0XWP0dA@mail.gmail.com
--
Best,
Xuneng
Hi, Xuneng!
On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.
Thank you for your work on this patchset. Generally, it looks like
a good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?
------
Regards,
Alexander Korotkov
Hi Alexander,
On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Xuneng!
On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.
Thank you for your work on this patchset. Generally, it looks like
a good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?
Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:
-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');
-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');
-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');
If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.
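The defaulting rule proposed here can be sketched in C. This is a minimal standalone model, not the code from the patch: Option, WaitMode and resolve_wait_mode() are hypothetical names, and option names are assumed to arrive already lower-cased.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { MODE_REPLAY, MODE_WRITE, MODE_FLUSH } WaitMode;

/* One "name [value]" pair from the WITH ( ... ) list (hypothetical type). */
typedef struct
{
    const char *name;
    const char *value;          /* NULL when the option has no value */
} Option;

/*
 * Resolve the wait mode from an option list.  Returns false for an
 * unknown, missing or repeated mode value.  Rejecting a repeated mode is
 * a choice made for this sketch only.
 */
bool
resolve_wait_mode(const Option *opts, size_t nopts, WaitMode *mode)
{
    bool        seen = false;

    *mode = MODE_REPLAY;        /* default when MODE is not given */

    for (size_t i = 0; i < nopts; i++)
    {
        if (strcmp(opts[i].name, "mode") != 0)
            continue;           /* timeout, no_throw: handled elsewhere */
        if (seen || opts[i].value == NULL)
            return false;
        seen = true;

        if (strcmp(opts[i].value, "replay") == 0)
            *mode = MODE_REPLAY;
        else if (strcmp(opts[i].value, "write") == 0)
            *mode = MODE_WRITE;
        else if (strcmp(opts[i].value, "flush") == 0)
            *mode = MODE_FLUSH;
        else
            return false;
    }
    return true;
}
```

Keeping the mode inside the generic option list means the grammar needs no new keywords; validation of the value moves into C.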
--
Best,
Xuneng
On Thu, Dec 18, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Xuneng!
On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.
Thank you for your work on this patchset. Generally, it looks like
a good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?
Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:
-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');
-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');
-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');
If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.
This is exactly what I meant. Please, go ahead.
------
Regards,
Alexander Korotkov
Supabase
Hi,
On Thu, Dec 18, 2025 at 8:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Dec 18, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Xuneng!
On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.
Thank you for your work on this patchset. Generally, it looks like
a good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?
Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:
-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');
-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');
-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');
If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.
This is exactly what I meant. Please, go ahead.
Here is the updated patch set (v7). Please check.
--
Best,
Xuneng
Attachments:
v7-0001-Extend-xlogwait-infrastructure-with-write-and-flu.patch
From bbf69248589db7056b05ab996ec1831aa7fbb2b5 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:21:36 +0800
Subject: [PATCH v7 1/4] Extend xlogwait infrastructure with write and flush
wait types
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.
Key changes:
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType
- Add GetCurrentLSNForWaitType() to retrieve current LSN for each wait type
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally
---
src/backend/access/transam/xlog.c | 2 +-
src/backend/access/transam/xlogrecovery.c | 4 +-
src/backend/access/transam/xlogwait.c | 81 ++++++++++++++-----
src/backend/commands/wait.c | 2 +-
.../utils/activity/wait_event_names.txt | 3 +-
src/include/access/xlogwait.h | 14 +++-
6 files changed, 78 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a5640df51a..a6e348f2109 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6241,7 +6241,7 @@ StartupXLOG(void)
* Wake up all waiters for replay LSN. They need to report an error that
* recovery was ended before reaching the target LSN.
*/
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ae2398d6975..01ffe30ffee 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1846,8 +1846,8 @@ PerformWalRecovery(void)
*/
if (waitLSNState &&
(XLogRecoveryCtl->lastReplayedEndRecPtr >=
- pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_REPLAY])))
- WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_REPLAY])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, XLogRecoveryCtl->lastReplayedEndRecPtr);
/* Else, try to fetch the next WAL record */
record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
diff --git a/src/backend/access/transam/xlogwait.c b/src/backend/access/transam/xlogwait.c
index 6109381c0f0..d54b2fd7ae4 100644
--- a/src/backend/access/transam/xlogwait.c
+++ b/src/backend/access/transam/xlogwait.c
@@ -12,25 +12,30 @@
* This file implements waiting for WAL operations to reach specific LSNs
* on both physical standby and primary servers. The core idea is simple:
* every process that wants to wait publishes the LSN it needs to the
- * shared memory, and the appropriate process (startup on standby, or
- * WAL writer/backend on primary) wakes it once that LSN has been reached.
+ * shared memory, and the appropriate process (startup on standby,
+ * walreceiver on standby, or WAL writer/backend on primary) wakes it
+ * once that LSN has been reached.
*
* The shared memory used by this module comprises a procInfos
* per-backend array with the information of the awaited LSN for each
* of the backend processes. The elements of that array are organized
- * into a pairing heap waitersHeap, which allows for very fast finding
- * of the least awaited LSN.
+ * into pairing heaps (waitersHeap), one for each WaitLSNType, which
+ * allows for very fast finding of the least awaited LSN for each type.
*
- * In addition, the least-awaited LSN is cached as minWaitedLSN. The
- * waiter process publishes information about itself to the shared
- * memory and waits on the latch until it is woken up by the appropriate
- * process, standby is promoted, or the postmaster dies. Then, it cleans
- * information about itself in the shared memory.
+ * In addition, the least-awaited LSN for each type is cached in the
+ * minWaitedLSN array. The waiter process publishes information about
+ * itself to the shared memory and waits on the latch until it is woken
+ * up by the appropriate process, standby is promoted, or the postmaster
+ * dies. Then, it cleans information about itself in the shared memory.
*
- * On standby servers: After replaying a WAL record, the startup process
- * first performs a fast path check minWaitedLSN > replayLSN. If this
- * check is negative, it checks waitersHeap and wakes up the backend
- * whose awaited LSNs are reached.
+ * On standby servers:
+ * - After replaying a WAL record, the startup process performs a fast
+ * path check minWaitedLSN[REPLAY] > replayLSN. If this check is
+ * negative, it checks waitersHeap[REPLAY] and wakes up the backends
+ * whose awaited LSNs are reached.
+ * - After receiving WAL, the walreceiver process performs similar checks
+ * against the flush and write LSNs, waking up waiters in the FLUSH
+ * and WRITE heaps respectively.
*
* On primary servers: After flushing WAL, the WAL writer or backend
* process performs a similar check against the flush LSN and wakes up
@@ -49,6 +54,7 @@
#include "access/xlogwait.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/walreceiver.h"
#include "storage/latch.h"
#include "storage/proc.h"
#include "storage/shmem.h"
@@ -62,6 +68,45 @@ static int waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
struct WaitLSNState *waitLSNState = NULL;
+/*
+ * Wait event for each WaitLSNType, used with WaitLatch() to report
+ * the wait in pg_stat_activity.
+ */
+static const uint32 WaitLSNWaitEvents[] = {
+ [WAIT_LSN_TYPE_STANDBY_REPLAY] = WAIT_EVENT_WAIT_FOR_WAL_REPLAY,
+ [WAIT_LSN_TYPE_STANDBY_WRITE] = WAIT_EVENT_WAIT_FOR_WAL_WRITE,
+ [WAIT_LSN_TYPE_STANDBY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+ [WAIT_LSN_TYPE_PRIMARY_FLUSH] = WAIT_EVENT_WAIT_FOR_WAL_FLUSH,
+};
+
+StaticAssertDecl(lengthof(WaitLSNWaitEvents) == WAIT_LSN_TYPE_COUNT,
+ "WaitLSNWaitEvents must match WaitLSNType enum");
+
+/*
+ * Get the current LSN for the specified wait type.
+ */
+XLogRecPtr
+GetCurrentLSNForWaitType(WaitLSNType lsnType)
+{
+ switch (lsnType)
+ {
+ case WAIT_LSN_TYPE_STANDBY_REPLAY:
+ return GetXLogReplayRecPtr(NULL);
+
+ case WAIT_LSN_TYPE_STANDBY_WRITE:
+ return GetWalRcvWriteRecPtr();
+
+ case WAIT_LSN_TYPE_STANDBY_FLUSH:
+ return GetWalRcvFlushRecPtr(NULL, NULL);
+
+ case WAIT_LSN_TYPE_PRIMARY_FLUSH:
+ return GetFlushRecPtr(NULL);
+ }
+
+ elog(ERROR, "invalid LSN wait type: %d", lsnType);
+ pg_unreachable();
+}
+
/* Report the amount of shared memory space needed for WaitLSNState. */
Size
WaitLSNShmemSize(void)
@@ -341,13 +386,11 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
int rc;
long delay_ms = -1;
- if (lsnType == WAIT_LSN_TYPE_REPLAY)
- currentLSN = GetXLogReplayRecPtr(NULL);
- else
- currentLSN = GetFlushRecPtr(NULL);
+ /* Get current LSN for the wait type */
+ currentLSN = GetCurrentLSNForWaitType(lsnType);
/* Check that recovery is still in-progress */
- if (lsnType == WAIT_LSN_TYPE_REPLAY && !RecoveryInProgress())
+ if (lsnType != WAIT_LSN_TYPE_PRIMARY_FLUSH && !RecoveryInProgress())
{
/*
* Recovery was ended, but check if target LSN was already
@@ -376,7 +419,7 @@ WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN, int64 timeout)
CHECK_FOR_INTERRUPTS();
rc = WaitLatch(MyLatch, wake_events, delay_ms,
- (lsnType == WAIT_LSN_TYPE_REPLAY) ? WAIT_EVENT_WAIT_FOR_WAL_REPLAY : WAIT_EVENT_WAIT_FOR_WAL_FLUSH);
+ WaitLSNWaitEvents[lsnType]);
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index a37bddaefb2..dd2570cb787 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -140,7 +140,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c0632bf901a..05bd4376c67 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -89,8 +89,9 @@ LIBPQWALRECEIVER_CONNECT "Waiting in WAL receiver to establish connection to rem
LIBPQWALRECEIVER_RECEIVE "Waiting in WAL receiver to receive data from remote server."
SSL_OPEN_SERVER "Waiting for SSL while attempting connection."
WAIT_FOR_STANDBY_CONFIRMATION "Waiting for WAL to be received and flushed by the physical standby."
-WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary."
+WAIT_FOR_WAL_FLUSH "Waiting for WAL flush to reach a target LSN on a primary or standby."
WAIT_FOR_WAL_REPLAY "Waiting for WAL replay to reach a target LSN on a standby."
+WAIT_FOR_WAL_WRITE "Waiting for WAL write to reach a target LSN on a standby."
WAL_SENDER_WAIT_FOR_WAL "Waiting for WAL to be flushed in WAL sender process."
WAL_SENDER_WRITE_DATA "Waiting for any activity when processing replies from WAL receiver in WAL sender process."
diff --git a/src/include/access/xlogwait.h b/src/include/access/xlogwait.h
index 3e8fcbd9177..4cf13f0ccb3 100644
--- a/src/include/access/xlogwait.h
+++ b/src/include/access/xlogwait.h
@@ -1,7 +1,7 @@
/*-------------------------------------------------------------------------
*
* xlogwait.h
- * Declarations for LSN replay waiting routines.
+ * Declarations for WAL flush, write, and replay waiting routines.
*
* Copyright (c) 2025, PostgreSQL Global Development Group
*
@@ -35,11 +35,16 @@ typedef enum
*/
typedef enum WaitLSNType
{
- WAIT_LSN_TYPE_REPLAY, /* Waiting for replay on standby */
- WAIT_LSN_TYPE_FLUSH, /* Waiting for flush on primary */
+ /* Standby wait types (walreceiver/startup wakes) */
+ WAIT_LSN_TYPE_STANDBY_REPLAY,
+ WAIT_LSN_TYPE_STANDBY_WRITE,
+ WAIT_LSN_TYPE_STANDBY_FLUSH,
+
+ /* Primary wait types (WAL writer/backends wake) */
+ WAIT_LSN_TYPE_PRIMARY_FLUSH,
} WaitLSNType;
-#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_FLUSH + 1)
+#define WAIT_LSN_TYPE_COUNT (WAIT_LSN_TYPE_PRIMARY_FLUSH + 1)
/*
* WaitLSNProcInfo - the shared memory structure representing information
@@ -97,6 +102,7 @@ extern PGDLLIMPORT WaitLSNState *waitLSNState;
extern Size WaitLSNShmemSize(void);
extern void WaitLSNShmemInit(void);
+extern XLogRecPtr GetCurrentLSNForWaitType(WaitLSNType lsnType);
extern void WaitLSNWakeup(WaitLSNType lsnType, XLogRecPtr currentLSN);
extern void WaitLSNCleanup(void);
extern WaitLSNResult WaitForLSN(WaitLSNType lsnType, XLogRecPtr targetLSN,
--
2.51.0
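The per-type bookkeeping described in the xlogwait.c header comment can be illustrated with a small standalone model. It is a sketch under simplifying assumptions, not the patch's implementation: the pairing heaps and atomics are replaced by a plain array holding only the cached minimum, UINT64_MAX is assumed to mean that nobody is waiting, and all identifiers (WaitType, register_waiter(), must_scan_waiters(), wait_event_name()) are made up for the example. It shows the two patterns the patch relies on: an enum-indexed table guarded by a static assertion, and the cheap fast-path comparison made before any waiter list is touched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

typedef enum
{
    WAIT_STANDBY_REPLAY,
    WAIT_STANDBY_WRITE,
    WAIT_STANDBY_FLUSH,
    WAIT_PRIMARY_FLUSH
} WaitType;

#define WAIT_TYPE_COUNT (WAIT_PRIMARY_FLUSH + 1)
#define LSN_NONE UINT64_MAX     /* assumption: "no waiter registered" */

/* One entry per wait type; both flush types share an event name. */
static const char *const wait_event_names[] = {
    [WAIT_STANDBY_REPLAY] = "WAIT_FOR_WAL_REPLAY",
    [WAIT_STANDBY_WRITE] = "WAIT_FOR_WAL_WRITE",
    [WAIT_STANDBY_FLUSH] = "WAIT_FOR_WAL_FLUSH",
    [WAIT_PRIMARY_FLUSH] = "WAIT_FOR_WAL_FLUSH",
};

/* Fails to compile if an enum value is added without a table entry. */
_Static_assert(sizeof(wait_event_names) / sizeof(wait_event_names[0]) ==
               (size_t) WAIT_TYPE_COUNT,
               "wait_event_names must match WaitType");

/* Cached least-awaited LSN, tracked separately for each wait type. */
static Lsn  min_waited[WAIT_TYPE_COUNT] = {
    LSN_NONE, LSN_NONE, LSN_NONE, LSN_NONE
};

/* A waiter publishes its target; only the minimum is kept in this model. */
void
register_waiter(WaitType type, Lsn target)
{
    if (target < min_waited[type])
        min_waited[type] = target;
}

/* Fast path: waiters need a closer look only once the minimum is reached. */
bool
must_scan_waiters(WaitType type, Lsn current)
{
    return current >= min_waited[type];
}

const char *
wait_event_name(WaitType type)
{
    return wait_event_names[type];
}
```

Because each type has its own minimum, a flush waiter never forces the replay path to inspect its waiters, and vice versa.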
v7-0004-Use-WAIT-FOR-LSN-in.patch
From 9dde4e330844d827f783ab2caca505036ac884b0 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:03:23 +0800
Subject: [PATCH v7 4/4] Use WAIT FOR LSN in
PostgreSQL::Test::Cluster::wait_for_catchup()
Replace polling-based catchup waiting with WAIT FOR LSN command when
running on a standby server. This is more efficient than repeatedly
querying pg_stat_replication as the WAIT FOR command uses the latch-
based wakeup mechanism.
The optimization applies when:
- The node is in recovery (standby server)
- The mode is 'replay', 'write', or 'flush' (not 'sent')
For 'sent' mode or when running on a primary, the function falls back
to the original polling approach since WAIT FOR LSN is only available
during recovery.
---
src/test/perl/PostgreSQL/Test/Cluster.pm | 33 +++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 295988b8b87..276350c5f13 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3335,6 +3335,9 @@ sub wait_for_catchup
$mode = defined($mode) ? $mode : 'replay';
my %valid_modes =
('sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1);
+ my $isrecovery =
+ $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+ chomp($isrecovery);
croak "unknown mode $mode for 'wait_for_catchup', valid modes are "
. join(', ', keys(%valid_modes))
unless exists($valid_modes{$mode});
@@ -3347,9 +3350,6 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
- my $isrecovery =
- $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
- chomp($isrecovery);
if ($isrecovery eq 't')
{
$target_lsn = $self->lsn('replay');
@@ -3367,6 +3367,33 @@ sub wait_for_catchup
. $self->name . "\n";
# Before release 12 walreceiver just set the application name to
# "walreceiver"
+
+ # Use WAIT FOR LSN when in recovery for supported modes (replay, write, flush)
+ # This is more efficient than polling pg_stat_replication
+ if (($mode ne 'sent') && ($isrecovery eq 't'))
+ {
+ my $timeout = $PostgreSQL::Test::Utils::timeout_default;
+ my $query =
+ qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${mode}', timeout '${timeout}s', no_throw);];
+ my $output = $self->safe_psql('postgres', $query);
+ chomp($output);
+
+ if ($output ne 'success')
+ {
+ # Fetch additional detail for debugging purposes
+ $query = qq[SELECT * FROM pg_catalog.pg_stat_replication];
+ my $details = $self->safe_psql('postgres', $query);
+ diag qq(WAIT FOR LSN failed with status:
+${output});
+ diag qq(Last pg_stat_replication contents:
+${details});
+ croak "failed waiting for catchup";
+ }
+ print "done\n";
+ return;
+ }
+
+ # Polling for 'sent' mode or when not in recovery (WAIT FOR LSN not applicable)
my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('$standby_name', 'walreceiver')];
--
2.51.0
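The decision made in wait_for_catchup() above reduces to two steps: decide whether WAIT FOR LSN is applicable, then build the statement. A rough C rendering of that logic follows. The patch itself is Perl; can_use_wait_for() and format_wait_query() are hypothetical helpers written only to make the rule explicit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * WAIT FOR LSN is used only on a node in recovery and only for the
 * replay, write and flush modes; 'sent' keeps the polling query.
 */
bool
can_use_wait_for(const char *mode, bool in_recovery)
{
    return in_recovery && strcmp(mode, "sent") != 0;
}

/*
 * Build the v7-style statement, with the mode inside the WITH clause.
 * Returns the snprintf() result, so a value >= buflen means truncation.
 */
int
format_wait_query(char *buf, size_t buflen, const char *target_lsn,
                  const char *mode, int timeout_s)
{
    return snprintf(buf, buflen,
                    "WAIT FOR LSN '%s' WITH (MODE '%s', timeout '%ds', no_throw);",
                    target_lsn, mode, timeout_s);
}
```

With no_throw in the option list, the caller compares the returned status against 'success' instead of catching an error, which is how the Perl code above decides whether to emit diagnostics.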
v7-0003-Add-tab-completion-for-WAIT-FOR-LSN-MODE-option.patch
From 62db341638bd9515584f9c24b0adfeec61ada252 Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 11:00:25 +0800
Subject: [PATCH v7 3/4] Add tab completion for WAIT FOR LSN MODE option
Update psql tab completion to support the MODE option in WAIT FOR LSN
command's WITH clause. After typing 'mode' inside the parenthesized
option list, completion offers the valid mode values: 'replay', 'write',
and 'flush'.
---
src/bin/psql/tab-complete.in.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b1ff6f6cd94..5cb8de14e8e 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5329,8 +5329,10 @@ match_previous_words(int pattern_id,
/*
* WAIT FOR LSN '<lsn>' [ WITH ( option [, ...] ) ]
* where option can be:
+ * MODE '<mode>'
* TIMEOUT '<timeout>'
* NO_THROW
+ * and mode can be: replay | write | flush
*/
else if (Matches("WAIT"))
COMPLETE_WITH("FOR");
@@ -5343,21 +5345,21 @@ match_previous_words(int pattern_id,
COMPLETE_WITH("WITH");
else if (Matches("WAIT", "FOR", "LSN", MatchAny, "WITH"))
COMPLETE_WITH("(");
+
+ /*
+ * Handle parenthesized option list. This fires when we're in an
+ * unfinished parenthesized option list. get_previous_words treats a
+ * completed parenthesized option list as one word, so the test below is
+ * correct. mode takes a string value ('replay', 'write', 'flush'),
+ * timeout takes a string value, no_throw takes no value.
+ */
else if (HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*") &&
!HeadMatches("WAIT", "FOR", "LSN", MatchAny, "WITH", "(*)"))
{
- /*
- * This fires if we're in an unfinished parenthesized option list.
- * get_previous_words treats a completed parenthesized option list as
- * one word, so the above test is correct.
- */
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
- COMPLETE_WITH("timeout", "no_throw");
-
- /*
- * timeout takes a string value, no_throw takes no value. We don't
- * offer completions for these values.
- */
+ COMPLETE_WITH("mode", "timeout", "no_throw");
+ else if (TailMatches("mode"))
+ COMPLETE_WITH("'replay'", "'write'", "'flush'");
}
/* WITH [RECURSIVE] */
--
2.51.0
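Inside the unfinished option list, the v7 rule comes down to looking at the previous word. The following standalone C sketch mirrors that rule; option_list_completions() and ends_with_char() are hypothetical stand-ins for the psql macros, and matching is case-sensitive here for brevity.

```c
#include <stddef.h>
#include <string.h>

/* True when the non-empty string s ends with character c. */
static int
ends_with_char(const char *s, char c)
{
    size_t      len = strlen(s);

    return len > 0 && s[len - 1] == c;
}

/*
 * Candidates offered inside "WITH (" before the closing parenthesis,
 * returned as one space-separated string ("" when nothing is offered).
 */
const char *
option_list_completions(const char *prev_wd)
{
    /* Start of the list or after a comma: offer the option names. */
    if (ends_with_char(prev_wd, '(') || ends_with_char(prev_wd, ','))
        return "mode timeout no_throw";

    /* After the mode option name: offer the quoted mode values. */
    if (strcmp(prev_wd, "mode") == 0)
        return "'replay' 'write' 'flush'";

    /* timeout takes a free-form value and no_throw takes none. */
    return "";
}
```

Compared with the v6 completion code, a single branch covers every case because the mode no longer has its own grammar position.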
v7-0002-Add-MODE-option-to-WAIT-FOR-LSN-command.patch
From 5136083fff62902515c82118250034e0ab75cf2f Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Tue, 16 Dec 2025 10:50:22 +0800
Subject: [PATCH v7 2/4] Add MODE option to WAIT FOR LSN command
Extend the WAIT FOR LSN command with an optional MODE option in the
WITH clause that specifies which LSN type to wait for:
WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]
where mode can be:
- 'replay' (default): Wait for WAL to be replayed to the specified LSN
- 'write': Wait for WAL to be written (received) to the specified LSN
- 'flush': Wait for WAL to be flushed to disk at the specified LSN
The default mode is 'replay', matching the original behavior when MODE
is not specified. This follows the pattern used by COPY and EXPLAIN
commands where options are specified as string values in the WITH clause.
The 'write' and 'flush' modes are useful for scenarios where applications
need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete.
Also includes:
- Documentation updates for the new syntax and small refactoring for the existing ones
- Test coverage for all three modes including mixed concurrent waiters
- Wakeup logic in walreceiver for write/flush waiters
---
doc/src/sgml/ref/wait_for.sgml | 182 ++++++++++----
src/backend/access/transam/xlog.c | 6 +-
src/backend/commands/wait.c | 74 +++++-
src/backend/replication/walreceiver.c | 19 ++
src/test/recovery/t/049_wait_for_lsn.pl | 305 ++++++++++++++++++++++--
5 files changed, 508 insertions(+), 78 deletions(-)
diff --git a/doc/src/sgml/ref/wait_for.sgml b/doc/src/sgml/ref/wait_for.sgml
index 3b8e842d1de..122012f5613 100644
--- a/doc/src/sgml/ref/wait_for.sgml
+++ b/doc/src/sgml/ref/wait_for.sgml
@@ -16,17 +16,23 @@ PostgreSQL documentation
<refnamediv>
<refname>WAIT FOR</refname>
- <refpurpose>wait for target <acronym>LSN</acronym> to be replayed, optionally with a timeout</refpurpose>
+ <refpurpose>wait for WAL to reach a target <acronym>LSN</acronym> on a replica</refpurpose>
</refnamediv>
<refsynopsisdiv>
<synopsis>
-WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
+WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>'
+ [ WITH ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<phrase>where <replaceable class="parameter">option</replaceable> can be:</phrase>
+ MODE '<replaceable class="parameter">mode</replaceable>'
TIMEOUT '<replaceable class="parameter">timeout</replaceable>'
NO_THROW
+
+<phrase>and <replaceable class="parameter">mode</replaceable> can be:</phrase>
+
+ replay | write | flush
</synopsis>
</refsynopsisdiv>
@@ -34,20 +40,22 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Description</title>
<para>
- Waits until recovery replays <parameter>lsn</parameter>.
- If no <parameter>timeout</parameter> is specified or it is set to
- zero, this command waits indefinitely for the
- <parameter>lsn</parameter>.
- On timeout, or if the server is promoted before
- <parameter>lsn</parameter> is reached, an error is emitted,
- unless <literal>NO_THROW</literal> is specified in the WITH clause.
- If <parameter>NO_THROW</parameter> is specified, then the command
- doesn't throw errors.
+ Waits until the specified <parameter>lsn</parameter> is reached
+ according to the specified <parameter>mode</parameter>,
+ which determines whether to wait for WAL to be written, flushed, or replayed.
+ If no <parameter>timeout</parameter> is specified or it is set to
+ zero, this command waits indefinitely for the
+ <parameter>lsn</parameter>.
+ On timeout, or if the server is promoted before
+ <parameter>lsn</parameter> is reached, an error is emitted,
+ unless <literal>NO_THROW</literal> is specified in the WITH clause.
+ If <parameter>NO_THROW</parameter> is specified, then the command
+ doesn't throw errors.
</para>
<para>
- The possible return values are <literal>success</literal>,
- <literal>timeout</literal>, and <literal>not in recovery</literal>.
+ The possible return values are <literal>success</literal>,
+ <literal>timeout</literal>, and <literal>not in recovery</literal>.
</para>
</refsect1>
@@ -72,6 +80,52 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
The following parameters are supported:
<variablelist>
+ <varlistentry>
+ <term><literal>MODE</literal> '<replaceable class="parameter">mode</replaceable>'</term>
+ <listitem>
+ <para>
+ Specifies the type of LSN processing to wait for. If not specified,
+ the default is <literal>replay</literal>. The valid modes are:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>replay</literal>: Wait for the LSN to be replayed
+ (applied to the database). After successful completion,
+ <function>pg_last_wal_replay_lsn()</function> will return a
+ value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>flush</literal>: Wait for the WAL containing the LSN
+ to be received from the primary and flushed to disk. This
+ provides a durability guarantee without waiting for the WAL
+ to be applied. After successful completion,
+ <function>pg_last_wal_receive_lsn()</function> will return a
+ value greater than or equal to the target LSN. This value is
+ also available as the <structfield>flushed_lsn</structfield>
+ column in <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>write</literal>: Wait for the WAL containing the LSN
+ to be received from the primary and written to disk, but not
+ yet flushed. This is faster than <literal>flush</literal> but
+ provides weaker durability guarantees since the data may still
+ be in operating system buffers. After successful completion, the
+ <structfield>written_lsn</structfield> column in
+ <link linkend="monitoring-pg-stat-wal-receiver-view">
+ <structname>pg_stat_wal_receiver</structname></link> will show
+ a value greater than or equal to the target LSN.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>TIMEOUT</literal> '<replaceable class="parameter">timeout</replaceable>'</term>
<listitem>
@@ -135,9 +189,12 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<listitem>
<para>
This return value denotes that the database server is not in a recovery
- state. This might mean either the database server was not in recovery
- at the moment of receiving the command, or it was promoted before
- reaching the target <parameter>lsn</parameter>.
+ state. This might mean either the database server was not in recovery
+ at the moment of receiving the command (i.e., executed on a primary),
+ or it was promoted before reaching the target <parameter>lsn</parameter>.
+ In the promotion case, this status indicates a timeline change occurred,
+ and the application should re-evaluate whether the target LSN is still
+ relevant.
</para>
</listitem>
</varlistentry>
@@ -148,25 +205,33 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Notes</title>
<para>
- <command>WAIT FOR</command> command waits till
- <parameter>lsn</parameter> to be replayed on standby.
- That is, after this command execution, the value returned by
- <function>pg_last_wal_replay_lsn</function> should be greater or equal
- to the <parameter>lsn</parameter> value. This is useful to achieve
- read-your-writes-consistency, while using async replica for reads and
- primary for writes. In that case, the <acronym>lsn</acronym> of the last
- modification should be stored on the client application side or the
- connection pooler side.
+ <command>WAIT FOR</command> waits until the specified
+ <parameter>lsn</parameter> is reached according to the specified
+ <parameter>mode</parameter>. The <literal>REPLAY</literal> mode waits
+ for the LSN to be replayed (applied to the database), which is useful
+ to achieve read-your-writes consistency while using an async replica
+ for reads and the primary for writes. The <literal>FLUSH</literal> mode
+ waits for the WAL to be flushed to durable storage on the replica,
+ providing a durability guarantee without waiting for replay. The
+ <literal>WRITE</literal> mode waits for the WAL to be written to the
+ operating system, which is faster than flush but provides weaker
+ durability guarantees. In all cases, the <acronym>LSN</acronym> of the
+ last modification should be stored on the client application side or
+ the connection pooler side.
</para>
<para>
- <command>WAIT FOR</command> command should be called on standby.
- If a user runs <command>WAIT FOR</command> on primary, it
- will error out unless <parameter>NO_THROW</parameter> is specified in the WITH clause.
- However, if <command>WAIT FOR</command> is
- called on primary promoted from standby and <literal>lsn</literal>
- was already replayed, then the <command>WAIT FOR</command> command just
- exits immediately.
+ <command>WAIT FOR</command> should be called on a standby.
+ If a user runs <command>WAIT FOR</command> on the primary, it
+ will error out unless <parameter>NO_THROW</parameter> is specified
+ in the WITH clause. However, if <command>WAIT FOR</command> is
+ called on a primary promoted from standby and <literal>lsn</literal>
+ was already reached, then the <command>WAIT FOR</command> command
+ just exits immediately. If the replica is promoted while waiting,
+ the command will return <literal>not in recovery</literal> (or throw
+ an error if <literal>NO_THROW</literal> is not specified). Promotion
+ creates a new timeline, and the LSN being waited for may refer to
+ WAL from the old timeline.
</para>
</refsect1>
@@ -175,21 +240,21 @@ WAIT FOR LSN '<replaceable class="parameter">lsn</replaceable>' [ WITH ( <replac
<title>Examples</title>
<para>
- You can use <command>WAIT FOR</command> command to wait for
- the <type>pg_lsn</type> value. For example, an application could update
- the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
- changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
- on primary server to get the <acronym>lsn</acronym> given that
- <varname>synchronous_commit</varname> could be set to
- <literal>off</literal>.
+ You can use <command>WAIT FOR</command> command to wait for
+ the <type>pg_lsn</type> value. For example, an application could update
+ the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+ changes just made. This example uses <function>pg_current_wal_insert_lsn</function>
+ on primary server to get the <acronym>lsn</acronym> given that
+ <varname>synchronous_commit</varname> could be set to
+ <literal>off</literal>.
<programlisting>
postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
UPDATE 100
postgres=# SELECT pg_current_wal_insert_lsn();
-pg_current_wal_insert_lsn
---------------------
-0/306EE20
+ pg_current_wal_insert_lsn
+---------------------------
+ 0/306EE20
(1 row)
</programlisting>
@@ -200,7 +265,7 @@ pg_current_wal_insert_lsn
<programlisting>
postgres=# WAIT FOR LSN '0/306EE20';
status
---------
+---------
success
(1 row)
postgres=# SELECT * FROM movie WHERE genre = 'Drama';
@@ -211,7 +276,31 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
</para>
<para>
- If the target LSN is not reached before the timeout, the error is thrown.
+ Wait for flush (data durable on replica):
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'flush');
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ Wait for write with timeout:
+
+<programlisting>
+postgres=# WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '100ms', NO_THROW);
+ status
+---------
+ success
+(1 row)
+</programlisting>
+ </para>
+
+ <para>
+ If the target LSN is not reached before the timeout, an error is thrown:
<programlisting>
postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '0.1s');
@@ -221,11 +310,12 @@ ERROR: timed out while waiting for target LSN 0/306EE20 to be replayed; current
<para>
The same example uses <command>WAIT FOR</command> with
- <parameter>NO_THROW</parameter> option.
+ <parameter>NO_THROW</parameter> option:
+
<programlisting>
postgres=# WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '100ms', NO_THROW);
status
---------
+---------
timeout
(1 row)
</programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a6e348f2109..5c6f9feeccc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6238,10 +6238,12 @@ StartupXLOG(void)
LWLockRelease(ControlFileLock);
/*
- * Wake up all waiters for replay LSN. They need to report an error that
- * recovery was ended before reaching the target LSN.
+ * Wake up all waiters. They need to report an error that recovery was
+ * ended before reaching the target LSN.
*/
WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_REPLAY, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, InvalidXLogRecPtr);
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, InvalidXLogRecPtr);
/*
* Shutdown the recovery environment. This must occur after
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
index dd2570cb787..b3f1f7b8a69 100644
--- a/src/backend/commands/wait.c
+++ b/src/backend/commands/wait.c
@@ -2,7 +2,7 @@
*
* wait.c
* Implements WAIT FOR, which allows waiting for events such as
- * time passing or LSN having been replayed on replica.
+ * time passing or LSN having been replayed, flushed, or written.
*
* Portions Copyright (c) 2025, PostgreSQL Global Development Group
*
@@ -15,6 +15,7 @@
#include <math.h>
+#include "access/xlog.h"
#include "access/xlogrecovery.h"
#include "access/xlogwait.h"
#include "commands/defrem.h"
@@ -28,18 +29,35 @@
#include "utils/snapmgr.h"
+/*
+ * Type descriptor for WAIT FOR LSN wait types, used for error messages.
+ */
+typedef struct WaitLSNTypeDesc
+{
+ const char *noun; /* "replay", "flush", "write" */
+ const char *verb; /* "replayed", "flushed", "written" */
+} WaitLSNTypeDesc;
+
+static const WaitLSNTypeDesc WaitLSNTypeDescs[] = {
+ [WAIT_LSN_TYPE_STANDBY_REPLAY] = {"replay", "replayed"},
+ [WAIT_LSN_TYPE_STANDBY_WRITE] = {"write", "written"},
+ [WAIT_LSN_TYPE_STANDBY_FLUSH] = {"flush", "flushed"},
+};
+
void
ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
{
XLogRecPtr lsn;
int64 timeout = 0;
WaitLSNResult waitLSNResult;
+ WaitLSNType lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY; /* default */
bool throw = true;
TupleDesc tupdesc;
TupOutputState *tstate;
const char *result = "<unset>";
bool timeout_specified = false;
bool no_throw_specified = false;
+ bool mode_specified = false;
/* Parse and validate the mandatory LSN */
lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
@@ -47,7 +65,30 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
foreach_node(DefElem, defel, stmt->options)
{
- if (strcmp(defel->defname, "timeout") == 0)
+ if (strcmp(defel->defname, "mode") == 0)
+ {
+ char *mode_str;
+
+ if (mode_specified)
+ errorConflictingDefElem(defel, pstate);
+ mode_specified = true;
+
+ mode_str = defGetString(defel);
+
+ if (pg_strcasecmp(mode_str, "replay") == 0)
+ lsnType = WAIT_LSN_TYPE_STANDBY_REPLAY;
+ else if (pg_strcasecmp(mode_str, "write") == 0)
+ lsnType = WAIT_LSN_TYPE_STANDBY_WRITE;
+ else if (pg_strcasecmp(mode_str, "flush") == 0)
+ lsnType = WAIT_LSN_TYPE_STANDBY_FLUSH;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("unrecognized value for WAIT option \"%s\": \"%s\"",
+ "MODE", mode_str),
+ parser_errposition(pstate, defel->location)));
+ }
+ else if (strcmp(defel->defname, "timeout") == 0)
{
char *timeout_str;
const char *hintmsg;
@@ -107,8 +148,8 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
}
/*
- * We are going to wait for the LSN replay. We should first care that we
- * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+ * We are going to wait for the LSN. We should first care that we don't
+ * hold a snapshot and correspondingly our MyProc->xmin is invalid.
* Otherwise, our snapshot could prevent the replay of WAL records
* implying a kind of self-deadlock. This is the reason why WAIT FOR is a
* command, not a procedure or function.
@@ -140,7 +181,7 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
*/
Assert(MyProc->xmin == InvalidTransactionId);
- waitLSNResult = WaitForLSN(WAIT_LSN_TYPE_STANDBY_REPLAY, lsn, timeout);
+ waitLSNResult = WaitForLSN(lsnType, lsn, timeout);
/*
* Process the result of WaitForLSN(). Throw appropriate error if needed.
@@ -154,11 +195,18 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_TIMEOUT:
if (throw)
+ {
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
ereport(ERROR,
errcode(ERRCODE_QUERY_CANCELED),
- errmsg("timed out while waiting for target LSN %X/%08X to be replayed; current replay LSN %X/%08X",
+ errmsg("timed out while waiting for target LSN %X/%08X to be %s; current %s LSN %X/%08X",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
+ }
else
result = "timeout";
break;
@@ -166,20 +214,26 @@ ExecWaitStmt(ParseState *pstate, WaitStmt *stmt, DestReceiver *dest)
case WAIT_LSN_RESULT_NOT_IN_RECOVERY:
if (throw)
{
+ const WaitLSNTypeDesc *desc = &WaitLSNTypeDescs[lsnType];
+ XLogRecPtr currentLSN = GetCurrentLSNForWaitType(lsnType);
+
if (PromoteIsTriggered())
{
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errdetail("Recovery ended before replaying target LSN %X/%08X; last replay LSN %X/%08X.",
+ errdetail("Recovery ended before target LSN %X/%08X was %s; last %s LSN %X/%08X.",
LSN_FORMAT_ARGS(lsn),
- LSN_FORMAT_ARGS(GetXLogReplayRecPtr(NULL))));
+ desc->verb,
+ desc->noun,
+ LSN_FORMAT_ARGS(currentLSN)));
}
else
ereport(ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("recovery is not in progress"),
- errhint("Waiting for the replay LSN can only be executed during recovery."));
+ errhint("Waiting for the %s LSN can only be executed during recovery.",
+ desc->noun));
}
else
result = "not in recovery";
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ac802ae85b4..e15c5645b9c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
#include "access/xlogrecovery.h"
+#include "access/xlogwait.h"
#include "catalog/pg_authid.h"
#include "funcapi.h"
#include "libpq/pqformat.h"
@@ -965,6 +966,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, TimeLineID tli)
/* Update shared-memory status */
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ /*
+ * If we wrote an LSN that someone was waiting for then walk over the
+ * shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Write >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_WRITE])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_WRITE, LogstreamResult.Write);
+
/*
* Close the current segment if it's fully written up in the last cycle of
* the loop, to create its archive notification file soon. Otherwise WAL
@@ -1004,6 +1014,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
}
SpinLockRelease(&walrcv->mutex);
+ /*
+ * If we flushed an LSN that someone was waiting for then walk over
+ * the shared memory array and set latches to notify the waiters.
+ */
+ if (waitLSNState &&
+ (LogstreamResult.Flush >=
+ pg_atomic_read_u64(&waitLSNState->minWaitedLSN[WAIT_LSN_TYPE_STANDBY_FLUSH])))
+ WaitLSNWakeup(WAIT_LSN_TYPE_STANDBY_FLUSH, LogstreamResult.Flush);
+
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
diff --git a/src/test/recovery/t/049_wait_for_lsn.pl b/src/test/recovery/t/049_wait_for_lsn.pl
index e0ddb06a2f0..b589cecc028 100644
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -1,4 +1,4 @@
-# Checks waiting for the LSN replay on standby using
+# Checks waiting for the LSN replay/write/flush on standby using
# the WAIT FOR command.
use strict;
use warnings FATAL => 'all';
@@ -7,6 +7,38 @@ use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;
+# Helper functions to control walreceiver for testing wait conditions.
+# These allow us to stop WAL streaming so waiters block, then resume it.
+my $saved_primary_conninfo;
+
+sub stop_walreceiver
+{
+ my ($node) = @_;
+ $saved_primary_conninfo = $node->safe_psql('postgres',
+ "SELECT setting FROM pg_settings WHERE name = 'primary_conninfo'");
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT NOT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
+sub resume_walreceiver
+{
+ my ($node) = @_;
+ $node->safe_psql(
+ 'postgres', qq[
+ ALTER SYSTEM SET primary_conninfo = '$saved_primary_conninfo';
+ SELECT pg_reload_conf();
+ ]);
+
+ $node->poll_query_until('postgres',
+ "SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);");
+}
+
# Initialize primary node
my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(allows_streaming => 1);
@@ -62,7 +94,34 @@ $output = $node_standby->safe_psql(
ok((split("\n", $output))[-1] eq 30,
"standby reached the same LSN as primary");
-# 3. Check that waiting for unreachable LSN triggers the timeout. The
+# 3. Check that WAIT FOR works with WRITE and FLUSH modes.
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn_write =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_write}' WITH (MODE 'write', timeout '1d');
+ SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${lsn_write}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby wrote WAL up to target LSN after WAIT FOR with MODE 'write'");
+
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn_flush =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ WAIT FOR LSN '${lsn_flush}' WITH (MODE 'flush', timeout '1d');
+ SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '${lsn_flush}'::pg_lsn);
+]);
+
+ok((split("\n", $output))[-1] >= 0,
+ "standby flushed WAL up to target LSN after WAIT FOR with MODE 'flush'");
+
+# 4. Check that waiting for unreachable LSN triggers the timeout. The
# unreachable LSN must be well in advance. So WAL records issued by
# the concurrent autovacuum could not affect that.
my $lsn3 =
@@ -88,7 +147,7 @@ $output = $node_standby->safe_psql(
WAIT FOR LSN '${lsn3}' WITH (timeout '10ms', no_throw);]);
ok($output eq "timeout", "WAIT FOR returns correct status after timeout");
-# 4. Check that WAIT FOR triggers an error if called on primary,
+# 5. Check that WAIT FOR triggers an error if called on primary,
# within another function, or inside a transaction with an isolation level
# higher than READ COMMITTED.
@@ -125,7 +184,7 @@ ok( $stderr =~
/WAIT FOR must be only called without an active or registered snapshot/,
"get an error when running within another function");
-# 5. Check parameter validation error cases on standby before promotion
+# 6. Check parameter validation error cases on standby before promotion
my $test_lsn =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
@@ -208,7 +267,23 @@ $node_standby->psql(
ok( $stderr =~ /option "invalid_option" not recognized/,
"get error for invalid WITH clause option");
-# 6. Check the scenario of multiple LSN waiters. We make 5 background
+# Test invalid MODE value
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (MODE 'invalid');",
+ stderr => \$stderr);
+ok($stderr =~ /unrecognized value for WAIT option "MODE": "invalid"/,
+ "get error for invalid MODE value");
+
+# Test duplicate MODE parameter
+$node_standby->psql(
+ 'postgres',
+ "WAIT FOR LSN '${test_lsn}' WITH (MODE 'replay', MODE 'write');",
+ stderr => \$stderr);
+ok( $stderr =~ /conflicting or redundant options/,
+ "get error for duplicate MODE parameter");
+
+# 7a. Check the scenario of multiple REPLAY waiters. We make 5 background
# psql sessions each waiting for a corresponding insertion. When waiting is
# finished, stored procedures logs if there are visible as many rows as
# should be.
@@ -226,7 +301,9 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
\$\$
LANGUAGE plpgsql;
]);
+
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
my @psql_sessions;
for (my $i = 0; $i < 5; $i++)
{
@@ -243,6 +320,7 @@ for (my $i = 0; $i < 5; $i++)
SELECT log_count(${i});
]);
}
+
my $log_offset = -s $node_standby->logfile;
$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
for (my $i = 0; $i < 5; $i++)
@@ -251,23 +329,200 @@ for (my $i = 0; $i < 5; $i++)
$psql_sessions[$i]->quit;
}
-ok(1, 'multiple LSN waiters reported consistent data');
+ok(1, 'multiple REPLAY waiters reported consistent data');
+
+# 7b. Check the scenario of multiple WRITE waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @write_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (100 + ${i});");
+ $write_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start WRITE waiters (they will block since walreceiver is stopped)
+my @write_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $write_sessions[$i] = $node_standby->background_psql('postgres');
+ $write_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$write_lsns[$i]' WITH (MODE 'write', timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'write_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalWrite'"
+);
+
+# Restore walreceiver to unblock waiters
+my $write_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("write_done $i", $write_log_offset);
+ $write_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was written up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '$write_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple WRITE waiters: standby wrote WAL up to target LSN");
+
+# 7c. Check the scenario of multiple FLUSH waiters.
+# Stop walreceiver to ensure waiters actually block.
+stop_walreceiver($node_standby);
+
+# Generate WAL on primary (standby won't receive it yet)
+my @flush_lsns;
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (200 + ${i});");
+ $flush_lsns[$i] =
+ $node_primary->safe_psql('postgres',
+ "SELECT pg_current_wal_insert_lsn()");
+}
+
+# Start FLUSH waiters (they will block since walreceiver is stopped)
+my @flush_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+ $flush_sessions[$i] = $node_standby->background_psql('postgres');
+ $flush_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '$flush_lsns[$i]' WITH (MODE 'flush', timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'flush_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 5 FROM pg_stat_activity WHERE wait_event = 'WaitForWalFlush'"
+);
+
+# Restore walreceiver to unblock waiters
+my $flush_log_offset = -s $node_standby->logfile;
+resume_walreceiver($node_standby);
+
+# Wait for all waiters to complete and close sessions
+for (my $i = 0; $i < 5; $i++)
+{
+ $node_standby->wait_for_log("flush_done $i", $flush_log_offset);
+ $flush_sessions[$i]->quit;
+}
+
+# Verify on standby that WAL was flushed up to the target LSN
+$output = $node_standby->safe_psql('postgres',
+ "SELECT pg_lsn_cmp(pg_last_wal_receive_lsn(), '$flush_lsns[4]'::pg_lsn);"
+);
+
+ok($output >= 0,
+ "multiple FLUSH waiters: standby flushed WAL up to target LSN");
+
+# 7d. Check the scenario of mixed mode waiters (REPLAY, WRITE, FLUSH)
+# running concurrently. We start 6 sessions: 2 for each mode, all waiting
+# for the same target LSN. We stop the walreceiver and pause replay to
+# ensure all waiters block. Then we resume replay and restart the
+# walreceiver to verify they unblock and complete correctly.
+
+# Stop walreceiver first to ensure we can control the flow without hanging
+# (stopping it after pausing replay can hang if the startup process is paused).
+stop_walreceiver($node_standby);
-# 7. Check that the standby promotion terminates the wait on LSN. Start
-# waiting for an unreachable LSN then promote. Check the log for the relevant
-# error message. Also, check that waiting for already replayed LSN doesn't
-# cause an error even after promotion.
+# Pause replay
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+
+# Generate WAL on primary
+$node_primary->safe_psql('postgres',
+ "INSERT INTO wait_test VALUES (generate_series(301, 310));");
+my $mixed_target_lsn =
+ $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+# Start 6 waiters: 2 for each mode
+my @mixed_sessions;
+my @mixed_modes = ('replay', 'write', 'flush');
+for (my $i = 0; $i < 6; $i++)
+{
+ $mixed_sessions[$i] = $node_standby->background_psql('postgres');
+ $mixed_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${mixed_target_lsn}' WITH (MODE '$mixed_modes[$i % 3]', timeout '1d');
+ DO \$\$ BEGIN RAISE LOG 'mixed_done %', $i; END \$\$;
+ ]);
+}
+
+# Verify all waiters are blocked
+$node_standby->poll_query_until('postgres',
+ "SELECT count(*) = 6 FROM pg_stat_activity WHERE wait_event LIKE 'WaitForWal%'"
+);
+
+# Resume replay (waiters should still be blocked as no WAL has arrived)
+my $mixed_log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+$node_standby->poll_query_until('postgres',
+ "SELECT NOT pg_is_wal_replay_paused();");
+
+# Restore walreceiver to allow WAL to arrive
+resume_walreceiver($node_standby);
+
+# Wait for all sessions to complete and close them
+for (my $i = 0; $i < 6; $i++)
+{
+ $node_standby->wait_for_log("mixed_done $i", $mixed_log_offset);
+ $mixed_sessions[$i]->quit;
+}
+
+# Verify all modes reached the target LSN
+$output = $node_standby->safe_psql(
+ 'postgres', qq[
+ SELECT pg_lsn_cmp((SELECT written_lsn FROM pg_stat_wal_receiver), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_receive_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0 AND
+ pg_lsn_cmp(pg_last_wal_replay_lsn(), '${mixed_target_lsn}'::pg_lsn) >= 0;
+]);
+
+ok($output eq 't',
+ "mixed mode waiters: all modes completed and reached target LSN");
+
+# 8. Check that the standby promotion terminates all wait modes. Start
+# waiting for unreachable LSNs with REPLAY, WRITE, and FLUSH modes, then
+# promote. Check the log for the relevant error messages. Also, check that
+# waiting for already replayed LSN doesn't cause an error even after promotion.
my $lsn4 =
$node_primary->safe_psql('postgres',
"SELECT pg_current_wal_insert_lsn() + 10000000000");
+
my $lsn5 =
$node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby->background_psql('postgres');
-$psql_session->query_until(
- qr/start/, qq[
- \\echo start
- WAIT FOR LSN '${lsn4}';
-]);
+
+# Start background sessions waiting for unreachable LSN with all modes
+my @wait_modes = ('replay', 'write', 'flush');
+my @wait_sessions;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i] = $node_standby->background_psql('postgres');
+ $wait_sessions[$i]->query_until(
+ qr/start/, qq[
+ \\echo start
+ WAIT FOR LSN '${lsn4}' WITH (MODE '$wait_modes[$i]');
+ ]);
+}
# Make sure standby will be promoted at least at the primary insert LSN we
# have just observed. Use pg_switch_wal() to force the insert LSN to be
@@ -277,9 +532,16 @@ $node_primary->wait_for_catchup($node_standby);
$log_offset = -s $node_standby->logfile;
$node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
-ok(1, 'got error after standby promote');
+# Wait for all three sessions to get the error (each mode has a distinct message)
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was written/,
+ $log_offset);
+$node_standby->wait_for_log(qr/Recovery ended before target LSN.*was flushed/,
+ $log_offset);
+$node_standby->wait_for_log(
+ qr/Recovery ended before target LSN.*was replayed/, $log_offset);
+
+ok(1, 'promotion interrupted all wait modes');
$node_standby->safe_psql('postgres', "WAIT FOR LSN '${lsn5}';");
@@ -295,8 +557,11 @@ ok($output eq "not in recovery",
$node_standby->stop;
$node_primary->stop;
-# If we send \q with $psql_session->quit the command can be sent to the session
+# If we send \q with $session->quit the command can be sent to the session
# already closed. So \q is in initial script, here we only finish IPC::Run.
-$psql_session->{run}->finish;
+for (my $i = 0; $i < 3; $i++)
+{
+ $wait_sessions[$i]->{run}->finish;
+}
done_testing();
--
2.51.0
On Fri, Dec 19, 2025 at 4:50 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Thu, Dec 18, 2025 at 8:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Dec 18, 2025 at 2:24 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Thu, Dec 18, 2025 at 6:38 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi, Xuneng!
On Tue, Dec 16, 2025 at 6:46 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Remove the erroneous WAIT_LSN_TYPE_COUNT case from the switch
statement in v5 patch 1.
Thank you for your work on this patchset. Generally, it looks like a
good and quite straightforward extension of the current functionality.
But this patch adds 4 new unreserved keywords to our grammar. Do you
think we can put the mode into the WITH options clause?
Thanks for pointing this out. Yeah, 4 unreserved keywords add
complexity to the parser and it may not be worthwhile since replay is
expected to be the common use scenario. Maybe we can do something like
this:
-- Default (REPLAY mode)
WAIT FOR LSN '0/306EE20' WITH (TIMEOUT '1s');
-- Explicit REPLAY mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'replay', TIMEOUT '1s');
-- WRITE mode
WAIT FOR LSN '0/306EE20' WITH (MODE 'write', TIMEOUT '1s');
If no mode is set explicitly in the options clause, it defaults to
replay. I'll update the patch per your suggestion.
This is exactly what I meant. Please, go ahead.
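To make the proposed default concrete, here is a minimal C sketch of how the value of a MODE option might be mapped to a wait mode, falling back to replay when no mode is given. The enum, the function name wait_mode_from_option, and the accepted spelling 'flush' are illustrative assumptions, not code taken from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wait modes; the names are illustrative, not from the patch. */
typedef enum
{
    SKETCH_MODE_REPLAY,
    SKETCH_MODE_WRITE,
    SKETCH_MODE_FLUSH,
    SKETCH_MODE_INVALID
} SketchWaitMode;

/*
 * Map the value of a MODE option to a wait mode.  A NULL value stands for
 * "MODE was not specified" and falls back to replay, matching the default
 * proposed in the discussion above.
 */
SketchWaitMode
wait_mode_from_option(const char *value)
{
    if (value == NULL)
        return SKETCH_MODE_REPLAY;
    if (strcmp(value, "replay") == 0)
        return SKETCH_MODE_REPLAY;
    if (strcmp(value, "write") == 0)
        return SKETCH_MODE_WRITE;
    if (strcmp(value, "flush") == 0)
        return SKETCH_MODE_FLUSH;

    /* Unknown spelling: a real caller would report an error here. */
    return SKETCH_MODE_INVALID;
}
```

Keeping the mode as an option value rather than a keyword means new modes only extend this mapping and do not touch the grammar.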
Here is the updated patch set (v7). Please check.
I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting
mode parameter. Should we allow this?
If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH? In
principle, we could encode both as just 'flush' mode, and detect which
WaitLSNType to pick by checking if recovery is in progress. However,
how should we then react to an unreached flush location after standby
promotion (technically it could still be reached, but on a different
timeline)?
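The unified 'flush' idea can be sketched in C as follows. This is only an illustration under stated assumptions: the enums and both function names are hypothetical, the in_recovery flags stand in for the server's recovery-state check, and failing on promotion is just one of the two behaviours being debated.

```c
#include <stdbool.h>

/* Stand-ins for the two flush wait types named in the discussion. */
typedef enum
{
    SKETCH_WAIT_STANDBY_FLUSH,
    SKETCH_WAIT_PRIMARY_FLUSH
} SketchFlushType;

typedef enum
{
    SKETCH_KEEP_WAITING,
    SKETCH_DONE,
    SKETCH_FAIL_PROMOTED
} SketchWaitResult;

/* Resolve a single user-visible 'flush' mode when the wait starts. */
SketchFlushType
resolve_flush_type(bool in_recovery)
{
    return in_recovery ? SKETCH_WAIT_STANDBY_FLUSH
                       : SKETCH_WAIT_PRIMARY_FLUSH;
}

/*
 * Decide what to do when the waiter wakes up.  The open question is the
 * case where the type was resolved as standby flush and the server has
 * since been promoted; this sketch takes the conservative option and fails.
 */
SketchWaitResult
check_flush_wait(SketchFlushType resolved, bool in_recovery_now,
                 bool target_reached)
{
    if (target_reached)
        return SKETCH_DONE;
    if (resolved == SKETCH_WAIT_STANDBY_FLUSH && !in_recovery_now)
        return SKETCH_FAIL_PROMOTED;
    return SKETCH_KEEP_WAITING;
}
```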
------
Regards,
Alexander Korotkov
Supabase
Hi Alexander,
Thanks for your feedback!
I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting
mode parameter. Should we allow this?
I think this constraint could be relaxed if needed. I was previously
unsure about the use cases.
If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH? In
principle, we could encode both as just 'flush' mode, and detect which
WaitLSNType to pick by checking if recovery is in progress. However,
how should we then react to an unreached flush location after standby
promotion (technically it could still be reached, but on a different
timeline)?
Technically, we can use 'flush' mode to specify WAIT FOR behavior on
both primary and replica. Currently, WAIT FOR commands error out if
promotion occurs, because either the requested LSN type does not exist
on the primary or we do not yet have the infrastructure to support
continuing the wait. If we allow waiting for flush on the primary as a
user-visible command, and the wake-up calls for flush on the primary
are introduced, the question becomes whether we should still abort the
wait on promotion or continue waiting, as you noted, given that the
target LSN might still be reached, albeit on a different timeline. The
question behind this might be: do users care about, and should they be
aware of, the state change of the server while waiting? If they do,
then we had better stop waiting and report the error. In this case, I
am inclined to split the unified flush mode into something like
primary_flush/standby_flush modes, backed by
WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.
--
Best,
Xuneng
Hi,
On Sun, Dec 21, 2025 at 12:37 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
Hi Alexander,
Thanks for your feedback!
I see that we can't specify WAIT_LSN_TYPE_PRIMARY_FLUSH by setting
mode parameter. Should we allow this?
I think this constraint could be relaxed if needed. I was previously
unsure about the use cases.
Flush mode on the primary seems useful when synchronous_commit is set
to off [1]. In that mode, a transaction on the primary may return
success before its WAL is durably flushed to disk, trading durability
for lower latency. A “wait for primary flush” operation provides an
explicit durability barrier for cases where applications or tools
occasionally need stronger guarantees.
[1]: https://postgresqlco.nf/doc/en/param/synchronous_commit/
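As a rough illustration of the barrier condition described above, the C sketch below parses an LSN in its textual form (two hexadecimal numbers separated by a slash, such as '0/306EE20') and checks whether a flush position has passed it. The function names are hypothetical, and the flushed position is passed in as a plain argument rather than read from the server; how a real waiter is woken up is outside the scope of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse an LSN written as "hi/lo" in hexadecimal into a 64-bit position.
 * Returns false if the text does not have that shape.  This sketch does
 * not reject halves longer than 8 hex digits.
 */
bool
sketch_parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;
    char         trailing;

    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return false;

    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/* The barrier is passed once the flushed position reaches the target. */
bool
sketch_flush_barrier_passed(uint64_t flushed_lsn, uint64_t target_lsn)
{
    return flushed_lsn >= target_lsn;
}
```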
If we allow specifying WAIT_LSN_TYPE_PRIMARY_FLUSH, should it be a
separate mode value or the same as WAIT_LSN_TYPE_STANDBY_FLUSH? In
principle, we could encode both as just 'flush' mode, and detect which
WaitLSNType to pick by checking if recovery is in progress. However,
how should we then react to an unreached flush location after standby
promotion (technically it could still be reached, but on a different
timeline)?
Technically, we can use 'flush' mode to specify WAIT FOR behavior on
both primary and replica. Currently, WAIT FOR commands error out if
promotion occurs, because either the requested LSN type does not exist
on the primary or we do not yet have the infrastructure to support
continuing the wait. If we allow waiting for flush on the primary as a
user-visible command, and the wake-up calls for flush on the primary
are introduced, the question becomes whether we should still abort the
wait on promotion or continue waiting, as you noted, given that the
target LSN might still be reached, albeit on a different timeline. The
question behind this might be: do users care about, and should they be
aware of, the state change of the server while waiting? If they do,
then we had better stop waiting and report the error. In this case, I
am inclined to split the unified flush mode into something like
primary_flush/standby_flush modes, backed by
WAIT_LSN_TYPE_PRIMARY_FLUSH/WAIT_LSN_TYPE_STANDBY_FLUSH respectively.
After further consideration, it also seems reasonable to use a single,
unified flush mode that works on both primary and standby servers,
provided its semantics are clearly documented to avoid the potential
confusion on failure. I don’t have a strong preference between these
two and would be interested in your thoughts.
If a standby is promoted while a session is waiting, the command
should abort and return an error (or report “not in recovery” when
using NO_THROW). At that point, the meaning of the LSN being waited
for may have changed due to the timeline switch and the transition
from standby to primary. An LSN such as 0/5000000 on TLI 2 can
represent entirely different WAL content from 0/5000000 on TLI 1.
Allowing the wait to silently continue across promotion risks giving
users a false sense of safety, for example by letting them interpret
“wait completed” as “the original data is now durable,” which would no
longer be true.
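The timeline argument can be illustrated with a small C sketch in which a wait target remembers the timeline it was issued on. The struct, enum and function names are hypothetical, and the sketch deliberately ignores cases such as a standby that legitimately follows a timeline switch of its upstream; it only shows why comparing bare LSN values is not enough.

```c
#include <stdint.h>

/* A wait target tagged with the timeline it was issued on (hypothetical). */
typedef struct
{
    uint32_t tli;   /* timeline ID when the wait started */
    uint64_t lsn;   /* target position on that timeline */
} SketchWaitTarget;

typedef enum
{
    SKETCH_TARGET_PENDING,
    SKETCH_TARGET_REACHED,
    SKETCH_TARGET_TIMELINE_CHANGED
} SketchTargetStatus;

SketchTargetStatus
sketch_check_target(SketchWaitTarget target, uint32_t current_tli,
                    uint64_t current_lsn)
{
    /*
     * Check the timeline first: after a switch, the same LSN value may
     * refer to different WAL content, so "reached" would be misleading.
     */
    if (current_tli != target.tli)
        return SKETCH_TARGET_TIMELINE_CHANGED;

    return current_lsn >= target.lsn ? SKETCH_TARGET_REACHED
                                     : SKETCH_TARGET_PENDING;
}
```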
--
Best,
Xuneng
Attachments:
synchronous_commit.pngimage/png; name=synchronous_commit.pngDownload
�PNG
IHDR p � �K�? SiCCPICC Profile (�u�O(DQ��g����B,��b�h��Y��iP��y�O����')kk����Z�NVJVX)�l,m�I�wg03��������Nh�����������k�F�:�D?t�vD$[`��Z�[hJo��������8��4�>p.�����
&S�M}g�l!]@3��mW(�%�J.E�W�����D�Oj��x�|E���V�|O6M~����-�k���*�,�9�A,b�"
$�*��O��Q&v �� �"trH��8��8Lr!������5��20=@(6��a�\�K
o��������sU��w��:�h���! p
T���z^�7��g��Ic�
� } �eXIfMM * > F( �i N � � �� x� p� � ASCII Screenshot;��� pHYs % %IR$� �iTXtXML:com.adobe.xmp <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 6.0.0">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:exif="http://ns.adobe.com/exif/1.0/">
<exif:PixelYDimension>1178</exif:PixelYDimension>
<exif:PixelXDimension>1904</exif:PixelXDimension>
<exif:UserComment>Screenshot</exif:UserComment>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
�V� iDOT M ( M M �|��8: @ IDATx���UE�G��ABl�.�����l�EA�R����VPQ�@LP�A���yf������{w��{�=1g���Y�N�Z������}���?�� �� �� �� �� �� �� e��Jb�-{����� �� �� �� �� �� �� �! \?T�� �� �� �� �� �� �� ���p�ty� �� �� �� �� �� �� ~��9&�� �� �� �� �� �� �@9 �r ])�� �� �� �� �� �� ��b��CE� �� �� �� �� �� �� �� P��@�G
�� �� �� �� �� �� �� ���p�P�c�� �� �� �� �� �� �� �b�-����� �� �� �� �� �� �� �! \?T�� �k����t�u��&��K�,������Fk�U���V_M�Q��o��O1����7Q�����8yZ1V?���6ZG-^�TM�yz�5rB��\Z�|�a�5VS�6���(���uj�Z5�� �%�_���Y9�+7 e� ����K��������9�>�W���h�&2~�������<]?�����+��rc�j���)��g_|G���?ESw��2.>�x�s�-�3������"�B`��VR7���j�\C]3�>5��������C�N�jX���x��� 5o���
/���n�@����Z��_�M����%K��6�hA [d
���\z���W;5����{�����v�Z�>��Y����gF�/����r� �%M�4P�zi^F;��d��2Y�]I�.�9�"6QE����_��5���I������PK�.5z���~/����� Pd���L��U�TQ7���j.��|�#��q��w�|+
��q��`����?W7��tQ�Y*Y���������ME���z����_�JV��7h���v�y�+�����8�B ����T��'�w��s��>sn�x/y A�Y��P�cq(��s��G��w�^; ����(���p
F��{Y��l�
��� P(/S(-Qq�Q�s~�i�eoR��`"��oo�j�h� j��77/�������_)����� Pd���L��x�
��^��[��7�~���?�w���H�}�l�>��[5��_�3���a��[�8�/�]��[�p���Y ���bRA�4���D��C����5���5Ou�|h����"B���2Iy�"j���j1��EvD����"�*��"��o��];p"8�UK5c�\����#+���c��:� 4��w�c��yO�����|�[�O��[l��m��^~}�d�LZ)-!b�M�\^X��#7��u;�����>���#�UG���������;�%*ymD�� ���.G�R��X�I����k�PJQ�@�! kp�A_�.��#���~�T^&)����b���x���5��`_ �b"� SF���[m����������V�;�L��1���;�%zk����Q,\����������F`�U��[����U���{@}1�������@��7mD���D��SU���r�>[u�����ce�������{o���/������t��+�����!�7��w�������x���<��I���CG���b��sy_��|�|p�O��v������s�3�]��z5o�[D�|��QXL�XEj�\C
���j��O�QY`5��T&��[�Z��k�,��� _]��@h�D"���SF���[F@�c��58Cp+x���w��[�;������/t�D�/�*�9�pP��D! 2~B����v�Q�P��b�*�^U����x��"�\��U��J!P(�w���%��t�\�.���G���)� �������z�xf1�1�V�>(o�@\%|p �qFd��hjQ�s~��(5�B@d�(��=w�(�d����U���xKn��]E��p+b��;�B�P�R�"J� �r�-�����*�R� P.�B�\`�-��#�
��%B���)� N�_�`*Q�s~��(5�B@d�(��=w�(�d����U���xKn��]E��p#Z�A��j���������j�*+������f���>��5���"J����M�=Zn��UO�UK���?j��yj����[��U�������oG��Z�����S�o������=w�e7��j�����������U��8�Til"�����F�����m6��:m��i�����>U��W�{�l�}��G��~�5s�j�*j���P;6�\5�_G�R���3�75i�45��1j��yn��z�4��n������u�7�w5������R~2������S�{��h�x���q%��\c5��6�T�j���]w�Zm�l=5��?����Yr_h����^������+{���j���a�:�z�j������������w��W|8���j��������=�x��������W=���j�O��\C���v�i�G�~��8����!���SS-�c�~�,���Y�������8��,~�[����D��[K�{��9��4����<���>�b��
V_}U�����b����U:^�4r��+��ru��N[����o�?ck����,=n������6k�V]u�����|YR���\���]v����5�0s m1i�����������[����H���������A�������wF�/^t{���W[E�}��fnz��wK�G}9t�]������~���~����q+��F��v�c�y���j��_�}��Qc�x
zf�p�P��'��x���g��=���{������������BN2��7i�@�Km��^�d�R3G3�i�Y1��C��vTO�xCM�s5��X�V�����{�������������9
�������w�\�;����P��]#�����uc�&k�����O���n������&�Oo�xS_�xw��0w2�m���j�UWQ������g��e���!����5y�M70��9�v�O�u�l��L�<�W��������~b�k1�\>�2������?����9����z������r���'�������=�K�}
�|�\�f�<�7�d=��o���i��o��Y��NkM+��H��?��}
�q��#|���g��6[n���{G����&���*�=�5���7��L��T����4k�zM��z�_Y���f�7k(�QT�:������|"S����e9d��~���E�K�}�m�>w��[m��z���j����<������S�]V�-9�|A�����o�es�:�G������|��Z������k����j�5�?�L���M�4P{i>l���]S-^�T���o5/�����e��Xvt
-#!#n�AS��Xu�U4��'�:���/��1��T����y+��8���������a�C�:�-7�`9����=w��J�y���\�h�
�1?o���F�O\C��<��j�� ������
�-~rH���}������~����/����}?���}�8H���M�c|O3��������k������N�z��z�E��0�}��s�)+���<8|��*+�l����;�>3�@[�#���Rw=�u����5a2�-��i���9���j�i����.��1�}����o�m9�.�����#�Qoi��'����.U�TQ��q��]�\|�U�{��0n��ff>��
��`�O��/����������~����b~�"���~�*�����4�������S�}�4���u��~�5!�����o����#_���� $�b�
@����SS;j�
&:�0�bhP�^p�Q�TF ��-^��w���_w�����w������=�X#�����?��u���C%�r�!{������^g�#��|�S%{��D ���s�u�FQ�����xY����;���{Y���G������Z�y��K.�|�T3"w<0B+z������C�4BM�^7��8�������\�;���n<�2��(oO<��Q���J�'t_q
J�6ZG���ls��[��erc����6
k��.3�m#��(�B+=�%�}���F��*����1N�� ����bD���A>�l� ��W�l�(cF��c��t�z���g��A�����^����u���n���|���I����'#�s�
_1����"��g�r�j0)���g_�Y �T�%��tn�k���Jb�����s~�}#��)� ��w�6h�����|��3^
������������Oxo���7���]���1���p�����;�u�9�P4b|{��Czu9K5��>�(���r�+8%Q��b��
O<j?3wb��#����1�Gp���9��+������|�^���Wp�];��s�sg���s��Ds-��?{�����9���~��T+~]���d)������!h�=Z��S� w�vTC)E8����X�������G�_��d�U����sh�p��>%������1g_�����*��}a��������]�2����y��?�&��4|@gQ�a�:^X��O�r������[�y�=���4�`
��o�!�����}fi��P��0� #�Y�a��C������{�ch�`pA)D�����~�{����3J]M��*��`�AV��
���\�:�R�����8���z���x�C/�q!
�B��/�t�_�`W�<D� X�_x���K�q//�Q(�P���� ���@D�;�r.��Aez��5�{<�w���8z���w���vSu������Z~�6^?�!c��7�Q�N4F|~{ �������v����i�!As��L���^;���t���
-{�;���g��L�8��� ���uU8r��g_|���0A�\��0��z0�L�����}��>;^�}��}�v�A��l~������;���9���n����;oc)q�"����|*�������a�K�G�p�_UK�i���'@�.f�������:�>�{y������K�+9����w8xQ�r]�G�j�X�����]��So�-M����V����]����gl���n�o2��1�.d��p}P���K�SJ!?N�E3���!e���F)J,3� �����7����MU�O2�#��B$F�����D"l�~s�������V��������q`#E���(Q�1i��������x�c�
����z�8 R�������\n<�;\>4��]���w����4c������$�m��V��9�g�o���-����<�y��)�z�4�W���=��DvA_�����|��CT�!��4^���.���'"�Z�eF�-u�Q9xt���q�n��{>�;�'}o?oz�3u{T��F��"��~'k4��b
��?�����;�;W��������O7����65�1��m���U'��#-��3��v�]+O��i���Dz��5N\a��9���$���ME�m���%�%�n�ko4D����V���J��;���C�#���M2��D��|�:M�LP�s�����X�&���q���z� �A��gG�/�g������-��83�����h��1R�cM�a��vL��� r����>�t��
! |�h���g�BH$� �!*�0��cD������;��t����wxOs��G^T����s����/m�QjQ���dc�}@�cfk]W�B�]���������n<�,����� ������5a�Q@3V0 Ge$�+��vU'� �G�
�2CG��D[I��&�O���v�OQA����c�����F���X�l����;NGl6���S�t�i�4�q���;X���0������]�3��5���������4�N�x��'���������|���a�_<?��i_����/0���(��mB�s�?L�j��_�9���]�L'I�����:��������-�0�sDR��OT2u�K�s�a��{�����c�u���z|���������NA�8�Y��i����'����/{���yb���|�';�[r�Qe\�������zN��D���^�s������7M�k�[�,�"m^����}�����}�
���v��Hgv��SbZk0��sQf7k��R!���.A�V��������{g��j�:"�e���!���(�-���G!�R���N.�������D��@;�b��7{����O:���������X� ���������r.>�xsb���8����������nd�����k�8!� Z�:d�x�Y���-k?���������\������C�<����:2�H�|�9�K-{h>;Yvjh0����t��R?r���=����:��v���)� ���>��j�$�c��r����dpc]p�+DV�����|y8���8��>�x}��W��d}��
�tt@V�������O�Ev�~�-Z7TU;v�c�K�r8!��������/|a=-�Cp�z�m���O�r�w��>����������g���/�v�M���#�w�=�{�t�Wu?��#���>�Z��?�LLd�j��c�t��ab>���Z�2���F�W�>���Y��8��l`���z�=�5�T�l����1���w�.7��d���gm/���W�w�EFB��=!d �EtG��D�o�G�I�1`<�9�F��e%����<������9��@�.�����n2m��>h�`��Po�)�%!28.�}�%2F��Z
*������u��i���bi�o�����ni�v.�_JC�J�;��DG�\
���cp�y����y1�Z�.���p=������EE��O���B����F��aZP����v��6�����jk-"�`R�E�^��|���������E��.�*Z�u|��e$1�x�Ea\�K��w��e����B�KD�;��`x�u�k��z�2�y��1��A���� D���j�!)�`D�kc�7u(J[&a���k�g�\���w;Y�1�\k=���Q��E.���6�����t���1��A���S�@`|�z��]-�]�K.:�(�0�\3���s��-��2x��R�0n\&D�b����M3u(Z�(�qb�1.�>���|
��SK��)/�[�yF�������(w0�bp%�ST4�1����9l/S$F*�~)t1�v�Q�8@�"l]0�!�� x�������v�o�6�A�������?3f��>��.^b� e��-�4�^{�#=o����������t���)z�c8,���`�m��@���A����;�(1���F� c���0idP>��[��x�����~��K�#F������EiJ�3��NZ��5V���������~��X�|����!�0���!�Q{-��� $��J#7��G�������"���i��(;��jc�����So�.Z�u�q �A/|?���vC��������gB3��v��%���zMg�`1R+�1 `p�#�zv>�(��x^.��1������e���2����I5�3U��ay�>�>� ��r\�w;��V���u�����G�q
v���6}�oM��g���uI�-����2����^v��S���l�R�g\���I#'��^~W=��P�{ ���:*g���W�l����my����_Y~�0���\�s��1F1�z��;���Z�D^�������e��c����"�A��v�������G��:�k@��.08������>������yb�w�2�P��,y��T���1��Vk�yD�)���G��f�EW����!��y�;���)��"��Ga[����^����j�}W;^��@�5pp���Ftn��'-�a�2[JX��O��K�D�
�T������`���7��w����, ������c��yh�����xl�i�1�!�^;�>3�JN:_�L��n���.���0=�]����+�p ��v~�|R���1r����'_.������)����P����:�P����e�t���~\�
e�(k�����8j��Fq��4�:�I��,dU��+�I�������wK���sa�V��*���i W�$�zn1L�%o�&���|y��"Sv�;cD�y�e8����z��$���z���^KrI'�A+��\t�����!�������iC��� D(�N��[K��hCj�^x��F=(���P?��u��}Io�g�#}(��DQbh�"�����
�u�]G��N���dT1�<F�~�� r�
���6�R�P(����` �p�&�������8���@P�p��5��hk������{8�K���DP�{�O��^���~��J{�>R��!�y?���.a����r��E��*r0������S����:
��&�e��v,�O�;��}��2�b�"Z"m5�}��G������s���U:�"5=��q��D����7���q����zo�%��I�"r�t�sA�)��B��]�6s ��b�#��������������n���u��r�$����/��/�0���t�D�3��)��~��xT��4���z������o��Vf<G ��o�`a�8����s1)�I��5m�\���z[m8y���V(��=�;i+?:�;�N%I�(�]����}�k8���J`�jM��%M~(^?+n�k0�f�?e~y��%�*����p�"�g�����(���:d�J�#"�����d� [�|?���q�Y�r���y�O���'�<�����'l2@M�e����f��W��]��6���#k�V��I3���w�v^��9�H�qy�� �X����B�����C�[��9t��:�#��:u�g^r�� .���hG(N�e�#�
���;�T��|�}l����pY���/��Wd������ �!�R�.��[��ee)����� �C)O����,FY�����,��z&��qn���2�t}n9�<���
�/���kaz0�NSn�z���r6���������Dv���?m��f�J��s-�������f&���~��Q
�i�u�M��Y�������1vg��]��:���>������������
B���p��`p����a��~���^{�J����3��aj�s08�������R�.�@�~B�-�O"KH��{�/P�L�K�F�� �6��a�&�/�\����q�v��c�c\y�-+����!_P(�w��uI� r�&���3����8(*���\m ��hK��^���Fx�W��{0���v�p}��8�}\��a��������:U��
^g^��Wy�.�l,?@+D��2�������c����E�ySh����[�H�E�Q�A}��6��E#��b�4��A�\V\��f�
g_�7����o�ww.�3�r�e��E�0Gu��~����py��� �~�������]H*h����+�u�R#�����7��_]��z��%��5�f5��`��?���pX��Zq����zO=Q[��/��&�b�����Gu&��^�������v�2��t�i���u�K=:�a��������)</�t��'*�Q��V|�����6����U?*�5�}����s�
�'��]{�\�����e��ge�Ms
��������|!����K=s!R�������e��u]����7u��������'<R-{��!����2�~�2|2�G�TdSa�� y#�%�G ��o�������6B=tv&R~3��eP6���x����s�����K.Y(�4u��m�����{\`!d8 ���y/��;���f�"j��C[E�]�����-���<\V�0�{��5����YG=~c����f��x?m��qF���;������t�[J��B�Id����?�n���M�g�|��^�q�p��}-Lf�i�KS�b��a&{T�Lx&����+��p��D�C~\���M��#9-d�����{��](�����*��}���q��q����:[G�O�����t�t���uI���N��������_i�v.�4u��'\P*������gh�bd������Z����nRL� ��e���@d/4@�����0��������p��m�5�\v�8
c�����#�y�#{�z "$�����T�w;Y�1�\l����u3��8�%������k:/B�0�a��d�5�>��������n����>��m"�IYK5/�9N(�
M������Oe<�!�~Qi�op
��_��paDT�M/1
��TEA��z����O�j"���u K ��G���_j�1��x�}{ILe!�����(��\r�I'�>�U������MG�ep��������q�9]�Wu'�w
�m��)+��*~��m�+��7_w��{����������6�{�A�p�{�-��������8�.u�l
�����_���S��(T�Lk�����t�d����Xl�������
�-���y��R�>Q��
�6AkeV��X�`������Xg���uI���������A������{��|����[.��LBD����}H62@���m������DF� ������=����C��GD��:B;�>������g��^�z���nc��Y��9��&ub��|��~��{'j�A��.Y(���S@���6���C����?w��3��Y��:.Q2�\syk�+����~�\.�qh�����8y�h�i(����L�P���g���N����~'�C�9�^�Z���������z��u��3�1w-J������8�t�\����2m_���u:M=z��z�b���V�{����\
�8���O=�F�Z�'z����ky�4�:���oIy��CqdU�Ib����<Ay���90M����`�������
B���p���{R �����iU����BDS\B�}�
���W��P���?q�e^}�#u���{X���5;W�!d?�0���!�����. A�x�A�%����Cr��r[u���2���<�7�����W
�j��� ���~�m�k��)I]��K�{���%���q^r�(I�����E�#����
�^Js�P�,H���qr�S����k��
�;_5�`#����b���u���t�������0����������K�������w�n�my������n}�+���O�'s����C�A;B������}k\����:����{�.e<v������( �@��q���h!�~��XY��h`�;<��(�����J�HS?�_�c��r�Q���Vs'{#�F��N���1����v&��B�p�%�/�x`��vy���o<������������g����� ���oa�����n���:����Y��bY������`��0�rMs��&?����7�5�����~��2�m�B^���-;������Gu�����/T*"?��+.9�D��R�G�[J]���q��Rp�1���X����+K>�h�x�nf��n��Pn�����*��V+�U<3���t
����A�.Y(���O��4�1�����C~��;��p�>t/�7<�������kI�{�����"
G��$<\��0���9w��/���F��'s���G��U������w����v������$r������Uo���at�j~�Udu���#��q��S4s�m�.5
�=������l_��Y!M=K�m6��'�j#/#7�%W����=�/�|�ydR ��o����R��S������\G�%Y?��;�xY���p���.OP^����;�uZ�V������[��i�A:����K#���^�����H�Q����n���0����������.��q#o�}������QD�����b.R��@��u������Q���[�}'`��KE;��R��q��,�;�N�a�+d�x�-i��b����]�����~
���\��Oq���w���#Z�v
��Z���B(j�E[���]�M��8��V����115iT����Z�Hu���d���\��T����K�i��D:�)?�X�V[��(����]����:Z���4�q G
���+��2������J�]�3�c��W������/������;p�g�]�#���AP���) ����>I~]���+�j������e�������u{���}�!��e1Vx&��Po���w?N�}���t��8�`s*n�c��%��3�;�Ge�X������v�����E��~�$D�ku��Pj���jz<�1��8��\��V��k�R�����IsQ��?���&�������y>0j=��Na�[���yU@��'��
����Y�������rMs������(~(+^?+.���LY��Xe>��T�k0uL��k
*8T����H�H��� ��9h�U/Y2.?��?��9B$-rg\"����q���zv>��J�k��
e/�[��I��f�X�vT�|��H���Lv�=��*��!�d.����� ������:wq�(L�1]r�=?��*d���PWA�0�%K��N�|O��e9�����?��wA�o
�m��:�t��/hfL���:�P��^;�\����C�������t6,�]�]1F���C6B>m~8�������r�ywM$ ��l?M�f��uq>m_���u:M���{g�>]��V5)����}�������������������YG�2���O�K����v��M�o���q�q�W��GE4��)���I�����UV��$��������>+����i�Zm�EB>1�:�p�
�Ez���N%_��#'id������Kt?����+p��2�3I���P�<���5���8�Bg'�0��
��x����/DZ�y�!�8m������'����4���'k�Dh����^b�[��%��H]"v�(�6��(�e����Y�"L�u.�Q���\�
��I�'��V��<I�b�f�u��%�fn�{y��_��,
�f���}v.��������0�g�~�q�B�m��Ea����V���%�.-����v���7�H�����g�UW,��g��i��#u��:�
r� ������7��G����Y��AxF� pa��J&s�CyV�kWT������t�uU�m73)�qv���70�Q����I�w� �t��~�B����O����7Z���N�EV)��Ai�5�g����ai���Gi����K���'}�,���u��$�������������l������M��$��b
f-����O��:������?D!���1�R>c���8�BdEa�[�Q'!"�X;�Pa����6�����<�[n��tR':w&z�(j�������8��]!���7 ��8��I��k����{L��C�0���Cna��$��X]B\L�������~��M����{����&'K8R|5a��Y0{�����8}-����VD�AI�B��A�}��|���p�������������Q
�`��\�d�t�-mY5�^�>?�������i����;����{��a�V{M���V>���r�(���x����S\�eY��O(.�� ��;Y��x��8�����[)R0b���\"�"Q�o����Q�^�������q�^m�SQi�ri#�����o��t?�,(?n��uYA��/�0n�w?]����Q�~�b���u�e�1�b%���*�_9Z=��+���3K�}PC����-����J���9��&�3:"�r?J2�K����p�5o�������E������{*t[0J
����jYb���(:���@��`������������� em�=���)�!"���K[n����aY��|�Y�=�%�������c�(��&�fs�*���{.�r�C��i�]����
�o���u��j�� ��d\.^�T+��� ���a
R��2�D���2s
�:]�����|����)��������/_C�K��[�{)�M��|��"�~��j��������,I�?�5-i],6Y�Ca�~�\�n��������l�3�3m.����d��:����y��r`�����}�i�>��;5g^���-g�Q���ED����]V���UsM����+qd�=W�Tz��c�"�K�����8��tzN��)�;7�����a���g��/
�-������e}�Jqo��~�:c�^'?N�q�-#��$<\��0N]�&�x������F���$���:<N66�A��?L�q(l�<���=��������m�n�s���c�}����S�{��=N_��N��F�x���:��e�:k�,-��%�� �>j~v��,���+ -�2�H��=2w.U?�2�X?*�����\�d����BVMb���G������
�Y�q�B����0]��F� �,�������]b�vct*�!1S�:�����x���A^�B��l�`n�}�����W��P�BX &�d����� ~�r�f������v��(�Qv�I���=!�"�������B��Q���x��FT4La��4� �e!XPn���������_d����]���R���u�� c*\�/
�-���eDy�u# ���b�%��n�� ����R�-3�;��q|�Q�����j�r�w?�B��N&��xvm'EZ,��c�P�Q�!��{x�V~�F��5>�}��Lzb�~d�Q.k���Z�4Xn?���������Vs�u�/R���2H���/��e�+7����L������W��� �� ~�C���{�SAJ��ptZ�&jK���u�gB(�I59q�4�[�1���%��vd�p��5��4������2�~T,kp���rMKZ�9����x�\
�vo�������Q���<�`��A�ipm��4�A���]��j��l�~�K����r��1��� �d�����h�&�t��������u?�0�������?
����I��������A�m�����������E|B7���;�P������[����SS���"�q��{y~��']p��0.I�kV�����j{�q&5:�����q(+~8���&l�>��:xK+��l�
mh����}e�8}-�:m������9����%/\��8,�[����KA�N���r��T�s�5E��}�a���gT���������M��U7�}�g����F~k��������\[���G�K�~���d�$Xe��s�?mnV|G����'�3L�j�s��Q���G>�������r��v�1_������J�����=�����K5\(C!, �w�N�x�6P��0<�-62wF�?u+�^���?�0���{�]O��$�6��(�e����e!XPn�����y�}&�*Q�T'������eY��V�����C�-�3���F���|����*}��u��r;�>RP��LW�Kj�u+�����r�
�d�s�\9�T$n���$�j�0�r �Q���Q��`��;x���w���w���W_l<���U�a�%,k������=��������C�c�wI:�g5w�WX�m7U�E�����0G^��Y�� ��U��;J��1��_s����-�?��CAJ����8n�#��w��h��z"�H��g���5{�B���9�~T,kp���rMKZ�7����x}R����T�RK��5('5��e�]��'��S�k��>A��6�Z�,�;����P�l�q�R.h��\yH�s�~�>�R�k�����'_��$�c�k��dbo������W�gnwy��|YUp�(&nR��U��t��Zg�p)j�;x�]���hC�u�}V�
���pY����$�Yp�JY_g���UGo���O8��ke�G=��Ya��q���#s����[��M�������~�L�g�u3���eg�C�K�y�4=�@A���O��s���������Q�,)�~~u�sS�:��'����US�3S�>�x����:M���"I��$�����-�&1�f���x/���2��]���������3��7L�j��3���)d��p��0� sa��O�d���g�m��uaJ>������_,�
�P �����u(]I�Z������7W��ls���*����n�\���
�5=[�������i+$�6��(�e�}h� @ IDAT����e!XP�m�����"�.�3����Q4.Y��]uT*" i]�s�������wx�c��K{����������z�Tj�"������sv�~s������J(��������Si��D~���D����znc�>`���'�#��Pz�k����$�\������"9U�\�"�����O�x�DEF�������������nv������;O>fu����N��~{�S������I���r��k�7������]j���P��'��;�95�1'p=�����$��2Qk���l�FY��bY�sy����\�B�F�C�3-�������a��y�o���w8�r}Ws*�-;j
��$sOY����>�8��L
+�s�������c�e8��E�{����?o������������K�S��\G���e������i&s({���-N]/>�d)�8����{��U|s-�Q25i������b2�����n?AT2r%�.E�{�m���r�)����[Ny��?��8����o�����,���]��m` x��x��,���?�>#�{V�=/���cu�d�|��$���k�] :����S�w�!Rn���W�\� �>J���*��x��:Zr,k������:���c4��q���P]x�u~�-u��7������1/������z�{��>,���'�{����q�������{VY�o�q��:���Y�������_E�Z�������zGg��1�:����G(�r����W�� ���M� ix�uF*
�
.�H�7�{=��`
a���N���p������oc�/Kr=����B�������iP�w��c��a(9�
�?������7G),����8�yY��op��^:J��F��3��K��1RI�}(���A�JZ7��^��.��T��*a��;}���K�F����#h�6��K����@m��<o�������g�~�ip)���'-��\�e.<��#�^�no�y/��F]�J&Ro���}C=���M�HIG�(��fn�_Yp��M5�Bi��������vF���p�?��G�I�����:�������z�yEZ�4(�94�s��]a�����2�6V1A�I���:��R��|g����Dn]U;�'��:���:����XG/�@�(�~T,kp.�����K]h�$�P�>�����c
���^~O=��H������f�1
.[�9�t�)�5��EQ�\�D����$���~ur��a�=�h;/��Cz���m�"F�}��0������5/����/!����+��c"s-^������=4��s�j��V*nzj��'�-r�!������$���A������vM_������F�{��C��B�e/cR"%����q�I:^�0���B�H�S�~�f�O�}�����qX�����������=�6�����F�l�u`�"�����4�2�r��~��Co}�l��0rS/s]�g~����,m4$SW�Ea���\��y��b�@���:K;7Ea`�#���M�7��k�u�K���5��a�)7�O��t_��8r]����d%�&�*����ip��;���~�(J�j��Ug=d�NS_h��OA b�u��Re%5����~�Z
O&�.�t��bX�z��T���3�h����h����>��;�O�S=��P�BX $�d�|�MU�6��Q�S/�A�i������b
���p��
j���{��'���t_��=_z}�"�����a=�ri#��RX�};�q���,��
�`i���W(��[�}F��5��
��9�p]�������K]�/��`?��W��F�*�Y��SEwh}�����^S/��A�[��`?[<e���PF��|e�s��\d�E��$E�P���1����\�B��W�5Q�nzX���a��l��o�����|w�o�o���ny6��c�a�u�q�2�&�=�5�f5��_��?�������?��������GQ�}���t�3�'�C�\��k�H/�8�0�.4�7�s[?�d�{�)yH���;��1�O;s��;�T��|z�5��0�;B��aP������g���R:G~(M^�g��r�j���&�
)�����=�#� n�k0�I:���L��(m.�cO@����n�NusC��:���`��C�������_~2��$:���Fd ��y�v0<B;B����z��7K�#Xo�v���Q�DOx�F�q������Y���fAC_����K}��|�������[�u22?�?q��c�K�7c��X�I:����~&����q��t�fa�-9���oH� M����j���}����}��`V�x���u����s��
��<Z���6j����Q�{M���t��q���~��G3��!�>�~�����%�q�'Sz �:���#��f��N�3�'� ����~%/��������f���0O��
��|��%U�����6P�c,��Y������Y���{R����q���.Y�7�\�����3��e�G�Z�u���Tb��OA�,�e7�a��Ko� ��[I�r�qz����%�,R���2�����o�X!_z���j�0�� ����P���t4���\�r��nS���H��%��+o~��t���C3��Btg'���"�G �i�� I��E.mD���\���|�IV�E����XB�vy����Zi���pyEZ� "Z�w��`^�(��PN���Xz�����W���t�Qe}��ko3�A����c/#�{1��B��QGD���������zR���a��~��sS\�|�����]�e.�~�����N2\�W���������wCRK!�"��������gI���jypi3��Y����� :���� G��
�L���k��j�Q����s'� �\f���5�����{Bq��j��%�/y�M:�f�vy�g��c������.<Q5lP�� ��S��y���1�+�������{-Y� ����P`��s�u�l��CD�S�x`|�7?����e?*�58���bM��.I���x}�����
�?nR(��K�3�����h����7�5�z$�{�'�5��p����B��g�� ���o��k9<���y���5��SG�����\yH����Y>�sd���/�9����=`������'?7*��K��y� "}����4����g(�x�N9�id���}��8�A_>��6��&��[s�p�b�����i3���w�m>P������%2X�����g��i�����������Q�~��}�^�gd-ek ��6Y�D^3a�����pY�����t�fa�-O9�=6�8��x�����|��,�a�3�~g�����c�9���d���_x[��_��[{k.��o�?V�y�Kq�Z.�t.u�v����8������R��n*]2����6�h��9�������o���2>�Y�<�����^���b�{�������Ot�x.�~n�VI����#dk�i���
9�������:J~���k�n.���K!Ww?Om�^����{��������5)�.������b�+�K�G����~�����g.���~���r=�8@�����~�W�T#$�b��A���%���o����?�:�0'�S�b�0=�7�B%�@�>^����x�=��B��{&f�F
���:��#�S�k%���������Ba
eH2Y��
�k%8�P��+�k���@��0���=Dj��_yO}�����AD�'8�-��x��:J��(����D���[�!�=�!���~��F�Ga�ui����$}��������$�����O>�`R��5
#H��V{�`8��S����.��.q�
��e:���E����6H���6��~���k���Bg�M�uu
��zJ�� "�2�H�����_��5M�n���I���:�F�o��3����S���Q� �7
�9s��\z{k�L�b��;��R��\�B��q���l�D�L�^�U��B��F���b_�q� �
���6�y���HN�B���<��w�&
TO��`���gR����/%�~�K��(�NmN6��k���,������d��z��{�^�Q�0�<��[�����V'/�6��D����������J:�f�v���������g�����u�|�S���5����z����:��~�z���)ypH���t3�W����I����a����V,`��<�R��T�x���n���g<�;�3�~T�kp������k]b�C�{�C�>����k0�*(�y�]m��f��"�|�
��Qx����9�cW��i���!���=���n�������0��l�p���0D�����
�C;6�W������Yw�����2���CRvb-���T��V��q4}RG�N�<��3f9�f��uTVE�Z��w4�{�hv���%H�l���^��|�61��\j}�Q:m[s�H���|M�������k�ca� _\L\����<��>g�t2u`�A��c��LD@`8�=��@'���k\�OVDT#���~��W������q�Z�i#u�~;y���v�I�O�`@"���Cq������p��8��b����5.�)O9d��~�����3��o�d�+i����ao�A���&�Y�q"F���0j�t�^G;�1������q�=fq�b�Yx/��k���q
���`���=n}�B�p:S�:h��n����>���=�1?��qe|t5��r}}5a��\�\my�zujgc���a��^�n�������{{\��������{IM�i��'�������w��Y��=�A���Bip�� ��q ���)�:��5)+�.�����j�^�' V��v�������������;(/�\�u^��k�����:�9������g�oA
���"������� ����k f�m-d���K��
�/e&��,��EI�K�I������C�>��a@Q�u]���@F����X vn��V�g�~�~��I&kd<���,���}2^=�������M$���Vb������~S��USk���D_!=)�~q�(7�A������*#�6���
K�Mc�d%X$���K���0�����p����sc�xe��o�Z��L��t�gk���@���j���A��16P���9�J�����(s5=��a %��4�dA\��(�^-Vx��@���?TM-�Q` �mhF���\�B[��m�aS�3�T(DD�9�X��a�su�^�����@>���x}A��e5V��H[���S����htk��7�IA]E+�zw;�C������Q����?i�=��$�Y���Q>��#�rc���z2�,�c�V���np�(�����a��*J9��/�^�����_����=��P��/��FA�op����z����J�#��0&����gnq ������v�q~�h0�����n�#�eN'q�,�Q!����?�����Z�$�P����_;1����)k1�X�_�~���o 2�rm�k0��2�p_.k�'lxN��p�����r..��%~�h���r<�"1�Y��!�S��<Dc�MI�����q>�U��sG��!��%d0�\BiE�_�����o�f5����C��)?���}���^}kEY��a�>J;�Ybmd���B���7��;]�K#�D�R.<dR|.�t]ml��/���2m��NG"�9&���E�I�M\���i����W�}���[om2[��.���.V���}�^��{����8��F}&�9pu�~:�Y����D�)��L���)fk2��n:j���&?�������{���k����-1��;Z}��J��C��Z���$~����N'1�R/���N�d
��������������� �7��O2?s-�l�,�����-�������\)I��3�X?(���;����c����6��c���,����X��"
.e�9}%u�hst8����7lF�8kRr]��#mY����O\��=i��8�������8�{[vZ|����s����V��1�m���*��K���s�L���� �b�
A������x,1����������&����#��m���]� �5�x��g�t����z_���������xG{�{��)���{�Q\b��"����/6�l���Y��%�e��HQ�wm��Y�haCG-��#���E �`��S���b��"B��������m����`�l���l���\��F������*������S?Ckos�dq)�6�l����7>4�h����8�J���D\�6i��\��'�~�&����+��E��9@���q�����xU������Q���/��c>���� ��=�Q�&n��fj�m���2���w��Hi��:�p-Q��i�}�2�~ex�1w�FD��
c����c��,�\�x��V�����a��[�h;W1��1U���������oo��oR2S?��K�e��~����:�$�S����{Ij����z��&�X����� ������ksN�>�B�2���+����{A3���2R�=j��nm�-��};�6��g�����i�=�3���e5w2��<xO��&��pW `���-����g�|�Q\�>�y;.e�v���� �6:;���re�7�L6�����Njo�~z�t�S�7���l�U3�c���8R����������J��������}��Iv�{���Q���D�X>��K��]�����VVm�&�o��?��@��*�8����^�m�{�~���/����(�5��=~��gs<�5�:�`���cz��4���:`��#%}m����j��$�0<�f�,,��'_~�����h�/*�.�g
K��(d_�i�G�LB��Z����������I�� ������~�m���#�i���F�x@�c�C��G'��0�O��>g���*y�O�Zb�c�����m�������v>�#�����#�������.!k!����!�^�si�)/��j��������9�r��'��� }6=���y�a��s�k J�l���t�'�u�N42��C���lO�}���a��C��M�cJ�b���u&�Q��^�}��y&���w��:��Y�Bv�/���^7x�U�~l�6J3W�z�n������������8l�e���� :DkL�l"�����7F�A�:�+�����v[o���Vf�v��1����F�B�/��$�~��1��Hn9|O�vp������Q8nt�lh���-�q�\��\�d�H[V#Kq�����4�w.�P�{�N|��w$]wm��Z������ |��g�D8jL��7���B@�1�db!�A����_�m�I�E��IY�k�P������,4����2X�1<&�������HC�}.B8��0�1 �{� i���5lOa�Z����)�� �0 �*���t�=��{�a�x���}�;b� u[�\�� a�jw��R�BE��KQf�v��|�I.�CH@��wj��7�2,����y�/����}�L�O��f
��oq���#H���'�\�,� �R�vM����Ys���
���27a�A�"R\��o���!�)<}����k�A�#=sY� [���"m���x��/�+~0���e�e�w8�E����I��[F��4�^P�a�Y���~V��w�l�(�^�N�����Uu$?)���b���s �F{�D+�~K�G
e�%�}���~�������{L�k�5a��������Y�+����v[�0�������u[�����A�F9
O���/���*K��q���S�V��'�5���v�)������21�l��3��vC����wh�9z�N*����&y�k�����]X����P�&Y��:��_e���
D`
_��}%����gM[��3�Z<�����_���g��{{����DV��s/F|"���)�<����M��7<�����E�����=(�q�W����2��I��\��T���|���+�7{,~����3
l�>����u��0k7�0N>q)�������u�{~���"z<(#X�2>�������U����~���C��~��������A���"��1����#~���������"�z��\��\g1K�~�i����GV��g>�;Y(��A{E����d��w$]wy&�s.�V�E?A?b~I���~!A ���jr� P����c>D���,$�! �5�O;�n�p������A@�C����hG*2
�_�� �������'wW,�����dQ�^g��G�U��b}g�� �C���p��!��v}Ker�W�\%�� T>��[��\��H ]^�O2^s�'e\!A@��Xc�5��v���R��.w�@EA��C��h� g�u�P!A@�Y���O��XTD�M���8l�#$��E�����m������4�"���)�� ��79+��'J~��F�S���L��C,TH�2T��B����{V���g���=A e�z��y�s�a�����8)N���\��X^0!��;�O{�����o.�>3!<r� (���X�;�x5��o��:S[�D{������a�=�>�t|�-r^A@�# \�
�@"�g"�Pw- O�yF�R�$G��:���L:��
��M��|R[A��`��a�vR�� "�HH�C@������+��K��E�����X�ZR�H�\l�ASue�e�*�==�+�����S��}�]4����[��wkn�;�3I]3�^��.�� �7 9-�Z��v�n3�����a�G���@�B����k�5T��k�G���{��
�~�2�� ����:����,R�zVK���^�R� P �5�6��r$����r� Tv�aKu���*�.��m��>c����T����~��������|J��5��OA@A@���pc�$�e���+����j���x��������<A��!P���&�n���T�>���W�w��t8K+���k�����^�N�R� P��5�7��z k7n��_�F-��/���u�"�o��� ��WS;7�B��;��m�������
5��1���
�L��� �@ b�
F�� Q��u���9����lA�B!��l����'W�jUyA }V^��j���l_�>�Rb%F@��J�����4�h�rx��7�� A@(j�XC��SS�]�C�������n[�(uA@(T��[�-#�A@A@A@A@A@A@�J��p+]���� �� �� �� �� �� �@�" �Bm�� �� �� �� �� �� �� T:��[��\^XA@A@A@A@A@A@
1�j�H�A@A@A@A@A@A@A��! �J������ �� �� �� �� �� �� P���P[F�%�� �� �� �� �� �� �1�V�&�A@A@A@A@A@A@�BE@���2R/A@A@A@A@A@A@A@�t���5��� �� �� �� �� �� �� *b�-���z �� �� �� �� �� �� �@�C@������A@A@A@A@A@A@A�Pn����KA@A@A@A@A@A@*b��tM./,�� �� �� �� �� �� ���p�e�^�� �� �� �� �� P��R���R��Zi����S��W^�8�s������R�
��p+h��k �� �� �� �� �@�"���j��b�-�&��5n�n~yyA@( ��[ � UA@A@A@A@���@�jUU��B�@�" �m�� T��[i�Z^TA@A@A@A@�1��w��� �8(�5�� d��p��VJA@A@A@A@A��6�Z����E(T��[�-#�����p+KK�{
�� �� �� �� �@�"P}����m���<<.b����\'�@6�7\�TA@A�h�Z���z���������M�����@5�o}g��;s��^�*���*�Fi��:j���j���������V]Em�l=5��I�B�j�k��j�,R�g�)�*%�����-^�&O�>�<�
�Q
��Q_M�Q������A��# ����+�����5��� b�-�&�
"�l��j��j��_��/������B���IPM�WG��Z�t�zF������"�B�V{���<������U�H�������'���}�j�;������%��k7n��_�F-��/���uj�����_l��}�aj�=Z��o�Qw?�B�U�����Ym�
���+��5��S���X���4m�@������?U�K�y�(��{����Q/����~��b����+�����'U�����!�i#nq�����E���z~� m��_��?��(�!�JW�������V��Z�J����r�Tn�pJa�� $F@��!��������h����O�}��E�K(�_p��q��Mm^9Z=��+R3�Fy"p���C�o����M��>�<�"�.2��D7�`�Q�Wl�u3�������=�����Y��wmw��f�����P�=P.����R���������_g���f��{I����r�W�-�>v�>;�3N<�T�'�0zY�[s}�r�o�UVVw�n�q��#�[��-�:I*8�\���y����G]|��E7/��j���s�*�{$}r��J�i# ����A@H��p}��p�&j����7�����T�>U�CI� ^�C��W���2��z����+<F����@���Q���������U�=������pp��j�����O��^#+b`�8mY�o"}'q�(#�uj�P���B}���j�O�xO��]���R*��m��m�m��gmP�Z�q~���:K�4c�-ghr~|�>���������n���ST�����]��q�[���j��_��H�Y�����Fk�U���XiQ�D�_��z�p��b��|(��I����]��R];� ���p������� PQ�O����l�>Z#^~O=��H�+�PEG iX��j�]�3�G�;�3����w�"~?���t��%:���w��?.�}��6�P]��t�lm�y@��
����10em�/X�NtF�y�8��}������q��=�N��R���J�XA���}���H�j���_���z�B��q�XV������\�[[�|������{WY�����n����#�����a$�0t�\Y"��v���~��N�>)��Iu�����] ����n�l�
�Rb���M+/&E��p}j��m�>�oh������BUt�T����B`��T�{���t�F�Xq�Sq�_y�^�N4��Q4F�+�:�P���;#U�+n��.�;�q����@yp�l��:�G0�Y����+��c���Us