Re: [HACKERS] make async slave to wait for lsn to be replayed

Started by Kartyshov Ivanalmost 3 years ago116 messages
#1Kartyshov Ivan
i.kartyshov@postgrespro.ru
3 attachment(s)

Intro==========
The main purpose of the feature is to achieve
read-your-writes-consistency, while using async replica for reads and
primary for writes. In that case lsn of last modification is stored
inside
application. We cannot store this lsn inside database, since reads are
distributed across all replicas and primary.

/messages/by-id/195e2d07ead315b1620f1a053313f490@postgrespro.ru

Suggestions
==========
Lots of proposals were made how this feature may look like.
I aggregate them into the following four types.

1) Classic (wait_classic_v1.patch)
/messages/by-id/3cc883048264c2e9af022033925ff8db@postgrespro.ru
==========
advantages: multiple events, standalone WAIT
disadvantages: new words in grammar

WAIT FOR  [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
    [ WAIT FOR [ANY | ALL] event [, ...]]
where event is one of:
LSN value
TIMEOUT number_of_milliseconds
timestamp

2) After style: Kyotaro and Freund (wait_after_within_v1.patch)
/messages/by-id/d3ff2e363af60b345f82396992595a03@postgrespro.ru
==========
advantages: no new words in grammar, standalone AFTER
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
    [ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
    [ AFTER lsn_event [ WITHIN delay_milliseconds ]]

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch)
/messages/by-id/27171.1586439221@sss.pgh.pa.us
/messages/by-id/20210121.173009.235021120161403875.horikyota.ntt@gmail.com
==========
advantages: no new words in grammar,like it made in
pg_last_wal_replay_lsn, no snapshots need
disadvantages: a little harder to remember names
SELECT pg_waitlsn(‘LSN’, timeout);
SELECT pg_waitlsn_infinite(‘LSN’);
SELECT pg_waitlsn_no_wait(‘LSN’);

4) Brackets style: Kondratov
/messages/by-id/a8bff0350a27e0a87a6eaf0905d6737f@postgrespro.ru%C2%A0
==========
advantages: only one new word in grammar,like it made in VACUUM and
REINDEX, ability to extend parameters without grammar fixes
disadvantages: 
WAIT (LSN '16/B374D848', TIMEOUT 100);
BEGIN WAIT (LSN '16/B374D848' [, etc_options]);
...
COMMIT;

Consequence
==========
Below I provide the implementation of patches for the first three types.
I propose to discuss this feature again/

Regards

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

wait_after_within_v1.patchtext/x-diff; name=wait_after_within_v1.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..18695e013e 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a83ff4551e 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..e4d2d8d0a1 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			* If we replayed an LSN that someone was waiting for,
+			* set latches in shared memory array to notify the waiter.
+			*/
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+			{
+				 WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..9a3b8192a6
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,329 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	XLogRecPtr	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn)
+		state->min_lsn = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (state->min_lsn == lsn_to_delete)
+	{
+		state->min_lsn = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < state->min_lsn)
+				state->min_lsn = state->waited_lsn[i];
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn = PG_UINT64_MAX;
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return state->min_lsn;
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint		res = 0;
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int		res = 0;
+
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												 CStringGetDatum(stmt->lsn)));
+	res = WaitUtility(lsn, stmt->delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index e892df9819..390bf9e27c 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -399,7 +399,6 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
 		default:
 
 			/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a0138382a1..24105eca4e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -310,7 +310,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -642,6 +642,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <ival>		wait_time
+%type <node>		wait_for
 
 
 /*
@@ -1064,6 +1066,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10838,12 +10841,13 @@ TransactionStmt:
 					n->chain = $3;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					$$ = (Node *) n;
 				}
 			| COMMIT opt_transaction opt_transaction_chain
@@ -10931,12 +10935,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					$$ = (Node *) n;
 				}
 			| END_P opt_transaction opt_transaction_chain
@@ -15622,6 +15627,37 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				AFTER LSN_value [WITHIN delay timestamp]
+ *
+ *****************************************************************************/
+WaitStmt:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+		;
+wait_for:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; }
+		;
+
+wait_time:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 
 /*
  * Aggregate decoration clauses
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..b7963ca9a2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -295,6 +297,11 @@ CreateSharedMemoryAndSemaphores(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index c7d9d96b45..b2ed7a8b21 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2833,6 +2851,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3481,6 +3503,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f7d7f10f7d..5e4cc547c5 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3393,6 +3393,7 @@ typedef struct TransactionStmt
 	char	   *savepoint_name; /* for savepoint commands */
 	char	   *gid;			/* for two-phase-commit related commands */
 	bool		chain;			/* AND CHAIN option */
+	Node		*wait;			/* wait lsn clause */
 } TransactionStmt;
 
 /* ----------------------
@@ -3938,4 +3939,16 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		AFTER Statement + AFTER clause of BEGIN statement
+ * ----------------------
+ */
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn;		/* LSN */
+	int			delay;		/* TIMEOUT */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index e738ac1c09..a95657a32f 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -216,3 +216,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "AFTER", false, false, false)
diff --git a/src/test/recovery/t/034_begin_after.pl b/src/test/recovery/t/034_begin_after.pl
new file mode 100644
index 0000000000..b22e63603c
--- /dev/null
+++ b/src/test/recovery/t/034_begin_after.pl
@@ -0,0 +1,86 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "AFTER" command
+#===============================================================================
+# We need to check that AFTER works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ AFTER '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/035_after.pl b/src/test/recovery/t/035_after.pl
new file mode 100644
index 0000000000..480ba6b33b
--- /dev/null
+++ b/src/test/recovery/t/035_after.pl
@@ -0,0 +1,99 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 2;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"AFTER '$lsn2' WITHIN 1");
+ok($reached_lsn eq "f", "AFTER doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	AFTER '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
wait_classic_v1.patchtext/x-diff; name=wait_classic_v1.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..18695e013e 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..b3af16c09f 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..1b54ed2084 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 0000000000..26cae3ad85
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,146 @@
+<!--
+doc/src/sgml/ref/wait.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait">
+ <indexterm zone="sql-wait">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed or for specified time to pass</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>where <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN value
+    TIMEOUT number_of_milliseconds
+    timestamp
+
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>'
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ALL LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>, TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ANY LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>WAIT FOR</command> provides a simple interprocess communication
+   mechanism to wait for timestamp, timeout or the target log sequence number
+   (<acronym>LSN</acronym>) on standby in <productname>PostgreSQL</productname>
+   databases with master-standby asynchronous replication. When run with the
+   <replaceable>LSN</replaceable> option, the <command>WAIT FOR</command>
+   command waits for the specified <acronym>LSN</acronym> to be replayed.
+   If no timestamp or timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using <literal>Ctrl+C</literal>, or
+   by shutting down the <literal>postgres</literal> server.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">LSN</replaceable></term>
+    <listitem>
+     <para>
+      Specify the target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>TIMEOUT <replaceable class="parameter">wait_timeout</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time interval to wait for the LSN to be replayed.
+      The specified <replaceable>wait_timeout</replaceable> must be an integer
+      and is measured in milliseconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>UNTIL TIMESTAMP <replaceable class="parameter">wait_time</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time to wait for the LSN to be replayed.
+      The specified <replaceable>wait_time</replaceable> must be timestamp.
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Run <literal>WAIT FOR</literal> from <application>psql</application>,
+   limiting wait time to 10000 milliseconds:
+
+<screen>
+WAIT FOR LSN '0/3F07A6B1' TIMEOUT 10000;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-------------
+ f
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Wait until the specified <acronym>LSN</acronym> is replayed:
+<screen>
+WAIT FOR LSN '0/3F07A611';
+LSN reached
+-------------
+ t
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Limit <acronym>LSN</acronym> wait time to 500000 milliseconds,
+   and then cancel the command if <acronym>LSN</acronym> was not reached:
+<screen>
+WAIT FOR LSN '0/3F0FF791' TIMEOUT 500000;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-------------
+ f
+(1 row)
+</screen>
+</para>
+ </refsect1>
+
+ <refsect1>
+  <title>Compatibility</title>
+
+  <para>
+   There is no <command>WAIT FOR</command> statement in the SQL
+   standard.
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a83ff4551e 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..5b3e3bab0d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,13 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..0bb43feb71
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,394 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	XLogRecPtr	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn)
+		state->min_lsn = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (state->min_lsn == lsn_to_delete)
+	{
+		state->min_lsn = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < state->min_lsn)
+				state->min_lsn = state->waited_lsn[i];
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn = PG_UINT64_MAX;
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return state->min_lsn;
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint		res = 0;
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	ListCell   *events;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	float8		delay = 0;
+	float8		final_delay = 0;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	XLogRecPtr	final_lsn = InvalidXLogRecPtr;
+	bool		has_lsn = false;
+	bool		wait_forever = true;
+	int			res = 0;
+
+	if (stmt->strategy == WAIT_FOR_ANY)
+	{
+		/* Prepare to find minimum LSN and delay */
+		final_delay = DBL_MAX;
+		final_lsn = PG_UINT64_MAX;
+	}
+
+	/* Extract options from the statement node tree */
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+
+		/* LSN to wait for */
+		if (event->lsn)
+		{
+			has_lsn = true;
+			lsn = DatumGetLSN(
+						DirectFunctionCall1(pg_lsn_in,
+							CStringGetDatum(event->lsn)));
+
+			/*
+			 * When waiting for ALL, select max LSN to wait for.
+			 * When waiting for ANY, select min LSN to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_lsn <= lsn) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_lsn > lsn))
+			{
+				final_lsn = lsn;
+			}
+		}
+
+		/* Time delay to wait for */
+		if (event->time || event->delay)
+		{
+			wait_forever = false;
+
+			if (event->delay)
+				delay = event->delay / 1000.0;
+
+			if (event->time)
+			{
+				Const *time = (Const *) event->time;
+				delay = WaitTimeResolve(time);
+			}
+
+			/*
+			 * When waiting for ALL, select max delay to wait for.
+			 * When waiting for ANY, select min delay to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_delay <= delay) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_delay > delay))
+			{
+				final_delay = delay;
+			}
+		}
+	}
+	if (wait_forever)
+		final_delay = 0;
+	if (!has_lsn)
+		final_lsn = InvalidXLogRecPtr;
+
+	res = WaitUtility(final_lsn, final_delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index e892df9819..c4a303748e 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -84,6 +84,7 @@ static Query *transformCreateTableAsStmt(ParseState *pstate,
 										 CreateTableAsStmt *stmt);
 static Query *transformCallStmt(ParseState *pstate,
 								CallStmt *stmt);
+static void transformWaitForStmt(ParseState *pstate, WaitStmt *stmt);
 static void transformLockingClause(ParseState *pstate, Query *qry,
 								   LockingClause *lc, bool pushedDown);
 #ifdef RAW_EXPRESSION_COVERAGE_TEST
@@ -399,7 +400,20 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
+		case T_WaitStmt:
+			transformWaitForStmt(pstate, (WaitStmt *) parseTree);
+			result = makeNode(Query);
+			result->commandType = CMD_UTILITY;
+			result->utilityStmt = (Node *) parseTree;
+			break;
+		case T_TransactionStmt:
+			{
+				TransactionStmt *stmt = (TransactionStmt *) parseTree;
+				if ((stmt->kind == TRANS_STMT_BEGIN ||
+						stmt->kind == TRANS_STMT_START) && stmt->wait)
+					transformWaitForStmt(pstate, (WaitStmt *) stmt->wait);
+			}
+			/* no break here - we want to fall through to the default */
 		default:
 
 			/*
@@ -3492,6 +3506,23 @@ applyLockingClause(Query *qry, Index rtindex,
 	qry->rowMarks = lappend(qry->rowMarks, rc);
 }
 
+/*
+ * transformWaitForStmt -
+ *	transform the WAIT FOR clause of the BEGIN statement
+ *	transform the WAIT FOR statement (TODO: remove this line if we don't keep it)
+ */
+static void
+transformWaitForStmt(ParseState *pstate, WaitStmt *stmt)
+{
+	ListCell   *events;
+
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+		event->time = transformExpr(pstate, event->time, EXPR_KIND_OTHER);
+	}
+}
+
 /*
  * Coverage testing for raw_expression_tree_walker().
  *
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a0138382a1..03708b5221 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -310,7 +310,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -534,7 +534,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	case_expr case_arg when_clause case_default
 %type <list>	when_clause_list
 %type <node>	opt_search_clause opt_cycle_clause
-%type <ival>	sub_type opt_materialized
+%type <ival>	sub_type wait_strategy opt_materialized
 %type <node>	NumericOnly
 %type <list>	NumericOnly_list
 %type <alias>	alias_clause opt_alias_clause opt_alias_clause_for_join_using
@@ -642,6 +642,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <list>		wait_list
+%type <node>		WaitEvent wait_for
 
 
 /*
@@ -712,7 +714,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -745,7 +747,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	SUBSCRIPTION SUBSTRING SUPPORT SYMMETRIC SYSID SYSTEM_P SYSTEM_USER
 
 	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN
-	TIES TIME TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
+	TIES TIME TIMEOUT TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
@@ -755,7 +757,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW
+	WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1064,6 +1067,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10838,12 +10842,13 @@ TransactionStmt:
 					n->chain = $3;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					$$ = (Node *) n;
 				}
 			| COMMIT opt_transaction opt_transaction_chain
@@ -10931,12 +10936,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					$$ = (Node *) n;
 				}
 			| END_P opt_transaction opt_transaction_chain
@@ -15622,6 +15628,74 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				WAIT FOR <event> [, <event> ...]
+ *				event is one of:
+ *					LSN value
+ *					TIMEOUT delay
+ *					timestamp
+ *
+ *****************************************************************************/
+WaitStmt:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			;
+wait_for:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; };
+
+wait_strategy:
+			ALL					{ $$ = WAIT_FOR_ALL; }
+			| ANY				{ $$ = WAIT_FOR_ANY; }
+			| /* EMPTY */		{ $$ = WAIT_FOR_ALL; }
+		;
+
+wait_list:
+			WaitEvent					{ $$ = list_make1($1); }
+			| wait_list ',' WaitEvent	{ $$ = lappend($1, $3); }
+			| wait_list WaitEvent		{ $$ = lappend($1, $2); }
+		;
+
+WaitEvent:
+			LSN Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = 0;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| TIMEOUT Iconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = $2;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| ConstDatetime Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = 0;
+					n->time = makeStringConstCast($2, @2, $1);
+					$$ = (Node *)n;
+				}
+			;
+
 
 /*
  * Aggregate decoration clauses
@@ -16853,6 +16927,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -16984,6 +17059,7 @@ unreserved_keyword:
 			| TEMPORARY
 			| TEXT_P
 			| TIES
+			| TIMEOUT
 			| TRANSACTION
 			| TRANSFORM
 			| TRIGGER
@@ -17010,6 +17086,7 @@ unreserved_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -17424,6 +17501,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -17585,6 +17663,7 @@ bare_label_keyword:
 			| THEN
 			| TIES
 			| TIME
+			| TIMEOUT
 			| TIMESTAMP
 			| TRAILING
 			| TRANSACTION
@@ -17622,6 +17701,7 @@ bare_label_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..106c6ac594 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -295,6 +297,11 @@ CreateSharedMemoryAndSemaphores(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 
+	/*
+	 * Init array of Latches in shared memory for WAIT
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index c7d9d96b45..b2ed7a8b21 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2833,6 +2851,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3481,6 +3503,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index f7d7f10f7d..0a1ac70991 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3393,6 +3393,7 @@ typedef struct TransactionStmt
 	char	   *savepoint_name; /* for savepoint commands */
 	char	   *gid;			/* for two-phase-commit related commands */
 	bool		chain;			/* AND CHAIN option */
+	Node		*wait;			/* WAIT clause: list of events to wait for */
 } TransactionStmt;
 
 /* ----------------------
@@ -3938,4 +3939,26 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		WAIT FOR Statement + WAIT FOR clause of BEGIN statement
+ *		TODO: if we only pick one, remove the other
+ * ----------------------
+ */
+
+typedef enum WaitForStrategy
+{
+	WAIT_FOR_ANY = 0,
+	WAIT_FOR_ALL
+} WaitForStrategy;
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	WaitForStrategy strategy;
+	List	   *events;		/* used as a pointer to the next WAIT event */
+	char	   *lsn;		/* WAIT FOR LSN */
+	int			delay;		/* WAIT FOR TIMEOUT */
+	Node	   *time;		/* WAIT FOR TIMESTAMP or TIME */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index bb36213e6f..13b16c4d94 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -249,6 +249,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -421,6 +422,7 @@ PG_KEYWORD("text", TEXT_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("then", THEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("ties", TIES, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("time", TIME, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("timeout", TIMEOUT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("timestamp", TIMESTAMP, COL_NAME_KEYWORD, BARE_LABEL)
 PG_KEYWORD("to", TO, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("trailing", TRAILING, RESERVED_KEYWORD, BARE_LABEL)
@@ -461,6 +463,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index e738ac1c09..c699cbb175 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -216,3 +216,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT FOR", false, false, false)
diff --git a/src/test/recovery/t/034_begin_wait.pl b/src/test/recovery/t/034_begin_wait.pl
new file mode 100644
index 0000000000..f1e5b5b23d
--- /dev/null
+++ b/src/test/recovery/t/034_begin_wait.pl
@@ -0,0 +1,146 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "WAIT FOR" command
+#===============================================================================
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ WAIT FOR LSN '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/035_wait.pl b/src/test/recovery/t/035_wait.pl
new file mode 100644
index 0000000000..6f9d549416
--- /dev/null
+++ b/src/test/recovery/t/035_wait.pl
@@ -0,0 +1,145 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
wait_proc_v1.patchtext/x-diff; name=wait_proc_v1.patchDownload
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..422bb1ed82 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..3f94ef4ec3
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,291 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements waitlsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xact.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+				state->lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/*
+ * Set latches in shared memory to signal that new LSN has been replayed
+ */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int			backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+
+		if (backend && state->lsn[i] != 0 &&
+			state->lsn[i] <= cur_lsn)
+		{
+			SetLatch(&backend->procLatch);
+		}
+	}
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Get minimal LSN that someone waits for
+ */
+XLogRecPtr
+GetMinWaitedLSN(void)
+{
+	return state->min_lsn.value;
+}
+
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens. Timeout is specified in milliseconds.
+ * Returns true if LSN was reached and false otherwise.
+ */
+bool
+WaitUtility(XLogRecPtr target_lsn, const int timeout_ms)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	bool		res = false;
+	bool		wait_forever = (timeout_ms <= 0);
+
+	/*
+	 * In transactions, that have isolation level repeatable read or higher
+	 * waitlsn creates a snapshot if called first in a block, which can
+	 * lead the transaction to working incorrectly
+	 */
+
+	if (IsTransactionBlock() && XactIsoLevel != XACT_READ_COMMITTED) {
+		ereport(WARNING,
+				errmsg("Waitlsn may work incorrectly in this isolation level"),
+				errhint("Call waitlsn before starting the transaction"));
+	}
+
+	endtime = GetNowFloat() + timeout_ms / 1000.0;
+
+	latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	/* Check if we already reached the needed LSN */
+	if (cur_lsn >= target_lsn)
+		return true;
+
+	AddWaitedLSN(target_lsn);
+
+	for (;;)
+	{
+		int			rc;
+		float8		time_left = 0;
+		long		time_left_ms = 0;
+
+		time_left = endtime - GetNowFloat();
+
+		/* Use 100 ms as the default timeout to check for interrupts */
+		if (wait_forever || time_left < 0 || time_left > 0.1)
+			time_left_ms = 100;
+		else
+			time_left_ms = (long) ceil(time_left * 1000.0);
+
+		/* If interrupt, LockErrorCleanup() will do DeleteWaitedLSN() for us */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, time_left_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		ResetLatch(MyLatch);
+
+		if (rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		if (rc & WL_TIMEOUT)
+		{
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+			/* If the time specified by user has passed, stop waiting */
+			time_left = endtime - GetNowFloat();
+			if (!wait_forever && time_left <= 0.0)
+				break;
+		}
+
+		/* If LSN has been replayed */
+		if (target_lsn <= cur_lsn)
+			break;
+	}
+
+	DeleteWaitedLSN();
+
+	if (cur_lsn < target_lsn)
+		ereport(WARNING,
+				 errmsg("LSN was not reached"),
+				 errhint("Try to increase wait time."));
+	else
+		res = true;
+
+	return res;
+}
+
+Datum
+pg_waitlsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+	uint64_t		delay = PG_GETARG_INT32(1);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, delay));
+}
+
+Datum
+pg_waitlsn_infinite(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, 0));
+}
+
+Datum
+pg_waitlsn_no_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, 1));
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6446da2d6..e148c0ea24 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -59,6 +59,7 @@
 #include "access/xlogrecovery.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
+#include "commands/wait.h"
 #include "common/ip.h"
 #include "funcapi.h"
 #include "libpq/pqformat.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..760d760356 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -295,6 +297,11 @@ CreateSharedMemoryAndSemaphores(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 
+	/*
+	 * Init array of events for the wait clause in shared memory
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..313da0b1fc 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -704,6 +705,9 @@ LockErrorCleanup(void)
 
 	AbortStrongLockAcquire();
 
+	/* If waitlsn was interrupted, then stop waiting for that LSN */
+	DeleteWaitedLSN();
+
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
 	{
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 220ddb8c01..fd731595b2 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -384,8 +384,6 @@ pg_sleep(PG_FUNCTION_ARGS)
 	 * less than the specified time when WaitLatch is terminated early by a
 	 * non-query-canceling signal such as SIGHUP.
 	 */
-#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
-
 	endtime = GetNowFloat() + secs;
 
 	for (;;)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66b73c3900..99b0c02c34 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11927,4 +11927,19 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_waitlsn', prorettype => 'bool', proargtypes => 'pg_lsn int8',
+  proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_waitlsn' },
+
+{ oid => '16388', descr => 'wait for LSN for an infinite time',
+  proname => 'pg_waitlsn_infinite', prorettype => 'bool', proargtypes => 'pg_lsn',
+  proargnames => '{trg_lsn}',
+  prosrc => 'pg_waitlsn_infinite' },
+
+{ oid => '16389', descr => 'wait for LSN with no timeout',
+  proname => 'pg_waitlsn_no_wait', prorettype => 'bool', proargtypes => 'pg_lsn',
+  proargnames => '{trg_lsn}',
+  prosrc => 'pg_waitlsn_no_wait' },
+
 ]
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..fd21e43416
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+#include "nodes/parsenodes.h"
+
+extern bool WaitUtility(XLogRecPtr lsn, const int timeout_ms);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWaitedLSN(void);
+extern void DeleteWaitedLSN(void);
+
+#endif							/* WAIT_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index edd59dc432..db2926f965 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -140,4 +140,6 @@ extern int	date2isoyearday(int year, int mon, int mday);
 
 extern bool TimestampTimestampTzRequiresRewrite(void);
 
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)
+
 #endif							/* TIMESTAMP_H */
diff --git a/src/test/recovery/t/034_waitlsn.pl b/src/test/recovery/t/034_waitlsn.pl
new file mode 100644
index 0000000000..54a20e089c
--- /dev/null
+++ b/src/test/recovery/t/034_waitlsn.pl
@@ -0,0 +1,74 @@
+# Checks waitlsn
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Using the backup, create a streaming standby with a 1 second delay
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# Check that timeouts make us wait for the specified time (1s here)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $two_seconds = 2000; # in milliseconds
+my $start_time = time();
+$node_standby->safe_psql('postgres',
+	"SELECT pg_waitlsn('0/FFFFFFFF', $two_seconds)");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $two_seconds, "waitlsn waits for enough time");
+
+# Check that timeouts let us stop waiting right away, before reaching target LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my ($ret, $out, $err) = $node_standby->psql('postgres',
+	"SELECT pg_waitlsn('$lsn1', 1)");
+
+ok($ret == 0, "zero return value when failed to waitlsn on standby");
+ok($err =~ /WARNING:  LSN was not reached/,
+	"correct error message when failed to waitlsn on standby");
+ok($out eq "f", "if given too little wait time, WAIT doesn't reach target LSN");
+
+
+# Check that waitlsn works fine and reaches target LSN if given no timeout
+
+# Add data on primary, memorize primary's last LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Wait for it to appear on replica, memorize replica's last LSN
+$node_standby->safe_psql('postgres',
+	"SELECT pg_waitlsn_infinite('$lsn2')");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+
+# Make sure that primary's and replica's LSNs are the same after WAIT
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$reached_lsn'::pg_lsn, '$lsn2'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"standby reached the same LSN as primary before starting transaction");
+
+$node_standby->stop;
+$node_primary->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 22ea42c16b..6413a661cf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2964,6 +2964,7 @@ WaitEventIPC
 WaitEventSet
 WaitEventTimeout
 WaitPMResult
+WaitState
 WalCloseMethod
 WalCompression
 WalLevel
#2Greg Stark
stark@mit.edu
In reply to: Kartyshov Ivan (#1)

On Tue, 28 Feb 2023 at 05:13, Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:

Below I provide the implementation of patches for the first three types.
I propose to discuss this feature again/

Oof, that doesn't really work with the cfbot. It tries to apply all
three patches and of course the second and third fail to apply.

In any case this seems like a lot of effort to me. I would suggest you
just pick one avenue and provide that patch for discussion and just
ask whether people would prefer any of the alternative syntaxes.

Fwiw I prefer the functions approach. I do like me some nice syntax
but I don't see any particular advantage of the special syntax in this
case. They don't seem to provide any additional expressiveness.

That said, I'm not a fan of the specific function names. Remember that
we have polymorphic functions so you could probably just have an
option argument:

pg_lsn_wait('LSN', [timeout]) returns boolean

(just call it with a timeout of 0 to do a no-wait)

I'll set the patch to "Waiting on Author" for now. If you feel you're
still looking for more opinions from others maybe set it back to Needs
Review but honestly there are a lot of patches so you probably won't
see much this commitfest unless you have a patch that shows in
cfbot.cputube.org as applying and which looks ready to commit.

--
greg

#3Michael Paquier
michael@paquier.xyz
In reply to: Greg Stark (#2)

On Wed, Mar 01, 2023 at 03:31:06PM -0500, Greg Stark wrote:

Fwiw I prefer the functions approach. I do like me some nice syntax
but I don't see any particular advantage of the special syntax in this
case. They don't seem to provide any additional expressiveness.

So do I, eventhough I saw a point that sticking to a function or a
procedure approach makes the wait stick with more MVCC rules, like the
fact that the wait may be holding a snapshot for longer than
necessary. The grammar can be more extensible without more keywords
with DefElems, still I'd like to think that we should not introduce
more restrictions in the parser if we have ways to work around it.
Using a procedure or function approach is more extensible in its own
ways, and it also depends on the data waiting for (there could be more
than one field as well for a single wait pattern?).

I'll set the patch to "Waiting on Author" for now. If you feel you're
still looking for more opinions from others maybe set it back to Needs
Review but honestly there are a lot of patches so you probably won't
see much this commitfest unless you have a patch that shows in
cfbot.cputube.org as applying and which looks ready to commit.

While looking at all the patches proposed, I have noticed that all the
approaches proposed force a wakeup of the waiters in the redo loop of
the startup process for each record, before reading the next record.
It strikes me that there is some interaction with custom resource
managers here, where it is possible to poke at the waiters not for
each record, but after reading some specific records. Something
out-of-core would not be as responsive as the per-record approach,
still responsive enough that the waiters wait on input for an
acceptable amount of time, depending on the frequency of the records
generated by a primary to wake them up. Just something that popped
into my mind while looking a bit at the threads.
--
Michael

#4Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Kartyshov Ivan (#1)

On 28.02.23 11:10, Kartyshov Ivan wrote:

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v1.patch)
/messages/by-id/27171.1586439221@sss.pgh.pa.us
/messages/by-id/20210121.173009.235021120161403875.horikyota.ntt@gmail.com
==========
advantages: no new words in grammar,like it made in
pg_last_wal_replay_lsn, no snapshots need
disadvantages: a little harder to remember names
SELECT pg_waitlsn(‘LSN’, timeout);
SELECT pg_waitlsn_infinite(‘LSN’);
SELECT pg_waitlsn_no_wait(‘LSN’);

Of the presented options, I prefer this one. (Maybe with a "_" between
"wait" and "lsn".)

But I wonder how a client is going to get the LSN. How would all of
this be used by a client? I can think of a scenarios where you have an
application that issues a bunch of SQL commands and you have some kind
of pooler in the middle that redirects those commands to different
hosts, and what you really want is to have it transparently behave as if
it's just a single host. Do we want to inject a bunch of "SELECT
pg_get_lsn()", "SELECT pg_wait_lsn()" calls into that?

I'm tempted to think this could be a protocol-layer facility. Every
query automatically returns the current LSN, and every query can also
send along an LSN to wait for, and the client library would just keep
track of the LSN for (what it thinks of as) the connection. So you get
some automatic serialization without having to modify your client code.

That said, exposing this functionality using functions could be a valid
step in that direction, so that you can at least build out the actual
internals of the functionality and test it out.

#5Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Peter Eisentraut (#4)
1 attachment(s)

Here I made new patch of feature, discussed above.

WAIT FOR procedure - waits for certain lsn on pause
==========
Synopsis
==========
SELECT pg_wait_lsn(‘LSN’, timeout) returns boolean

Where timeout = 0, will wait infinite without timeout
And if timeout = 1, then just check if lsn was replayed

How to use it
==========

Greg Stark wrote:

That said, I'm not a fan of the specific function names. Remember that
we have polymorphic functions so you could probably just have an
option argument:

If you have any example, I will be glade to see them. Ьy searches have
not been fruitful.

Michael Paquier wrote:

While looking at all the patches proposed, I have noticed that all the
approaches proposed force a wakeup of the waiters in the redo loop of
the startup process for each record, before reading the next record.
It strikes me that there is some interaction with custom resource
managers here, where it is possible to poke at the waiters not for
each record, but after reading some specific records. Something
out-of-core would not be as responsive as the per-record approach,
still responsive enough that the waiters wait on input for an
acceptable amount of time, depending on the frequency of the records
generated by a primary to wake them up. Just something that popped
into my mind while looking a bit at the threads.

I`ll work on this idea to have less impact on the redo system.

On 2023-03-02 13:33, Peter Eisentraut wrote:

But I wonder how a client is going to get the LSN. How would all of
this be used by a client?

As I wrote earlier main purpose of the feature is to achieve
read-your-writes-consistency, while using async replica for reads and
primary for writes. In that case lsn of last modification is stored
inside application.

I'm tempted to think this could be a protocol-layer facility. Every
query automatically returns the current LSN, and every query can also
send along an LSN to wait for, and the client library would just keep
track of the LSN for (what it thinks of as) the connection. So you
get some automatic serialization without having to modify your client
code.

Yes it sounds very tempted. But I think community will be against it.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

wait_proc_v2.patchtext/x-diff; name=wait_proc_v2.patchDownload
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..422bb1ed82 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..3465673f0a
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,275 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xact.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+				state->lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/*
+ * Set latches in shared memory to signal that new LSN has been replayed
+ */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int			backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+
+		if (backend && state->lsn[i] != 0 &&
+			state->lsn[i] <= cur_lsn)
+		{
+			SetLatch(&backend->procLatch);
+		}
+	}
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Get minimal LSN that someone waits for
+ */
+XLogRecPtr
+GetMinWaitedLSN(void)
+{
+	return state->min_lsn.value;
+}
+
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens. Timeout is specified in milliseconds.
+ * Returns true if LSN was reached and false otherwise.
+ */
+bool
+WaitUtility(XLogRecPtr target_lsn, const int timeout_ms)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	bool		res = false;
+	bool		wait_forever = (timeout_ms <= 0);
+
+	/*
+	 * In transactions, that have isolation level repeatable read or higher
+	 * wait lsn creates a snapshot if called first in a block, which can
+	 * lead the transaction to working incorrectly
+	 */
+
+	if (IsTransactionBlock() && XactIsoLevel != XACT_READ_COMMITTED) {
+		ereport(WARNING,
+				errmsg("Waitlsn may work incorrectly in this isolation level"),
+				errhint("Call wait lsn before starting the transaction"));
+	}
+
+	endtime = GetNowFloat() + timeout_ms / 1000.0;
+
+	latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	/* Check if we already reached the needed LSN */
+	if (cur_lsn >= target_lsn)
+		return true;
+
+	AddWaitedLSN(target_lsn);
+
+	for (;;)
+	{
+		int			rc;
+		float8		time_left = 0;
+		long		time_left_ms = 0;
+
+		time_left = endtime - GetNowFloat();
+
+		/* Use 100 ms as the default timeout to check for interrupts */
+		if (wait_forever || time_left < 0 || time_left > 0.1)
+			time_left_ms = 100;
+		else
+			time_left_ms = (long) ceil(time_left * 1000.0);
+
+		/* If interrupt, LockErrorCleanup() will do DeleteWaitedLSN() for us */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, time_left_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		ResetLatch(MyLatch);
+
+		if (rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		if (rc & WL_TIMEOUT)
+		{
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+			/* If the time specified by user has passed, stop waiting */
+			time_left = endtime - GetNowFloat();
+			if (!wait_forever && time_left <= 0.0)
+				break;
+		}
+
+		/* If LSN has been replayed */
+		if (target_lsn <= cur_lsn)
+			break;
+	}
+
+	DeleteWaitedLSN();
+
+	if (cur_lsn < target_lsn)
+		ereport(WARNING,
+				 errmsg("LSN was not reached"),
+				 errhint("Try to increase wait time."));
+	else
+		res = true;
+
+	return res;
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+	uint64_t		delay = PG_GETARG_INT32(1);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, delay));
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..760d760356 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -295,6 +297,11 @@ CreateSharedMemoryAndSemaphores(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 
+	/*
+	 * Init array of events for the wait clause in shared memory
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..1dec418992 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -704,6 +705,9 @@ LockErrorCleanup(void)
 
 	AbortStrongLockAcquire();
 
+	/* If wait lsn was interrupted, then stop waiting for that LSN */
+	DeleteWaitedLSN();
+
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
 	{
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 220ddb8c01..fd731595b2 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -384,8 +384,6 @@ pg_sleep(PG_FUNCTION_ARGS)
 	 * less than the specified time when WaitLatch is terminated early by a
 	 * non-query-canceling signal such as SIGHUP.
 	 */
-#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
-
 	endtime = GetNowFloat() + secs;
 
 	for (;;)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66b73c3900..04dde03d72 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11927,4 +11927,9 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prorettype => 'bool', proargtypes => 'pg_lsn int8',
+  proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 ]
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..fd21e43416
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+#include "nodes/parsenodes.h"
+
+extern bool WaitUtility(XLogRecPtr lsn, const int timeout_ms);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWaitedLSN(void);
+extern void DeleteWaitedLSN(void);
+
+#endif							/* WAIT_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index edd59dc432..db2926f965 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -140,4 +140,6 @@ extern int	date2isoyearday(int year, int mon, int mday);
 
 extern bool TimestampTimestampTzRequiresRewrite(void);
 
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)
+
 #endif							/* TIMESTAMP_H */
diff --git a/src/test/recovery/t/034_waitlsn.pl b/src/test/recovery/t/034_waitlsn.pl
new file mode 100644
index 0000000000..dc9e899671
--- /dev/null
+++ b/src/test/recovery/t/034_waitlsn.pl
@@ -0,0 +1,76 @@
+# Checks wait lsn
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Using the backup, create a streaming standby with a 1 second delay
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# Check that timeouts make us wait for the specified time (1s here)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $two_seconds = 2000; # in milliseconds
+my $start_time = time();
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('0/FFFFFFFF', $two_seconds)");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $two_seconds, "wait lsn waits for enough time");
+
+# Check that timeouts let us stop waiting right away, before reaching target LSN
+# Wait for no wait
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my ($ret, $out, $err) = $node_standby->psql('postgres',
+	"SELECT pg_wait_lsn('$lsn1', 1)");
+
+ok($ret == 0, "zero return value when failed to wait lsn on standby");
+ok($err =~ /WARNING:  LSN was not reached/,
+	"correct error message when failed to wait lsn on standby");
+ok($out eq "f", "if given too little wait time, WAIT doesn't reach target LSN");
+
+
+# Check that wait lsn works fine and reaches target LSN if given no timeout
+# Wait for infinite
+
+# Add data on primary, memorize primary's last LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Wait for it to appear on replica, memorize replica's last LSN
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('$lsn2', 0)");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+
+# Make sure that primary's and replica's LSNs are the same after WAIT
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$reached_lsn'::pg_lsn, '$lsn2'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"standby reached the same LSN as primary before starting transaction");
+
+$node_standby->stop;
+$node_primary->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 22ea42c16b..6413a661cf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2964,6 +2964,7 @@ WaitEventIPC
 WaitEventSet
 WaitEventTimeout
 WaitPMResult
+WaitState
 WalCloseMethod
 WalCompression
 WalLevel
#6Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Peter Eisentraut (#4)
1 attachment(s)

Update patch to fix conflict with master
--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

wait_proc_v3.patchtext/x-diff; name=wait_proc_v3.patchDownload
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..422bb1ed82 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..3465673f0a
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,275 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xact.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+				state->lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/*
+ * Set latches in shared memory to signal that new LSN has been replayed
+ */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int			backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+
+		if (backend && state->lsn[i] != 0 &&
+			state->lsn[i] <= cur_lsn)
+		{
+			SetLatch(&backend->procLatch);
+		}
+	}
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Get minimal LSN that someone waits for
+ */
+XLogRecPtr
+GetMinWaitedLSN(void)
+{
+	return state->min_lsn.value;
+}
+
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens. Timeout is specified in milliseconds.
+ * Returns true if LSN was reached and false otherwise.
+ */
+bool
+WaitUtility(XLogRecPtr target_lsn, const int timeout_ms)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	bool		res = false;
+	bool		wait_forever = (timeout_ms <= 0);
+
+	/*
+	 * In transactions, that have isolation level repeatable read or higher
+	 * wait lsn creates a snapshot if called first in a block, which can
+	 * lead the transaction to working incorrectly
+	 */
+
+	if (IsTransactionBlock() && XactIsoLevel != XACT_READ_COMMITTED) {
+		ereport(WARNING,
+				errmsg("Waitlsn may work incorrectly in this isolation level"),
+				errhint("Call wait lsn before starting the transaction"));
+	}
+
+	endtime = GetNowFloat() + timeout_ms / 1000.0;
+
+	latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	/* Check if we already reached the needed LSN */
+	if (cur_lsn >= target_lsn)
+		return true;
+
+	AddWaitedLSN(target_lsn);
+
+	for (;;)
+	{
+		int			rc;
+		float8		time_left = 0;
+		long		time_left_ms = 0;
+
+		time_left = endtime - GetNowFloat();
+
+		/* Use 100 ms as the default timeout to check for interrupts */
+		if (wait_forever || time_left < 0 || time_left > 0.1)
+			time_left_ms = 100;
+		else
+			time_left_ms = (long) ceil(time_left * 1000.0);
+
+		/* If interrupt, LockErrorCleanup() will do DeleteWaitedLSN() for us */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, time_left_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		ResetLatch(MyLatch);
+
+		if (rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		if (rc & WL_TIMEOUT)
+		{
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+			/* If the time specified by user has passed, stop waiting */
+			time_left = endtime - GetNowFloat();
+			if (!wait_forever && time_left <= 0.0)
+				break;
+		}
+
+		/* If LSN has been replayed */
+		if (target_lsn <= cur_lsn)
+			break;
+	}
+
+	DeleteWaitedLSN();
+
+	if (cur_lsn < target_lsn)
+		ereport(WARNING,
+				 errmsg("LSN was not reached"),
+				 errhint("Try to increase wait time."));
+	else
+		res = true;
+
+	return res;
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+	uint64_t		delay = PG_GETARG_INT32(1);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, delay));
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..760d760356 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -295,6 +297,11 @@ CreateSharedMemoryAndSemaphores(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 
+	/*
+	 * Init array of events for the wait clause in shared memory
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..1dec418992 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -704,6 +705,9 @@ LockErrorCleanup(void)
 
 	AbortStrongLockAcquire();
 
+	/* If wait lsn was interrupted, then stop waiting for that LSN */
+	DeleteWaitedLSN();
+
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
 	{
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 5d78d6dc06..fbc96db198 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -384,8 +384,6 @@ pg_sleep(PG_FUNCTION_ARGS)
 	 * less than the specified time when WaitLatch is terminated early by a
 	 * non-query-canceling signal such as SIGHUP.
 	 */
-#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
-
 	endtime = GetNowFloat() + secs;
 
 	for (;;)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 505595620e..0979c25217 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11931,6 +11931,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prorettype => 'bool', proargtypes => 'pg_lsn int8',
+  proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '8981', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..fd21e43416
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+#include "nodes/parsenodes.h"
+
+extern bool WaitUtility(XLogRecPtr lsn, const int timeout_ms);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWaitedLSN(void);
+extern void DeleteWaitedLSN(void);
+
+#endif							/* WAIT_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index edd59dc432..db2926f965 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -140,4 +140,6 @@ extern int	date2isoyearday(int year, int mon, int mday);
 
 extern bool TimestampTimestampTzRequiresRewrite(void);
 
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)
+
 #endif							/* TIMESTAMP_H */
diff --git a/src/test/recovery/t/034_waitlsn.pl b/src/test/recovery/t/034_waitlsn.pl
new file mode 100644
index 0000000000..dc9e899671
--- /dev/null
+++ b/src/test/recovery/t/034_waitlsn.pl
@@ -0,0 +1,76 @@
+# Checks wait lsn
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Using the backup, create a streaming standby with a 1 second delay
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# Check that timeouts make us wait for the specified time (1s here)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $two_seconds = 2000; # in milliseconds
+my $start_time = time();
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('0/FFFFFFFF', $two_seconds)");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $two_seconds, "wait lsn waits for enough time");
+
+# Check that timeouts let us stop waiting right away, before reaching target LSN
+# Wait for no wait
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my ($ret, $out, $err) = $node_standby->psql('postgres',
+	"SELECT pg_wait_lsn('$lsn1', 1)");
+
+ok($ret == 0, "zero return value when failed to wait lsn on standby");
+ok($err =~ /WARNING:  LSN was not reached/,
+	"correct error message when failed to wait lsn on standby");
+ok($out eq "f", "if given too little wait time, WAIT doesn't reach target LSN");
+
+
+# Check that wait lsn works fine and reaches target LSN if given no timeout
+# Wait for infinite
+
+# Add data on primary, memorize primary's last LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Wait for it to appear on replica, memorize replica's last LSN
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('$lsn2', 0)");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+
+# Make sure that primary's and replica's LSNs are the same after WAIT
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$reached_lsn'::pg_lsn, '$lsn2'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"standby reached the same LSN as primary before starting transaction");
+
+$node_standby->stop;
+$node_primary->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 86a9303bf5..374f1bc267 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2968,6 +2968,7 @@ WaitEventIPC
 WaitEventSet
 WaitEventTimeout
 WaitPMResult
+WaitState
 WalCloseMethod
 WalCompression
 WalLevel
#7Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Peter Eisentraut (#4)
1 attachment(s)

Fix build.meson troubles

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

wait_proc_v4.patchtext/x-diff; name=wait_proc_v4.patchDownload
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..422bb1ed82 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..3465673f0a
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,275 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xact.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+				state->lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/*
+ * Set latches in shared memory to signal that new LSN has been replayed
+ */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int			backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+
+		if (backend && state->lsn[i] != 0 &&
+			state->lsn[i] <= cur_lsn)
+		{
+			SetLatch(&backend->procLatch);
+		}
+	}
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Get minimal LSN that someone waits for
+ */
+XLogRecPtr
+GetMinWaitedLSN(void)
+{
+	return state->min_lsn.value;
+}
+
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens. Timeout is specified in milliseconds.
+ * Returns true if LSN was reached and false otherwise.
+ */
+bool
+WaitUtility(XLogRecPtr target_lsn, const int timeout_ms)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	bool		res = false;
+	bool		wait_forever = (timeout_ms <= 0);
+
+	/*
+	 * In transactions, that have isolation level repeatable read or higher
+	 * wait lsn creates a snapshot if called first in a block, which can
+	 * lead the transaction to working incorrectly
+	 */
+
+	if (IsTransactionBlock() && XactIsoLevel != XACT_READ_COMMITTED) {
+		ereport(WARNING,
+				errmsg("Waitlsn may work incorrectly in this isolation level"),
+				errhint("Call wait lsn before starting the transaction"));
+	}
+
+	endtime = GetNowFloat() + timeout_ms / 1000.0;
+
+	latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	/* Check if we already reached the needed LSN */
+	if (cur_lsn >= target_lsn)
+		return true;
+
+	AddWaitedLSN(target_lsn);
+
+	for (;;)
+	{
+		int			rc;
+		float8		time_left = 0;
+		long		time_left_ms = 0;
+
+		time_left = endtime - GetNowFloat();
+
+		/* Use 100 ms as the default timeout to check for interrupts */
+		if (wait_forever || time_left < 0 || time_left > 0.1)
+			time_left_ms = 100;
+		else
+			time_left_ms = (long) ceil(time_left * 1000.0);
+
+		/* If interrupt, LockErrorCleanup() will do DeleteWaitedLSN() for us */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, time_left_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		ResetLatch(MyLatch);
+
+		if (rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		if (rc & WL_TIMEOUT)
+		{
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+			/* If the time specified by user has passed, stop waiting */
+			time_left = endtime - GetNowFloat();
+			if (!wait_forever && time_left <= 0.0)
+				break;
+		}
+
+		/* If LSN has been replayed */
+		if (target_lsn <= cur_lsn)
+			break;
+	}
+
+	DeleteWaitedLSN();
+
+	if (cur_lsn < target_lsn)
+		ereport(WARNING,
+				 errmsg("LSN was not reached"),
+				 errhint("Try to increase wait time."));
+	else
+		res = true;
+
+	return res;
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+	uint64_t		delay = PG_GETARG_INT32(1);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, delay));
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..760d760356 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -295,6 +297,11 @@ CreateSharedMemoryAndSemaphores(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 
+	/*
+	 * Init array of events for the wait clause in shared memory
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 22b4278610..1dec418992 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -704,6 +705,9 @@ LockErrorCleanup(void)
 
 	AbortStrongLockAcquire();
 
+	/* If wait lsn was interrupted, then stop waiting for that LSN */
+	DeleteWaitedLSN();
+
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
 	{
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 5d78d6dc06..fbc96db198 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -384,8 +384,6 @@ pg_sleep(PG_FUNCTION_ARGS)
 	 * less than the specified time when WaitLatch is terminated early by a
 	 * non-query-canceling signal such as SIGHUP.
 	 */
-#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
-
 	endtime = GetNowFloat() + secs;
 
 	for (;;)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 505595620e..0979c25217 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11931,6 +11931,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prorettype => 'bool', proargtypes => 'pg_lsn int8',
+  proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '8981', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..fd21e43416
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+#include "nodes/parsenodes.h"
+
+extern bool WaitUtility(XLogRecPtr lsn, const int timeout_ms);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWaitedLSN(void);
+extern void DeleteWaitedLSN(void);
+
+#endif							/* WAIT_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index edd59dc432..db2926f965 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -140,4 +140,6 @@ extern int	date2isoyearday(int year, int mon, int mday);
 
 extern bool TimestampTimestampTzRequiresRewrite(void);
 
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)
+
 #endif							/* TIMESTAMP_H */
diff --git a/src/test/recovery/t/034_waitlsn.pl b/src/test/recovery/t/034_waitlsn.pl
new file mode 100644
index 0000000000..dc9e899671
--- /dev/null
+++ b/src/test/recovery/t/034_waitlsn.pl
@@ -0,0 +1,76 @@
+# Checks wait lsn
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Using the backup, create a streaming standby with a 1 second delay
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# Check that timeouts make us wait for the specified time (1s here)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $two_seconds = 2000; # in milliseconds
+my $start_time = time();
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('0/FFFFFFFF', $two_seconds)");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $two_seconds, "wait lsn waits for enough time");
+
+# Check that timeouts let us stop waiting right away, before reaching target LSN
+# Wait for no wait
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my ($ret, $out, $err) = $node_standby->psql('postgres',
+	"SELECT pg_wait_lsn('$lsn1', 1)");
+
+ok($ret == 0, "zero return value when failed to wait lsn on standby");
+ok($err =~ /WARNING:  LSN was not reached/,
+	"correct error message when failed to wait lsn on standby");
+ok($out eq "f", "if given too little wait time, WAIT doesn't reach target LSN");
+
+
+# Check that wait lsn works fine and reaches target LSN if given no timeout
+# Wait for infinite
+
+# Add data on primary, memorize primary's last LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Wait for it to appear on replica, memorize replica's last LSN
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('$lsn2', 0)");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+
+# Make sure that primary's and replica's LSNs are the same after WAIT
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$reached_lsn'::pg_lsn, '$lsn2'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"standby reached the same LSN as primary before starting transaction");
+
+$node_standby->stop;
+$node_primary->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 86a9303bf5..374f1bc267 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2968,6 +2968,7 @@ WaitEventIPC
 WaitEventSet
 WaitEventTimeout
 WaitPMResult
+WaitState
 WalCloseMethod
 WalCompression
 WalLevel
#8Картышов Иван
i.kartyshov@postgrespro.ru
In reply to: Kartyshov Ivan (#7)
1 attachment(s)

All rebased and tested

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company@postgrespro.ru>
 

Attachments:

wait_proc_v5.patchtext/x-patchDownload
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..c7460bd9b8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..2f9be4f9ad
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,275 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2023, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xact.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+				state->lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/*
+ * Set latches in shared memory to signal that new LSN has been replayed
+ */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int			backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+
+		if (backend && state->lsn[i] != 0 &&
+			state->lsn[i] <= cur_lsn)
+		{
+			SetLatch(&backend->procLatch);
+		}
+	}
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Get minimal LSN that someone waits for
+ */
+XLogRecPtr
+GetMinWaitedLSN(void)
+{
+	return state->min_lsn.value;
+}
+
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens. Timeout is specified in milliseconds.
+ * Returns true if LSN was reached and false otherwise.
+ */
+bool
+WaitUtility(XLogRecPtr target_lsn, const int timeout_ms)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	bool		res = false;
+	bool		wait_forever = (timeout_ms <= 0);
+
+	/*
+	 * In transactions, that have isolation level repeatable read or higher
+	 * wait lsn creates a snapshot if called first in a block, which can
+	 * lead the transaction to working incorrectly
+	 */
+
+	if (IsTransactionBlock() && XactIsoLevel != XACT_READ_COMMITTED) {
+		ereport(WARNING,
+				errmsg("Waitlsn may work incorrectly in this isolation level"),
+				errhint("Call wait lsn before starting the transaction"));
+	}
+
+	endtime = GetNowFloat() + timeout_ms / 1000.0;
+
+	latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	/* Check if we already reached the needed LSN */
+	if (cur_lsn >= target_lsn)
+		return true;
+
+	AddWaitedLSN(target_lsn);
+
+	for (;;)
+	{
+		int			rc;
+		float8		time_left = 0;
+		long		time_left_ms = 0;
+
+		time_left = endtime - GetNowFloat();
+
+		/* Use 100 ms as the default timeout to check for interrupts */
+		if (wait_forever || time_left < 0 || time_left > 0.1)
+			time_left_ms = 100;
+		else
+			time_left_ms = (long) ceil(time_left * 1000.0);
+
+		/* If interrupt, LockErrorCleanup() will do DeleteWaitedLSN() for us */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, time_left_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		ResetLatch(MyLatch);
+
+		if (rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		if (rc & WL_TIMEOUT)
+		{
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+			/* If the time specified by user has passed, stop waiting */
+			time_left = endtime - GetNowFloat();
+			if (!wait_forever && time_left <= 0.0)
+				break;
+		}
+
+		/* If LSN has been replayed */
+		if (target_lsn <= cur_lsn)
+			break;
+	}
+
+	DeleteWaitedLSN();
+
+	if (cur_lsn < target_lsn)
+		ereport(WARNING,
+				 errmsg("LSN was not reached"),
+				 errhint("Try to increase wait time."));
+	else
+		res = true;
+
+	return res;
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+	uint64_t		delay = PG_GETARG_INT32(1);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, delay));
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..760d760356 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -295,6 +297,11 @@ CreateSharedMemoryAndSemaphores(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 
+	/*
+	 * Init array of events for the wait clause in shared memory
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index dac921219f..92e354c848 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -704,6 +705,9 @@ LockErrorCleanup(void)
 
 	AbortStrongLockAcquire();
 
+	/* If wait lsn was interrupted, then stop waiting for that LSN */
+	DeleteWaitedLSN();
+
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
 	{
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 5d78d6dc06..fbc96db198 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -384,8 +384,6 @@ pg_sleep(PG_FUNCTION_ARGS)
 	 * less than the specified time when WaitLatch is terminated early by a
 	 * non-query-canceling signal such as SIGHUP.
 	 */
-#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
-
 	endtime = GetNowFloat() + secs;
 
 	for (;;)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6996073989..ebff5cadcd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12035,6 +12035,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prorettype => 'bool', proargtypes => 'pg_lsn int8',
+  proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..bd52664de9
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2023, Regents of PostgresPro
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+#include "nodes/parsenodes.h"
+
+extern bool WaitUtility(XLogRecPtr lsn, const int timeout_ms);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWaitedLSN(void);
+extern void DeleteWaitedLSN(void);
+
+#endif							/* WAIT_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index c4dd96c8c9..6d763d1175 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -144,4 +144,6 @@ extern int	date2isoyearday(int year, int mon, int mday);
 
 extern bool TimestampTimestampTzRequiresRewrite(void);
 
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)
+
 #endif							/* TIMESTAMP_H */
diff --git a/src/test/recovery/t/034_waitlsn.pl b/src/test/recovery/t/034_waitlsn.pl
new file mode 100644
index 0000000000..dc9e899671
--- /dev/null
+++ b/src/test/recovery/t/034_waitlsn.pl
@@ -0,0 +1,76 @@
+# Checks wait lsn
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Using the backup, create a streaming standby with a 1 second delay
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# Check that timeouts make us wait for the specified time (1s here)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $two_seconds = 2000; # in milliseconds
+my $start_time = time();
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('0/FFFFFFFF', $two_seconds)");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $two_seconds, "wait lsn waits for enough time");
+
+# Check that timeouts let us stop waiting right away, before reaching target LSN
+# Wait for no wait
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my ($ret, $out, $err) = $node_standby->psql('postgres',
+	"SELECT pg_wait_lsn('$lsn1', 1)");
+
+ok($ret == 0, "zero return value when failed to wait lsn on standby");
+ok($err =~ /WARNING:  LSN was not reached/,
+	"correct error message when failed to wait lsn on standby");
+ok($out eq "f", "if given too little wait time, WAIT doesn't reach target LSN");
+
+
+# Check that wait lsn works fine and reaches target LSN if given no timeout
+# Wait for infinite
+
+# Add data on primary, memorize primary's last LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Wait for it to appear on replica, memorize replica's last LSN
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('$lsn2', 0)");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+
+# Make sure that primary's and replica's LSNs are the same after WAIT
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$reached_lsn'::pg_lsn, '$lsn2'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"standby reached the same LSN as primary before starting transaction");
+
+$node_standby->stop;
+$node_primary->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 260854747b..17e1d17660 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2992,6 +2992,7 @@ WaitEventIPC
 WaitEventSet
 WaitEventTimeout
 WaitPMResult
+WaitState
 WalCloseMethod
 WalCompression
 WalLevel
#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Картышов Иван (#8)

Hi, Ivan!

On Fri, Jun 30, 2023 at 11:32 AM Картышов Иван <i.kartyshov@postgrespro.ru>
wrote:

All rebased and tested

Thank you for continuing to work on this patch.

I see you're concentrating on the procedural version of this feature. But
when you're calling a procedure within a normal SQL statement, the executor
gets a snapshot and holds it until the procedure finishes. In the case the
WAL record conflicts with this snapshot, the query will be canceled.
Alternatively, when hot_standby_feedback = on, the query and WAL replayer
will be in a deadlock (WAL replayer will wait for the query to finish, and
the query will wait for WAL replayed). Do you see this issue? Or do you
think I'm missing something?

XLogRecPtr
GetMinWaitedLSN(void)
{
return state->min_lsn.value;
}

You definitely shouldn't access directly the fields
inside pg_atomic_uint64. In this particular case, you should
use pg_atomic_read_u64().

Also, I think there is a race condition.

/* Check if we already reached the needed LSN */
if (cur_lsn >= target_lsn)
return true;

AddWaitedLSN(target_lsn);

Imagine, PerformWalRecovery() will replay a record after the check, but
before AddWaitedLSN(). This code will start the waiting cycle even if the
LSN is already achieved. Surely this cycle will end soon because it
rechecks LSN value each 100 ms. But anyway, I think there should be
another check after AddWaitedLSN() for the sake of consistency.

------
Regards,
Alexander Korotkov

#10Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#9)

Hi!

On Wed, Oct 4, 2023 at 1:22 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I see you're concentrating on the procedural version of this feature. But when you're calling a procedure within a normal SQL statement, the executor gets a snapshot and holds it until the procedure finishes. In the case the WAL record conflicts with this snapshot, the query will be canceled. Alternatively, when hot_standby_feedback = on, the query and WAL replayer will be in a deadlock (WAL replayer will wait for the query to finish, and the query will wait for WAL replayed). Do you see this issue? Or do you think I'm missing something?

I'm sorry, I actually meant hot_standby_feedback = off
(hot_standby_feedback = on actually avoids query conflicts). I
managed to reproduce this problem.

master: create table test as (select i from generate_series(1,10000) i);
slave conn1: select pg_wal_replay_pause();
master: delete from test;
master: vacuum test;
master: select pg_current_wal_lsn();
slave conn2: select pg_wait_lsn('the value from previous query'::pg_lsn, 0);
slave conn1: select pg_wal_replay_resume();
slave conn2: ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

Needless to say, this is very undesirable behavior. This happens
because pg_wait_lsn() has to run within a snapshot as any other
function. This is why I think this functionality should be
implemented as a separate statement.

Another issue I found is that pg_wait_lsn() hangs on the primary. I
think an error should be reported instead.

------
Regards,
Alexander Korotkov

#11Картышов Иван
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#10)
1 attachment(s)

Alexander, thank you for your review and pointing this issues. According to
them I made some fixes and rebase all patch.

But I can`t repeat your ERROR. Not with hot_standby_feedback = on nor 
hot_standby_feedback = off.master: create table test as (select i from generate_series(1,10000) i);
slave conn1: select pg_wal_replay_pause();
master: delete from test;
master: vacuum test;
master: select pg_current_wal_lsn();
slave conn2: select pg_wait_lsn('the value from previous query'::pg_lsn, 0);
slave conn1: select pg_wal_replay_resume();
slave conn2: ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.Also I use little hack to work out of snapshot similar to SnapshotResetXmin.

Patch rebased and ready for review.

 

Attachments:

wait_proc_v6.patchtext/x-patchDownload
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..edbbe208cb 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1780,6 +1781,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..7db208c726
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,284 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2023, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+				state->lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/*
+ * Set latches in shared memory to signal that new LSN has been replayed
+ */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int			backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+
+		if (backend && state->lsn[i] != 0 &&
+			state->lsn[i] <= cur_lsn)
+		{
+			SetLatch(&backend->procLatch);
+		}
+	}
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Get minimal LSN that someone waits for
+ */
+XLogRecPtr
+GetMinWaitedLSN(void)
+{
+	return pg_atomic_read_u64(&state->min_lsn);
+}
+
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens. Timeout is specified in milliseconds.
+ * Returns true if LSN was reached and false otherwise.
+ */
+bool
+WaitUtility(XLogRecPtr target_lsn, const int timeout_ms)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	bool		res = false;
+	bool		wait_forever = (timeout_ms <= 0);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Work only in standby mode")));
+
+	/*
+	 * In transactions, that have isolation level repeatable read or higher
+	 * wait lsn creates a snapshot if called first in a block, which can
+	 * lead the transaction to working incorrectly
+	 */
+
+	if (IsTransactionBlock() && XactIsoLevel != XACT_READ_COMMITTED) {
+		ereport(WARNING,
+				errmsg("Waitlsn may work incorrectly in this isolation level"),
+				errhint("Call wait lsn before starting the transaction"));
+	}
+
+	endtime = GetNowFloat() + timeout_ms / 1000.0;
+
+	latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	/* Check if we already reached the needed LSN */
+	if (cur_lsn >= target_lsn)
+		return true;
+
+	AddWaitedLSN(target_lsn);
+
+	for (;;)
+	{
+		int			rc;
+		float8		time_left = 0;
+		long		time_left_ms = 0;
+
+		/* If LSN has been replayed */
+		if (target_lsn <= cur_lsn)
+			break;
+
+		time_left = endtime - GetNowFloat();
+
+		/* Use 100 ms as the default timeout to check for interrupts */
+		if (wait_forever || time_left < 0 || time_left > 0.1)
+			time_left_ms = 100;
+		else
+			time_left_ms = (long) ceil(time_left * 1000.0);
+
+		/* If interrupt, LockErrorCleanup() will do DeleteWaitedLSN() for us */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		/* A little hack similar to SnapshotResetXmin to work out of snapshot */
+		MyProc->xmin = InvalidTransactionId;
+
+		rc = WaitLatch(MyLatch, latch_events, time_left_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		ResetLatch(MyLatch);
+
+		if (rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		if (rc & WL_TIMEOUT)
+		{
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+			/* If the time specified by user has passed, stop waiting */
+			time_left = endtime - GetNowFloat();
+			if (!wait_forever && time_left <= 0.0)
+				break;
+		}
+	}
+
+	DeleteWaitedLSN();
+
+	if (cur_lsn < target_lsn)
+		ereport(WARNING,
+				 errmsg("LSN was not reached"),
+				 errhint("Try to increase wait time."));
+	else
+		res = true;
+
+	return res;
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+	uint64_t		delay = PG_GETARG_INT32(1);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, delay));
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..cf3c570353 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -142,6 +143,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
@@ -303,6 +305,11 @@ CreateSharedMemoryAndSemaphores(void)
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
 
+	/*
+	 * Init array of events for the wait clause in shared memory
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e9e445bb21..35be6dd44d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -704,6 +705,9 @@ LockErrorCleanup(void)
 
 	AbortStrongLockAcquire();
 
+	/* If wait lsn was interrupted, then stop waiting for that LSN */
+	DeleteWaitedLSN();
+
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
 	{
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 5d78d6dc06..fbc96db198 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -384,8 +384,6 @@ pg_sleep(PG_FUNCTION_ARGS)
 	 * less than the specified time when WaitLatch is terminated early by a
 	 * non-query-canceling signal such as SIGHUP.
 	 */
-#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
-
 	endtime = GetNowFloat() + secs;
 
 	for (;;)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..0859ec6682 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12092,6 +12092,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prorettype => 'bool', proargtypes => 'pg_lsn int8',
+  proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..bd52664de9
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2023, Regents of PostgresPro
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+#include "nodes/parsenodes.h"
+
+extern bool WaitUtility(XLogRecPtr lsn, const int timeout_ms);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWaitedLSN(void);
+extern void DeleteWaitedLSN(void);
+
+#endif							/* WAIT_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index c4dd96c8c9..6d763d1175 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -144,4 +144,6 @@ extern int	date2isoyearday(int year, int mon, int mday);
 
 extern bool TimestampTimestampTzRequiresRewrite(void);
 
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)
+
 #endif							/* TIMESTAMP_H */
diff --git a/src/test/recovery/t/040_waitlsn.pl b/src/test/recovery/t/040_waitlsn.pl
new file mode 100644
index 0000000000..dc9e899671
--- /dev/null
+++ b/src/test/recovery/t/040_waitlsn.pl
@@ -0,0 +1,76 @@
+# Checks wait lsn
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Using the backup, create a streaming standby with a 1 second delay
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# Check that timeouts make us wait for the specified time (1s here)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $two_seconds = 2000; # in milliseconds
+my $start_time = time();
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('0/FFFFFFFF', $two_seconds)");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $two_seconds, "wait lsn waits for enough time");
+
+# Check that timeouts let us stop waiting right away, before reaching target LSN
+# Wait for no wait
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my ($ret, $out, $err) = $node_standby->psql('postgres',
+	"SELECT pg_wait_lsn('$lsn1', 1)");
+
+ok($ret == 0, "zero return value when failed to wait lsn on standby");
+ok($err =~ /WARNING:  LSN was not reached/,
+	"correct error message when failed to wait lsn on standby");
+ok($out eq "f", "if given too little wait time, WAIT doesn't reach target LSN");
+
+
+# Check that wait lsn works fine and reaches target LSN if given no timeout
+# Wait for infinite
+
+# Add data on primary, memorize primary's last LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Wait for it to appear on replica, memorize replica's last LSN
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('$lsn2', 0)");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+
+# Make sure that primary's and replica's LSNs are the same after WAIT
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$reached_lsn'::pg_lsn, '$lsn2'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"standby reached the same LSN as primary before starting transaction");
+
+$node_standby->stop;
+$node_primary->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13..24dc253c59 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3011,6 +3011,7 @@ WaitEventIPC
 WaitEventSet
 WaitEventTimeout
 WaitPMResult
+WaitState
 WalCloseMethod
 WalCompression
 WalLevel
#12Bowen Shi
zxwsbg12138@gmail.com
In reply to: Картышов Иван (#11)

Hi,

I used the latest code and found some conflicts while applying. Which PG
version did you rebase?

Regards
Bowen Shi

#13Alexander Korotkov
aekorotkov@gmail.com
In reply to: Bowen Shi (#12)

On Thu, Nov 23, 2023 at 5:52 AM Bowen Shi <zxwsbg12138@gmail.com> wrote:

I used the latest code and found some conflicts while applying. Which PG version did you rebase?

I've successfully applied the patch on bc3c8db8ae. But I've used
"patch -p1 < wait_proc_v6.patch", git am doesn't work.

------
Regards,
Alexander Korotkov

#14Alexander Korotkov
aekorotkov@gmail.com
In reply to: Картышов Иван (#11)

On Mon, Nov 20, 2023 at 1:10 PM Картышов Иван
<i.kartyshov@postgrespro.ru> wrote:

Alexander, thank you for your review and pointing this issues. According to
them I made some fixes and rebase all patch.

But I can`t repeat your ERROR. Not with hot_standby_feedback = on nor
hot_standby_feedback = off.

master: create table test as (select i from generate_series(1,10000) i);
slave conn1: select pg_wal_replay_pause();
master: delete from test;
master: vacuum test;
master: select pg_current_wal_lsn();
slave conn2: select pg_wait_lsn('the value from previous query'::pg_lsn, 0);
slave conn1: select pg_wal_replay_resume();
slave conn2: ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

Also I use little hack to work out of snapshot similar to SnapshotResetXmin.

Patch rebased and ready for review.

I've retried my case with v6 and it doesn't fail anymore. But I
wonder how safe it is to reset xmin within the user-visible function?
We have no guarantee that the function is not called inside the
complex query. Then how will the rest of the query work with xmin
reset? Separate utility statement still looks like more safe option
for me.

------
Regards,
Alexander Korotkov

#15Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#14)

On 2023-11-27 03:08, Alexander Korotkov wrote:

I've retried my case with v6 and it doesn't fail anymore. But I
wonder how safe it is to reset xmin within the user-visible function?
We have no guarantee that the function is not called inside the
complex query. Then how will the rest of the query work with xmin
reset? Separate utility statement still looks like more safe option
for me.

As you mentioned, we can`t guarantee that the function is not called
inside the complex query, but we can return the xmin after waiting.
But you are right and separate utility statement still looks more safe.
So I want to bring up the discussion on separate utility statement
again.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

#16Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#14)
3 attachment(s)

Should rise disscusion on separate utility statement or find
case where procedure version is failed.

1) Classic (wait_classic_v3.patch)
/messages/by-id/3cc883048264c2e9af022033925ff8db@postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp

2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
/messages/by-id/d3ff2e363af60b345f82396992595a03@postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
/messages/by-id/27171.1586439221@sss.pgh.pa.us
/messages/by-id/20210121.173009.235021120161403875.horikyota.ntt@gmail.com
==========
advantages: no new words in grammar,like it made in
pg_last_wal_replay_lsn
disadvantages: use snapshot xmin trick
SELECT pg_waitlsn(‘LSN’, timeout);
SELECT pg_waitlsn_infinite(‘LSN’);
SELECT pg_waitlsn_no_wait(‘LSN’);

Regards
--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

wait_after_within_v2.patchtext/x-diff; name=wait_after_within_v2.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..18695e013e 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a83ff4551e 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..08399452ce 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1780,6 +1781,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			* If we replayed an LSN that someone was waiting for,
+			* set latches in shared memory array to notify the waiter.
+			*/
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+			{
+				 WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..5baa6e2ee3
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,338 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (state->min_lsn.value == lsn_to_delete)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->waited_lsn[i];
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return pg_atomic_read_u64(&state->min_lsn);
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int		res = 0;
+
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												 CStringGetDatum(stmt->lsn)));
+	res = WaitUtility(lsn, stmt->delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 7a1dfb6364..5f3a7edf6a 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -405,7 +405,6 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
 		default:
 
 			/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..215dabfdb7 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -644,6 +644,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <ival>		wait_time
+%type <node>		wait_for
 
 %type <node>	json_format_clause_opt
 				json_value_expr
@@ -1102,6 +1104,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10911,12 +10914,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11015,12 +11019,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15854,6 +15859,37 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				AFTER LSN_value [WITHIN delay timestamp]
+ *
+ *****************************************************************************/
+WaitStmt:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+		;
+wait_for:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; }
+		;
+
+wait_time:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 
 /*
  * Aggregate decoration clauses
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2225a4a6e6..0364f5d195 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -145,6 +146,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -237,6 +239,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 366a27ae8e..2abe394fad 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2847,6 +2865,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3495,6 +3517,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index e494309da8..cc6ab3b73e 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3524,6 +3524,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node		*wait;			/* wait lsn clause */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4075,4 +4076,16 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		AFTER Statement + AFTER clause of BEGIN statement
+ * ----------------------
+ */
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn;		/* LSN */
+	int			delay;		/* TIMEOUT */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 320ee91512..7fd69e1c7e 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "AFTER", false, false, false)
diff --git a/src/test/recovery/t/040_begin_after.pl b/src/test/recovery/t/040_begin_after.pl
new file mode 100644
index 0000000000..b22e63603c
--- /dev/null
+++ b/src/test/recovery/t/040_begin_after.pl
@@ -0,0 +1,86 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "AFTER" command
+#===============================================================================
+# We need to check that AFTER works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ AFTER '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/041_after.pl b/src/test/recovery/t/041_after.pl
new file mode 100644
index 0000000000..480ba6b33b
--- /dev/null
+++ b/src/test/recovery/t/041_after.pl
@@ -0,0 +1,99 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 2;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"AFTER '$lsn2' WITHIN 1");
+ok($reached_lsn eq "f", "AFTER doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	AFTER '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
wait_classic_v3.patchtext/x-diff; name=wait_classic_v3.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..18695e013e 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..b3af16c09f 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..1b54ed2084 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 0000000000..26cae3ad85
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,146 @@
+<!--
+doc/src/sgml/ref/wait.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait">
+ <indexterm zone="sql-wait">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed or for specified time to pass</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>where <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN value
+    TIMEOUT number_of_milliseconds
+    timestamp
+
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>'
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ALL LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>, TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ANY LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>WAIT FOR</command> provides a simple interprocess communication
+   mechanism to wait for timestamp, timeout or the target log sequence number
+   (<acronym>LSN</acronym>) on standby in <productname>PostgreSQL</productname>
+   databases with master-standby asynchronous replication. When run with the
+   <replaceable>LSN</replaceable> option, the <command>WAIT FOR</command>
+   command waits for the specified <acronym>LSN</acronym> to be replayed.
+   If no timestamp or timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using <literal>Ctrl+C</literal>, or
+   by shutting down the <literal>postgres</literal> server.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">LSN</replaceable></term>
+    <listitem>
+     <para>
+      Specify the target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>TIMEOUT <replaceable class="parameter">wait_timeout</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time interval to wait for the LSN to be replayed.
+      The specified <replaceable>wait_timeout</replaceable> must be an integer
+      and is measured in milliseconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>UNTIL TIMESTAMP <replaceable class="parameter">wait_time</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time to wait for the LSN to be replayed.
+      The specified <replaceable>wait_time</replaceable> must be timestamp.
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Run <literal>WAIT FOR</literal> from <application>psql</application>,
+   limiting wait time to 10000 milliseconds:
+
+<screen>
+WAIT FOR LSN '0/3F07A6B1' TIMEOUT 10000;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-------------
+ f
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Wait until the specified <acronym>LSN</acronym> is replayed:
+<screen>
+WAIT FOR LSN '0/3F07A611';
+LSN reached
+-------------
+ t
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Limit <acronym>LSN</acronym> wait time to 500000 milliseconds,
+   and then cancel the command if <acronym>LSN</acronym> was not reached:
+<screen>
+WAIT FOR LSN '0/3F0FF791' TIMEOUT 500000;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-------------
+ f
+(1 row)
+</screen>
+</para>
+ </refsect1>
+
+ <refsect1>
+  <title>Compatibility</title>
+
+  <para>
+   There is no <command>WAIT FOR</command> statement in the SQL
+   standard.
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a83ff4551e 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..5cc2196927 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1780,6 +1781,13 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..994f023f64
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,403 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (state->min_lsn.value == lsn_to_delete)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->waited_lsn[i];
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return pg_atomic_read_u64(&state->min_lsn);
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	ListCell   *events;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	float8		delay = 0;
+	float8		final_delay = 0;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	XLogRecPtr	final_lsn = InvalidXLogRecPtr;
+	bool		has_lsn = false;
+	bool		wait_forever = true;
+	int			res = 0;
+
+	if (stmt->strategy == WAIT_FOR_ANY)
+	{
+		/* Prepare to find minimum LSN and delay */
+		final_delay = DBL_MAX;
+		final_lsn = PG_UINT64_MAX;
+	}
+
+	/* Extract options from the statement node tree */
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+
+		/* LSN to wait for */
+		if (event->lsn)
+		{
+			has_lsn = true;
+			lsn = DatumGetLSN(
+						DirectFunctionCall1(pg_lsn_in,
+							CStringGetDatum(event->lsn)));
+
+			/*
+			 * When waiting for ALL, select max LSN to wait for.
+			 * When waiting for ANY, select min LSN to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_lsn <= lsn) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_lsn > lsn))
+			{
+				final_lsn = lsn;
+			}
+		}
+
+		/* Time delay to wait for */
+		if (event->time || event->delay)
+		{
+			wait_forever = false;
+
+			if (event->delay)
+				delay = event->delay / 1000.0;
+
+			if (event->time)
+			{
+				Const *time = (Const *) event->time;
+				delay = WaitTimeResolve(time);
+			}
+
+			/*
+			 * When waiting for ALL, select max delay to wait for.
+			 * When waiting for ANY, select min delay to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_delay <= delay) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_delay > delay))
+			{
+				final_delay = delay;
+			}
+		}
+	}
+	if (wait_forever)
+		final_delay = 0;
+	if (!has_lsn)
+		final_lsn = InvalidXLogRecPtr;
+
+	res = WaitUtility(final_lsn, final_delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 7a1dfb6364..f1fe335bef 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -85,6 +85,7 @@ static Query *transformCreateTableAsStmt(ParseState *pstate,
 										 CreateTableAsStmt *stmt);
 static Query *transformCallStmt(ParseState *pstate,
 								CallStmt *stmt);
+static void transformWaitForStmt(ParseState *pstate, WaitStmt *stmt);
 static void transformLockingClause(ParseState *pstate, Query *qry,
 								   LockingClause *lc, bool pushedDown);
 #ifdef RAW_EXPRESSION_COVERAGE_TEST
@@ -405,7 +406,20 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
+		case T_WaitStmt:
+			transformWaitForStmt(pstate, (WaitStmt *) parseTree);
+			result = makeNode(Query);
+			result->commandType = CMD_UTILITY;
+			result->utilityStmt = (Node *) parseTree;
+			break;
+		case T_TransactionStmt:
+			{
+				TransactionStmt *stmt = (TransactionStmt *) parseTree;
+				if ((stmt->kind == TRANS_STMT_BEGIN ||
+						stmt->kind == TRANS_STMT_START) && stmt->wait)
+					transformWaitForStmt(pstate, (WaitStmt *) stmt->wait);
+			}
+			/* no break here - we want to fall through to the default */
 		default:
 
 			/*
@@ -3559,6 +3573,23 @@ applyLockingClause(Query *qry, Index rtindex,
 	qry->rowMarks = lappend(qry->rowMarks, rc);
 }
 
+/*
+ * transformWaitForStmt -
+ *	transform the WAIT FOR clause of the BEGIN statement
+ *	transform the WAIT FOR statement (TODO: remove this line if we don't keep it)
+ */
+static void
+transformWaitForStmt(ParseState *pstate, WaitStmt *stmt)
+{
+	ListCell   *events;
+
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+		event->time = transformExpr(pstate, event->time, EXPR_KIND_OTHER);
+	}
+}
+
 /*
  * Coverage testing for raw_expression_tree_walker().
  *
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..b5d86ff32b 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -536,7 +536,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	case_expr case_arg when_clause case_default
 %type <list>	when_clause_list
 %type <node>	opt_search_clause opt_cycle_clause
-%type <ival>	sub_type opt_materialized
+%type <ival>	sub_type wait_strategy opt_materialized
 %type <node>	NumericOnly
 %type <list>	NumericOnly_list
 %type <alias>	alias_clause opt_alias_clause opt_alias_clause_for_join_using
@@ -644,6 +644,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <list>		wait_list
+%type <node>		WaitEvent wait_for
 
 %type <node>	json_format_clause_opt
 				json_value_expr
@@ -729,7 +731,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -763,7 +765,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	SUBSCRIPTION SUBSTRING SUPPORT SYMMETRIC SYSID SYSTEM_P SYSTEM_USER
 
 	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN
-	TIES TIME TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
+	TIES TIME TIMEOUT TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
@@ -773,7 +775,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW
+	WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1102,6 +1105,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10911,12 +10915,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11015,12 +11020,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15854,6 +15860,74 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				WAIT FOR <event> [, <event> ...]
+ *				event is one of:
+ *					LSN value
+ *					TIMEOUT delay
+ *					timestamp
+ *
+ *****************************************************************************/
+WaitStmt:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			;
+wait_for:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; };
+
+wait_strategy:
+			ALL					{ $$ = WAIT_FOR_ALL; }
+			| ANY				{ $$ = WAIT_FOR_ANY; }
+			| /* EMPTY */		{ $$ = WAIT_FOR_ALL; }
+		;
+
+wait_list:
+			WaitEvent					{ $$ = list_make1($1); }
+			| wait_list ',' WaitEvent	{ $$ = lappend($1, $3); }
+			| wait_list WaitEvent		{ $$ = lappend($1, $2); }
+		;
+
+WaitEvent:
+			LSN Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = 0;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| TIMEOUT Iconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = $2;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| ConstDatetime Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = 0;
+					n->time = makeStringConstCast($2, @2, $1);
+					$$ = (Node *)n;
+				}
+			;
+
 
 /*
  * Aggregate decoration clauses
@@ -17239,6 +17313,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -17371,6 +17446,7 @@ unreserved_keyword:
 			| TEMPORARY
 			| TEXT_P
 			| TIES
+			| TIMEOUT
 			| TRANSACTION
 			| TRANSFORM
 			| TRIGGER
@@ -17397,6 +17473,7 @@ unreserved_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -17829,6 +17906,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -17991,6 +18069,7 @@ bare_label_keyword:
 			| THEN
 			| TIES
 			| TIME
+			| TIMEOUT
 			| TIMESTAMP
 			| TRAILING
 			| TRANSACTION
@@ -18028,6 +18107,7 @@ bare_label_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2225a4a6e6..aaa2dc3388 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -145,6 +146,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -345,6 +347,11 @@ CreateOrAttachShmemStructs(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
+
+	/*
+	 * Init array of Latches in shared memory for WAIT
+	 */
+	WaitShmemInit();
 }
 
 /*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 366a27ae8e..2abe394fad 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2847,6 +2865,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3495,6 +3517,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index e494309da8..58e8b14a1b 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3524,6 +3524,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node	   *wait;			/* WAIT clause: list of events to wait for */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4075,4 +4076,26 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		WAIT FOR Statement + WAIT FOR clause of BEGIN statement
+ *		TODO: if we only pick one, remove the other
+ * ----------------------
+ */
+
+typedef enum WaitForStrategy
+{
+	WAIT_FOR_ANY = 0,
+	WAIT_FOR_ALL
+} WaitForStrategy;
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	WaitForStrategy strategy;
+	List	   *events;		/* used as a pointer to the next WAIT event */
+	char	   *lsn;		/* WAIT FOR LSN */
+	int			delay;		/* WAIT FOR TIMEOUT */
+	Node	   *time;		/* WAIT FOR TIMESTAMP or TIME */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 5984dcfa4b..dadcd3b267 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -260,6 +260,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -433,6 +434,7 @@ PG_KEYWORD("text", TEXT_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("then", THEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("ties", TIES, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("time", TIME, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("timeout", TIMEOUT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("timestamp", TIMESTAMP, COL_NAME_KEYWORD, BARE_LABEL)
 PG_KEYWORD("to", TO, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("trailing", TRAILING, RESERVED_KEYWORD, BARE_LABEL)
@@ -473,6 +475,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 320ee91512..7bcfebde38 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT FOR", false, false, false)
diff --git a/src/test/recovery/t/040_begin_wait.pl b/src/test/recovery/t/040_begin_wait.pl
new file mode 100644
index 0000000000..f1e5b5b23d
--- /dev/null
+++ b/src/test/recovery/t/040_begin_wait.pl
@@ -0,0 +1,146 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "WAIT FOR" command
+#===============================================================================
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ WAIT FOR LSN '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/041_wait.pl b/src/test/recovery/t/041_wait.pl
new file mode 100644
index 0000000000..6f9d549416
--- /dev/null
+++ b/src/test/recovery/t/041_wait.pl
@@ -0,0 +1,145 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
wait_proc_v7.patchtext/x-diff; name=wait_proc_v7.patchDownload
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..edbbe208cb 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1780,6 +1781,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..4994e5f4c3
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,286 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2023, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <math.h>
+
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+				state->lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/*
+ * Set latches in shared memory to signal that new LSN has been replayed
+ */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int			backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+
+		if (backend && state->lsn[i] != 0 &&
+			state->lsn[i] <= cur_lsn)
+		{
+			SetLatch(&backend->procLatch);
+		}
+	}
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Get minimal LSN that someone waits for
+ */
+XLogRecPtr
+GetMinWaitedLSN(void)
+{
+	return pg_atomic_read_u64(&state->min_lsn);
+}
+
+/*
+ * On WAIT use a latch to wait till LSN is replayed,
+ * postmaster dies or timeout happens. Timeout is specified in milliseconds.
+ * Returns true if LSN was reached and false otherwise.
+ */
+bool
+WaitUtility(XLogRecPtr target_lsn, const int timeout_ms)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	TransactionId	xmin = MyProc->xmin;
+	int			latch_events;
+	float8		endtime;
+	bool		res = false;
+	bool		wait_forever = (timeout_ms <= 0);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("Work only in standby mode")));
+
+	/*
+	 * In transactions, that have isolation level repeatable read or higher
+	 * wait lsn creates a snapshot if called first in a block, which can
+	 * lead the transaction to working incorrectly
+	 */
+
+	if (IsTransactionBlock() && XactIsoLevel != XACT_READ_COMMITTED) {
+		ereport(WARNING,
+				errmsg("Waitlsn may work incorrectly in this isolation level"),
+				errhint("Call wait lsn before starting the transaction"));
+	}
+
+	endtime = GetNowFloat() + timeout_ms / 1000.0;
+
+	latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	/* Check if we already reached the needed LSN */
+	if (cur_lsn >= target_lsn)
+		return true;
+
+	/* A little hack similar to SnapshotResetXmin to work out of snapshot */
+	MyProc->xmin = InvalidTransactionId;
+
+	AddWaitedLSN(target_lsn);
+
+	for (;;)
+	{
+		int			rc;
+		float8		time_left = 0;
+		long		time_left_ms = 0;
+
+		/* If LSN has been replayed */
+		if (target_lsn <= cur_lsn)
+			break;
+
+		time_left = endtime - GetNowFloat();
+
+		/* Use 100 ms as the default timeout to check for interrupts */
+		if (wait_forever || time_left < 0 || time_left > 0.1)
+			time_left_ms = 100;
+		else
+			time_left_ms = (long) ceil(time_left * 1000.0);
+
+		/* If interrupt, LockErrorCleanup() will do DeleteWaitedLSN() for us */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, time_left_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		ResetLatch(MyLatch);
+
+		if (rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+
+		if (rc & WL_TIMEOUT)
+		{
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+			/* If the time specified by user has passed, stop waiting */
+			time_left = endtime - GetNowFloat();
+			if (!wait_forever && time_left <= 0.0)
+				break;
+		}
+	}
+
+	MyProc->xmin = xmin;
+	DeleteWaitedLSN();
+
+	if (cur_lsn < target_lsn)
+		ereport(WARNING,
+				 errmsg("LSN was not reached"),
+				 errhint("Try to increase wait time."));
+	else
+		res = true;
+
+	return res;
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+	uint64_t		delay = PG_GETARG_INT32(1);
+
+	PG_RETURN_BOOL(WaitUtility(trg_lsn, delay));
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2225a4a6e6..f6d2e5ba21 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -144,6 +145,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, SyncScanShmemSize());
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
+	size = add_size(size, WaitShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
@@ -237,6 +239,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of events for the wait clause in shared memory
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b6451d9d08..d4521e079f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -731,6 +732,9 @@ LockErrorCleanup(void)
 
 	AbortStrongLockAcquire();
 
+	/* If wait lsn was interrupted, then stop waiting for that LSN */
+	DeleteWaitedLSN();
+
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
 	{
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 5d78d6dc06..fbc96db198 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -384,8 +384,6 @@ pg_sleep(PG_FUNCTION_ARGS)
 	 * less than the specified time when WaitLatch is terminated early by a
 	 * non-query-canceling signal such as SIGHUP.
 	 */
-#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
-
 	endtime = GetNowFloat() + secs;
 
 	for (;;)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..0859ec6682 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12092,6 +12092,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prorettype => 'bool', proargtypes => 'pg_lsn int8',
+  proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..bd52664de9
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2023, Regents of PostgresPro
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+#include "nodes/parsenodes.h"
+
+extern bool WaitUtility(XLogRecPtr lsn, const int timeout_ms);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWaitedLSN(void);
+extern void DeleteWaitedLSN(void);
+
+#endif							/* WAIT_H */
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index c4dd96c8c9..6d763d1175 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -144,4 +144,6 @@ extern int	date2isoyearday(int year, int mon, int mday);
 
 extern bool TimestampTimestampTzRequiresRewrite(void);
 
+#define GetNowFloat() ((float8) GetCurrentTimestamp() / 1000000.0)
+
 #endif							/* TIMESTAMP_H */
diff --git a/src/test/recovery/t/040_waitlsn.pl b/src/test/recovery/t/040_waitlsn.pl
new file mode 100644
index 0000000000..dc9e899671
--- /dev/null
+++ b/src/test/recovery/t/040_waitlsn.pl
@@ -0,0 +1,76 @@
+# Checks wait lsn
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Using the backup, create a streaming standby with a 1 second delay
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+# Check that timeouts make us wait for the specified time (1s here)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $two_seconds = 2000; # in milliseconds
+my $start_time = time();
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('0/FFFFFFFF', $two_seconds)");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $two_seconds, "wait lsn waits for enough time");
+
+# Check that timeouts let us stop waiting right away, before reaching target LSN
+# Wait for no wait
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my ($ret, $out, $err) = $node_standby->psql('postgres',
+	"SELECT pg_wait_lsn('$lsn1', 1)");
+
+ok($ret == 0, "zero return value when failed to wait lsn on standby");
+ok($err =~ /WARNING:  LSN was not reached/,
+	"correct error message when failed to wait lsn on standby");
+ok($out eq "f", "if given too little wait time, WAIT doesn't reach target LSN");
+
+
+# Check that wait lsn works fine and reaches target LSN if given no timeout
+# Wait for infinite
+
+# Add data on primary, memorize primary's last LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Wait for it to appear on replica, memorize replica's last LSN
+$node_standby->safe_psql('postgres',
+	"SELECT pg_wait_lsn('$lsn2', 0)");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+
+# Make sure that primary's and replica's LSNs are the same after WAIT
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$reached_lsn'::pg_lsn, '$lsn2'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"standby reached the same LSN as primary before starting transaction");
+
+$node_standby->stop;
+$node_primary->stop;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d659adbfd6..9cdcebce3c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3015,6 +3015,7 @@ WaitEventIPC
 WaitEventSet
 WaitEventTimeout
 WaitPMResult
+WaitState
 WalCloseMethod
 WalCompression
 WalLevel
#17Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#15)

On Fri, Dec 8, 2023 at 11:20 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

On 2023-11-27 03:08, Alexander Korotkov wrote:

I've retried my case with v6 and it doesn't fail anymore. But I
wonder how safe it is to reset xmin within the user-visible function?
We have no guarantee that the function is not called inside the
complex query. Then how will the rest of the query work with xmin
reset? Separate utility statement still looks like more safe option
for me.

As you mentioned, we can`t guarantee that the function is not called
inside the complex query, but we can return the xmin after waiting.

Returning xmin back isn't safe. Especially after potentially long
waiting. The snapshot could be no longer valid, because the
corresponding tuples could be VACUUM'ed.

------
Regards,
Alexander Korotkov

#18Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#16)

On Fri, Dec 8, 2023 at 11:46 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Should rise disscusion on separate utility statement or find
case where procedure version is failed.

1) Classic (wait_classic_v3.patch)
/messages/by-id/3cc883048264c2e9af022033925ff8db@postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp

Nice, but as you stated requires new keywords.

2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
/messages/by-id/d3ff2e363af60b345f82396992595a03@postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]

+1 from me

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
/messages/by-id/27171.1586439221@sss.pgh.pa.us
/messages/by-id/20210121.173009.235021120161403875.horikyota.ntt@gmail.com
==========
advantages: no new words in grammar,like it made in
pg_last_wal_replay_lsn
disadvantages: use snapshot xmin trick
SELECT pg_waitlsn(‘LSN’, timeout);
SELECT pg_waitlsn_infinite(‘LSN’);
SELECT pg_waitlsn_no_wait(‘LSN’);

Nice, because simplicity. But only safe if called within the simple
query containing nothing else. Validating this from the function
kills the simplicity.

------
Regards,
Alexander Korotkov

#19vignesh C
vignesh21@gmail.com
In reply to: Kartyshov Ivan (#16)

On Fri, 8 Dec 2023 at 15:17, Kartyshov Ivan <i.kartyshov@postgrespro.ru> wrote:

Should rise disscusion on separate utility statement or find
case where procedure version is failed.

1) Classic (wait_classic_v3.patch)
/messages/by-id/3cc883048264c2e9af022033925ff8db@postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp

2) After style: Kyotaro and Freund (wait_after_within_v2.patch)
/messages/by-id/d3ff2e363af60b345f82396992595a03@postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]

3) Procedure style: Tom Lane and Kyotaro (wait_proc_v7.patch)
/messages/by-id/27171.1586439221@sss.pgh.pa.us
/messages/by-id/20210121.173009.235021120161403875.horikyota.ntt@gmail.com
==========
advantages: no new words in grammar,like it made in
pg_last_wal_replay_lsn
disadvantages: use snapshot xmin trick
SELECT pg_waitlsn(‘LSN’, timeout);
SELECT pg_waitlsn_infinite(‘LSN’);
SELECT pg_waitlsn_no_wait(‘LSN’);

Few of the tests have aborted at [1]https://cirrus-ci.com/task/5618308515364864 in CFBot with:
0000058`9c7ff550 00007ff6`5bdff1f4
postgres!pg_atomic_compare_exchange_u64_impl(
struct pg_atomic_uint64 * ptr = 0x00000000`00000008,
unsigned int64 * expected = 0x00000058`9c7ff5a0,
unsigned int64 newval = 0)+0x34
[c:\cirrus\src\include\port\atomics\generic-msvc.h @ 83]
00000058`9c7ff580 00007ff6`5bdff256 postgres!pg_atomic_read_u64_impl(
struct pg_atomic_uint64 * ptr = 0x00000000`00000008)+0x24
[c:\cirrus\src\include\port\atomics\generic.h @ 323]
00000058`9c7ff5c0 00007ff6`5bdfef67 postgres!pg_atomic_read_u64(
struct pg_atomic_uint64 * ptr = 0x00000000`00000008)+0x46
[c:\cirrus\src\include\port\atomics.h @ 430]
00000058`9c7ff5f0 00007ff6`5bc98fc3
postgres!GetMinWaitedLSN(void)+0x17
[c:\cirrus\src\backend\commands\wait.c @ 176]
00000058`9c7ff620 00007ff6`5bc82fb9
postgres!PerformWalRecovery(void)+0x4c3
[c:\cirrus\src\backend\access\transam\xlogrecovery.c @ 1788]
00000058`9c7ff6e0 00007ff6`5bffc651
postgres!StartupXLOG(void)+0x989
[c:\cirrus\src\backend\access\transam\xlog.c @ 5562]
00000058`9c7ff870 00007ff6`5bfed38b
postgres!StartupProcessMain(void)+0xd1
[c:\cirrus\src\backend\postmaster\startup.c @ 288]
00000058`9c7ff8a0 00007ff6`5bff49fd postgres!AuxiliaryProcessMain(
AuxProcType auxtype = StartupProcess (0n0))+0x1fb
[c:\cirrus\src\backend\postmaster\auxprocess.c @ 139]
00000058`9c7ff8e0 00007ff6`5beb7674 postgres!SubPostmasterMain(

More details are available at [2]https://api.cirrus-ci.com/v1/artifact/task/5618308515364864/crashlog/crashlog-postgres.exe_0008_2023-12-08_07-48-37-722.txt.

[1]: https://cirrus-ci.com/task/5618308515364864
[2]: https://api.cirrus-ci.com/v1/artifact/task/5618308515364864/crashlog/crashlog-postgres.exe_0008_2023-12-08_07-48-37-722.txt

Regards,
Vignesh

#20Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: vignesh C (#19)
2 attachment(s)

Rebased and ready for review.
I left only versions (due to irreparable problems)

1) Classic (wait_classic_v4.patch)
/messages/by-id/3cc883048264c2e9af022033925ff8db@postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp

2) After style: Kyotaro and Freund (wait_after_within_v3.patch)
/messages/by-id/d3ff2e363af60b345f82396992595a03@postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

wait_classic_v4.patchtext/x-diff; name=wait_classic_v4.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..410018b79b 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..b3af16c09f 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..1b54ed2084 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 0000000000..26cae3ad85
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,146 @@
+<!--
+doc/src/sgml/ref/wait.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait">
+ <indexterm zone="sql-wait">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed or for specified time to pass</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>where <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN value
+    TIMEOUT number_of_milliseconds
+    timestamp
+
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>'
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ALL LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>, TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ANY LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>WAIT FOR</command> provides a simple interprocess communication
+   mechanism to wait for timestamp, timeout or the target log sequence number
+   (<acronym>LSN</acronym>) on standby in <productname>PostgreSQL</productname>
+   databases with master-standby asynchronous replication. When run with the
+   <replaceable>LSN</replaceable> option, the <command>WAIT FOR</command>
+   command waits for the specified <acronym>LSN</acronym> to be replayed.
+   If no timestamp or timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using <literal>Ctrl+C</literal>, or
+   by shutting down the <literal>postgres</literal> server.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">LSN</replaceable></term>
+    <listitem>
+     <para>
+      Specify the target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>TIMEOUT <replaceable class="parameter">wait_timeout</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time interval to wait for the LSN to be replayed.
+      The specified <replaceable>wait_timeout</replaceable> must be an integer
+      and is measured in milliseconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>UNTIL TIMESTAMP <replaceable class="parameter">wait_time</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time to wait for the LSN to be replayed.
+      The specified <replaceable>wait_time</replaceable> must be timestamp.
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Run <literal>WAIT FOR</literal> from <application>psql</application>,
+   limiting wait time to 10000 milliseconds:
+
+<screen>
+WAIT FOR LSN '0/3F07A6B1' TIMEOUT 10000;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-------------
+ f
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Wait until the specified <acronym>LSN</acronym> is replayed:
+<screen>
+WAIT FOR LSN '0/3F07A611';
+LSN reached
+-------------
+ t
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Limit <acronym>LSN</acronym> wait time to 500000 milliseconds,
+   and then cancel the command if <acronym>LSN</acronym> was not reached:
+<screen>
+WAIT FOR LSN '0/3F0FF791' TIMEOUT 500000;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-------------
+ f
+(1 row)
+</screen>
+</para>
+ </refsect1>
+
+ <refsect1>
+  <title>Compatibility</title>
+
+  <para>
+   There is no <command>WAIT FOR</command> statement in the SQL
+   standard.
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..138e52c563 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..d27105b7a3 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1784,6 +1785,13 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..3f06dc5341 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..994f023f64
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,403 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (state->min_lsn.value == lsn_to_delete)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->waited_lsn[i];
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return pg_atomic_read_u64(&state->min_lsn);
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	ListCell   *events;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	float8		delay = 0;
+	float8		final_delay = 0;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	XLogRecPtr	final_lsn = InvalidXLogRecPtr;
+	bool		has_lsn = false;
+	bool		wait_forever = true;
+	int			res = 0;
+
+	if (stmt->strategy == WAIT_FOR_ANY)
+	{
+		/* Prepare to find minimum LSN and delay */
+		final_delay = DBL_MAX;
+		final_lsn = PG_UINT64_MAX;
+	}
+
+	/* Extract options from the statement node tree */
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+
+		/* LSN to wait for */
+		if (event->lsn)
+		{
+			has_lsn = true;
+			lsn = DatumGetLSN(
+						DirectFunctionCall1(pg_lsn_in,
+							CStringGetDatum(event->lsn)));
+
+			/*
+			 * When waiting for ALL, select max LSN to wait for.
+			 * When waiting for ANY, select min LSN to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_lsn <= lsn) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_lsn > lsn))
+			{
+				final_lsn = lsn;
+			}
+		}
+
+		/* Time delay to wait for */
+		if (event->time || event->delay)
+		{
+			wait_forever = false;
+
+			if (event->delay)
+				delay = event->delay / 1000.0;
+
+			if (event->time)
+			{
+				Const *time = (Const *) event->time;
+				delay = WaitTimeResolve(time);
+			}
+
+			/*
+			 * When waiting for ALL, select max delay to wait for.
+			 * When waiting for ANY, select min delay to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_delay <= delay) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_delay > delay))
+			{
+				final_delay = delay;
+			}
+		}
+	}
+	if (wait_forever)
+		final_delay = 0;
+	if (!has_lsn)
+		final_lsn = InvalidXLogRecPtr;
+
+	res = WaitUtility(final_lsn, final_delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 06fc8ce98b..3e41596780 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -85,6 +85,7 @@ static Query *transformCreateTableAsStmt(ParseState *pstate,
 										 CreateTableAsStmt *stmt);
 static Query *transformCallStmt(ParseState *pstate,
 								CallStmt *stmt);
+static void transformWaitForStmt(ParseState *pstate, WaitStmt *stmt);
 static void transformLockingClause(ParseState *pstate, Query *qry,
 								   LockingClause *lc, bool pushedDown);
 #ifdef RAW_EXPRESSION_COVERAGE_TEST
@@ -405,7 +406,20 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
+		case T_WaitStmt:
+			transformWaitForStmt(pstate, (WaitStmt *) parseTree);
+			result = makeNode(Query);
+			result->commandType = CMD_UTILITY;
+			result->utilityStmt = (Node *) parseTree;
+			break;
+		case T_TransactionStmt:
+			{
+				TransactionStmt *stmt = (TransactionStmt *) parseTree;
+				if ((stmt->kind == TRANS_STMT_BEGIN ||
+						stmt->kind == TRANS_STMT_START) && stmt->wait)
+					transformWaitForStmt(pstate, (WaitStmt *) stmt->wait);
+			}
+			/* no break here - we want to fall through to the default */
 		default:
 
 			/*
@@ -3559,6 +3573,23 @@ applyLockingClause(Query *qry, Index rtindex,
 	qry->rowMarks = lappend(qry->rowMarks, rc);
 }
 
+/*
+ * transformWaitForStmt -
+ *	transform the WAIT FOR clause of the BEGIN statement
+ *	transform the WAIT FOR statement (TODO: remove this line if we don't keep it)
+ */
+static void
+transformWaitForStmt(ParseState *pstate, WaitStmt *stmt)
+{
+	ListCell   *events;
+
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+		event->time = transformExpr(pstate, event->time, EXPR_KIND_OTHER);
+	}
+}
+
 /*
  * Coverage testing for raw_expression_tree_walker().
  *
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6b88096e8e..0509cd3146 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -536,7 +536,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	case_expr case_arg when_clause case_default
 %type <list>	when_clause_list
 %type <node>	opt_search_clause opt_cycle_clause
-%type <ival>	sub_type opt_materialized
+%type <ival>	sub_type wait_strategy opt_materialized
 %type <node>	NumericOnly
 %type <list>	NumericOnly_list
 %type <alias>	alias_clause opt_alias_clause opt_alias_clause_for_join_using
@@ -644,6 +644,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <list>		wait_list
+%type <node>		WaitEvent wait_for
 
 %type <node>	json_format_clause
 				json_format_clause_opt
@@ -729,7 +731,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -763,7 +765,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	SUBSCRIPTION SUBSTRING SUPPORT SYMMETRIC SYSID SYSTEM_P SYSTEM_USER
 
 	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN
-	TIES TIME TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
+	TIES TIME TIMEOUT TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
@@ -773,7 +775,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW
+	WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1102,6 +1105,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10921,12 +10925,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11025,12 +11030,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15862,6 +15868,74 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				WAIT FOR <event> [, <event> ...]
+ *				event is one of:
+ *					LSN value
+ *					TIMEOUT delay
+ *					timestamp
+ *
+ *****************************************************************************/
+WaitStmt:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			;
+wait_for:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; };
+
+wait_strategy:
+			ALL					{ $$ = WAIT_FOR_ALL; }
+			| ANY				{ $$ = WAIT_FOR_ANY; }
+			| /* EMPTY */		{ $$ = WAIT_FOR_ALL; }
+		;
+
+wait_list:
+			WaitEvent					{ $$ = list_make1($1); }
+			| wait_list ',' WaitEvent	{ $$ = lappend($1, $3); }
+			| wait_list WaitEvent		{ $$ = lappend($1, $2); }
+		;
+
+WaitEvent:
+			LSN Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = 0;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| TIMEOUT Iconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = $2;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| ConstDatetime Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = 0;
+					n->time = makeStringConstCast($2, @2, $1);
+					$$ = (Node *)n;
+				}
+			;
+
 
 /*
  * Aggregate decoration clauses
@@ -17266,6 +17340,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -17398,6 +17473,7 @@ unreserved_keyword:
 			| TEMPORARY
 			| TEXT_P
 			| TIES
+			| TIMEOUT
 			| TRANSACTION
 			| TRANSFORM
 			| TRIGGER
@@ -17424,6 +17500,7 @@ unreserved_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -17856,6 +17933,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18018,6 +18096,7 @@ bare_label_keyword:
 			| THEN
 			| TIES
 			| TIME
+			| TIMEOUT
 			| TIMESTAMP
 			| TRAILING
 			| TRANSACTION
@@ -18055,6 +18134,7 @@ bare_label_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e5119ed55d..06d165feea 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -26,6 +26,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -149,6 +150,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -351,6 +353,11 @@ CreateOrAttachShmemStructs(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
+
+	/*
+	 * Init array of Latches in shared memory for WAIT
+	 */
+	WaitShmemInit();
 }
 
 /*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 8de821f960..e884828011 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2847,6 +2865,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3495,6 +3517,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b3181f34ae..67ef9eb8ad 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3525,6 +3525,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node	   *wait;			/* WAIT clause: list of events to wait for */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4076,4 +4077,26 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		WAIT FOR Statement + WAIT FOR clause of BEGIN statement
+ *		TODO: if we only pick one, remove the other
+ * ----------------------
+ */
+
+typedef enum WaitForStrategy
+{
+	WAIT_FOR_ANY = 0,
+	WAIT_FOR_ALL
+} WaitForStrategy;
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	WaitForStrategy strategy;
+	List	   *events;		/* used as a pointer to the next WAIT event */
+	char	   *lsn;		/* WAIT FOR LSN */
+	int			delay;		/* WAIT FOR TIMEOUT */
+	Node	   *time;		/* WAIT FOR TIMESTAMP or TIME */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 2331acac09..ae7b65526b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -260,6 +260,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -433,6 +434,7 @@ PG_KEYWORD("text", TEXT_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("then", THEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("ties", TIES, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("time", TIME, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("timeout", TIMEOUT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("timestamp", TIMESTAMP, COL_NAME_KEYWORD, BARE_LABEL)
 PG_KEYWORD("to", TO, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("trailing", TRAILING, RESERVED_KEYWORD, BARE_LABEL)
@@ -473,6 +475,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd9..295cd6ff3a 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT FOR", false, false, false)
diff --git a/src/test/recovery/t/040_begin_wait.pl b/src/test/recovery/t/040_begin_wait.pl
new file mode 100644
index 0000000000..f1e5b5b23d
--- /dev/null
+++ b/src/test/recovery/t/040_begin_wait.pl
@@ -0,0 +1,146 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "WAIT FOR" command
+#===============================================================================
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ WAIT FOR LSN '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/041_wait.pl b/src/test/recovery/t/041_wait.pl
new file mode 100644
index 0000000000..6f9d549416
--- /dev/null
+++ b/src/test/recovery/t/041_wait.pl
@@ -0,0 +1,145 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
wait_after_within_v3.patchtext/x-diff; name=wait_after_within_v3.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..410018b79b 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..138e52c563 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..8ab86a83b5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1784,6 +1785,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			* If we replayed an LSN that someone was waiting for,
+			* set latches in shared memory array to notify the waiter.
+			*/
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+			{
+				 WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..3f06dc5341 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..5baa6e2ee3
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,338 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (state->min_lsn.value == lsn_to_delete)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < state->min_lsn.value)
+				state->min_lsn.value = state->waited_lsn[i];
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		state->min_lsn.value = PG_UINT64_MAX;
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return pg_atomic_read_u64(&state->min_lsn);
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int		res = 0;
+
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												 CStringGetDatum(stmt->lsn)));
+	res = WaitUtility(lsn, stmt->delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 06fc8ce98b..f1d86ff435 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -405,7 +405,6 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
 		default:
 
 			/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6b88096e8e..da090b1285 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -644,6 +644,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <ival>		wait_time
+%type <node>		wait_for
 
 %type <node>	json_format_clause
 				json_format_clause_opt
@@ -1102,6 +1104,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10921,12 +10924,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11025,12 +11029,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15862,6 +15867,37 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				AFTER LSN_value [WITHIN delay timestamp]
+ *
+ *****************************************************************************/
+WaitStmt:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+		;
+wait_for:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; }
+		;
+
+wait_time:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 
 /*
  * Aggregate decoration clauses
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e5119ed55d..ff6a3db6a6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -26,6 +26,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -149,6 +150,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -241,6 +243,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 8de821f960..e884828011 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2847,6 +2865,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3495,6 +3517,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b3181f34ae..98379ae97e 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3525,6 +3525,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node		*wait;			/* wait lsn clause */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4076,4 +4077,16 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		AFTER Statement + AFTER clause of BEGIN statement
+ * ----------------------
+ */
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn;		/* LSN */
+	int			delay;		/* TIMEOUT */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd9..567139963a 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "AFTER", false, false, false)
diff --git a/src/test/recovery/t/040_begin_after.pl b/src/test/recovery/t/040_begin_after.pl
new file mode 100644
index 0000000000..b22e63603c
--- /dev/null
+++ b/src/test/recovery/t/040_begin_after.pl
@@ -0,0 +1,86 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "AFTER" command
+#===============================================================================
+# We need to check that AFTER works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ AFTER '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/041_after.pl b/src/test/recovery/t/041_after.pl
new file mode 100644
index 0000000000..480ba6b33b
--- /dev/null
+++ b/src/test/recovery/t/041_after.pl
@@ -0,0 +1,99 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 2;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"AFTER '$lsn2' WITHIN 1");
+ok($reached_lsn eq "f", "AFTER doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	AFTER '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#21Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Kartyshov Ivan (#20)
2 attachment(s)

Add some fixes and rebase.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

wait_after_within_v4.patchtext/x-diff; name=wait_after_within_v4.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..657a217e27 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..04e17620dd 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..8ab86a83b5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1784,6 +1785,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			* If we replayed an LSN that someone was waiting for,
+			* set latches in shared memory array to notify the waiter.
+			*/
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+			{
+				 WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..3f06dc5341 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..880f141cf5
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,338 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < pg_atomic_read_u64(&state->min_lsn))
+		pg_atomic_write_u64(&state->min_lsn,lsn_to_wait);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (pg_atomic_read_u64(&state->min_lsn) == lsn_to_delete)
+	{
+		pg_atomic_write_u64(&state->min_lsn,PG_UINT64_MAX);
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < pg_atomic_read_u64(&state->min_lsn))
+				pg_atomic_write_u64(&state->min_lsn,state->waited_lsn[i]);
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		pg_atomic_init_u64(&state->min_lsn,PG_UINT64_MAX);
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return pg_atomic_read_u64(&state->min_lsn);
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int		res = 0;
+
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												 CStringGetDatum(stmt->lsn)));
+	res = WaitUtility(lsn, stmt->delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 06fc8ce98b..f1d86ff435 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -405,7 +405,6 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
 		default:
 
 			/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3460fea56b..4587d5a23d 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -645,6 +645,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <ival>		wait_time
+%type <node>		wait_for
 
 %type <node>	json_format_clause
 				json_format_clause_opt
@@ -1103,6 +1105,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10927,12 +10930,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11031,12 +11035,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15868,6 +15873,37 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				AFTER LSN_value [WITHIN delay timestamp]
+ *
+ *****************************************************************************/
+WaitStmt:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+		;
+wait_for:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; }
+		;
+
+wait_time:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 
 /*
  * Aggregate decoration clauses
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e5119ed55d..ff6a3db6a6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -26,6 +26,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -149,6 +150,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -241,6 +243,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 8de821f960..e884828011 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2847,6 +2865,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3495,6 +3517,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b3181f34ae..98379ae97e 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3525,6 +3525,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node		*wait;			/* wait lsn clause */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4076,4 +4077,16 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		AFTER Statement + AFTER clause of BEGIN statement
+ * ----------------------
+ */
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn;		/* LSN */
+	int			delay;		/* TIMEOUT */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd9..567139963a 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "AFTER", false, false, false)
diff --git a/src/test/recovery/t/040_begin_after.pl b/src/test/recovery/t/040_begin_after.pl
new file mode 100644
index 0000000000..b22e63603c
--- /dev/null
+++ b/src/test/recovery/t/040_begin_after.pl
@@ -0,0 +1,86 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "AFTER" command
+#===============================================================================
+# We need to check that AFTER works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ AFTER '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/041_after.pl b/src/test/recovery/t/041_after.pl
new file mode 100644
index 0000000000..480ba6b33b
--- /dev/null
+++ b/src/test/recovery/t/041_after.pl
@@ -0,0 +1,99 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 2;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"AFTER '$lsn2' WITHIN 1");
+ok($reached_lsn eq "f", "AFTER doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	AFTER '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
wait_classic_v5.patchtext/x-diff; name=wait_classic_v5.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..657a217e27 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..b3af16c09f 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..1b54ed2084 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 0000000000..26cae3ad85
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,146 @@
+<!--
+doc/src/sgml/ref/wait.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait">
+ <indexterm zone="sql-wait">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed or for specified time to pass</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>where <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN value
+    TIMEOUT number_of_milliseconds
+    timestamp
+
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>'
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ALL LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>, TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ANY LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>WAIT FOR</command> provides a simple interprocess communication
+   mechanism to wait for timestamp, timeout or the target log sequence number
+   (<acronym>LSN</acronym>) on standby in <productname>PostgreSQL</productname>
+   databases with master-standby asynchronous replication. When run with the
+   <replaceable>LSN</replaceable> option, the <command>WAIT FOR</command>
+   command waits for the specified <acronym>LSN</acronym> to be replayed.
+   If no timestamp or timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using <literal>Ctrl+C</literal>, or
+   by shutting down the <literal>postgres</literal> server.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">LSN</replaceable></term>
+    <listitem>
+     <para>
+      Specify the target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>TIMEOUT <replaceable class="parameter">wait_timeout</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time interval to wait for the LSN to be replayed.
+      The specified <replaceable>wait_timeout</replaceable> must be an integer
+      and is measured in milliseconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>UNTIL TIMESTAMP <replaceable class="parameter">wait_time</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time to wait for the LSN to be replayed.
+      The specified <replaceable>wait_time</replaceable> must be timestamp.
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Run <literal>WAIT FOR</literal> from <application>psql</application>,
+   limiting wait time to 10000 milliseconds:
+
+<screen>
+WAIT FOR LSN '0/3F07A6B1' TIMEOUT 10000;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-------------
+ f
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Wait until the specified <acronym>LSN</acronym> is replayed:
+<screen>
+WAIT FOR LSN '0/3F07A611';
+LSN reached
+-------------
+ t
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Limit <acronym>LSN</acronym> wait time to 500000 milliseconds,
+   and then cancel the command if <acronym>LSN</acronym> was not reached:
+<screen>
+WAIT FOR LSN '0/3F0FF791' TIMEOUT 500000;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-------------
+ f
+(1 row)
+</screen>
+</para>
+ </refsect1>
+
+ <refsect1>
+  <title>Compatibility</title>
+
+  <para>
+   There is no <command>WAIT FOR</command> statement in the SQL
+   standard.
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..04e17620dd 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1b48d7171a..d27105b7a3 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1784,6 +1785,13 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..3f06dc5341 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..83f069b8dc
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,403 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < pg_atomic_read_u64(&state->min_lsn))
+		pg_atomic_write_u64(&state->min_lsn,lsn_to_wait);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (pg_atomic_read_u64(&state->min_lsn) == lsn_to_delete)
+	{
+		pg_atomic_write_u64(&state->min_lsn,PG_UINT64_MAX);
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < pg_atomic_read_u64(&state->min_lsn))
+				pg_atomic_write_u64(&state->min_lsn,state->waited_lsn[i]);
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		pg_atomic_init_u64(&state->min_lsn,PG_UINT64_MAX);
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return pg_atomic_read_u64(&state->min_lsn);
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	ListCell   *events;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	float8		delay = 0;
+	float8		final_delay = 0;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	XLogRecPtr	final_lsn = InvalidXLogRecPtr;
+	bool		has_lsn = false;
+	bool		wait_forever = true;
+	int			res = 0;
+
+	if (stmt->strategy == WAIT_FOR_ANY)
+	{
+		/* Prepare to find minimum LSN and delay */
+		final_delay = DBL_MAX;
+		final_lsn = PG_UINT64_MAX;
+	}
+
+	/* Extract options from the statement node tree */
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+
+		/* LSN to wait for */
+		if (event->lsn)
+		{
+			has_lsn = true;
+			lsn = DatumGetLSN(
+						DirectFunctionCall1(pg_lsn_in,
+							CStringGetDatum(event->lsn)));
+
+			/*
+			 * When waiting for ALL, select max LSN to wait for.
+			 * When waiting for ANY, select min LSN to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_lsn <= lsn) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_lsn > lsn))
+			{
+				final_lsn = lsn;
+			}
+		}
+
+		/* Time delay to wait for */
+		if (event->time || event->delay)
+		{
+			wait_forever = false;
+
+			if (event->delay)
+				delay = event->delay / 1000.0;
+
+			if (event->time)
+			{
+				Const *time = (Const *) event->time;
+				delay = WaitTimeResolve(time);
+			}
+
+			/*
+			 * When waiting for ALL, select max delay to wait for.
+			 * When waiting for ANY, select min delay to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_delay <= delay) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_delay > delay))
+			{
+				final_delay = delay;
+			}
+		}
+	}
+	if (wait_forever)
+		final_delay = 0;
+	if (!has_lsn)
+		final_lsn = InvalidXLogRecPtr;
+
+	res = WaitUtility(final_lsn, final_delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 06fc8ce98b..51de487431 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -85,6 +85,7 @@ static Query *transformCreateTableAsStmt(ParseState *pstate,
 										 CreateTableAsStmt *stmt);
 static Query *transformCallStmt(ParseState *pstate,
 								CallStmt *stmt);
+static void transformWaitForStmt(ParseState *pstate, WaitStmt *stmt);
 static void transformLockingClause(ParseState *pstate, Query *qry,
 								   LockingClause *lc, bool pushedDown);
 #ifdef RAW_EXPRESSION_COVERAGE_TEST
@@ -405,7 +406,21 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
+		case T_WaitStmt:
+			transformWaitForStmt(pstate, (WaitStmt *) parseTree);
+			result = makeNode(Query);
+			result->commandType = CMD_UTILITY;
+			result->utilityStmt = (Node *) parseTree;
+			break;
+		case T_TransactionStmt:
+			{
+				TransactionStmt *stmt = (TransactionStmt *) parseTree;
+				if ((stmt->kind == TRANS_STMT_BEGIN ||
+						stmt->kind == TRANS_STMT_START) && stmt->wait)
+					transformWaitForStmt(pstate, (WaitStmt *) stmt->wait);
+			}
+			/* no break here - we want to fall through to the default */
+			/* FALLTHROUGH */
 		default:
 
 			/*
@@ -3559,6 +3574,23 @@ applyLockingClause(Query *qry, Index rtindex,
 	qry->rowMarks = lappend(qry->rowMarks, rc);
 }
 
+/*
+ * transformWaitForStmt -
+ *	transform the WAIT FOR clause of the BEGIN statement
+ *	transform the WAIT FOR statement (TODO: remove this line if we don't keep it)
+ */
+static void
+transformWaitForStmt(ParseState *pstate, WaitStmt *stmt)
+{
+	ListCell   *events;
+
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+		event->time = transformExpr(pstate, event->time, EXPR_KIND_OTHER);
+	}
+}
+
 /*
  * Coverage testing for raw_expression_tree_walker().
  *
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3460fea56b..343e5ba856 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -537,7 +537,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	case_expr case_arg when_clause case_default
 %type <list>	when_clause_list
 %type <node>	opt_search_clause opt_cycle_clause
-%type <ival>	sub_type opt_materialized
+%type <ival>	sub_type wait_strategy opt_materialized
 %type <node>	NumericOnly
 %type <list>	NumericOnly_list
 %type <alias>	alias_clause opt_alias_clause opt_alias_clause_for_join_using
@@ -645,6 +645,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <list>		wait_list
+%type <node>		WaitEvent wait_for
 
 %type <node>	json_format_clause
 				json_format_clause_opt
@@ -730,7 +732,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -764,7 +766,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	SUBSCRIPTION SUBSTRING SUPPORT SYMMETRIC SYSID SYSTEM_P SYSTEM_USER
 
 	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN
-	TIES TIME TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
+	TIES TIME TIMEOUT TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
@@ -774,7 +776,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW
+	WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1103,6 +1106,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10927,12 +10931,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11031,12 +11036,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15868,6 +15874,74 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				WAIT FOR <event> [, <event> ...]
+ *				event is one of:
+ *					LSN value
+ *					TIMEOUT delay
+ *					timestamp
+ *
+ *****************************************************************************/
+WaitStmt:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			;
+wait_for:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; };
+
+wait_strategy:
+			ALL					{ $$ = WAIT_FOR_ALL; }
+			| ANY				{ $$ = WAIT_FOR_ANY; }
+			| /* EMPTY */		{ $$ = WAIT_FOR_ALL; }
+		;
+
+wait_list:
+			WaitEvent					{ $$ = list_make1($1); }
+			| wait_list ',' WaitEvent	{ $$ = lappend($1, $3); }
+			| wait_list WaitEvent		{ $$ = lappend($1, $2); }
+		;
+
+WaitEvent:
+			LSN Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = 0;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| TIMEOUT Iconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = $2;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| ConstDatetime Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = 0;
+					n->time = makeStringConstCast($2, @2, $1);
+					$$ = (Node *)n;
+				}
+			;
+
 
 /*
  * Aggregate decoration clauses
@@ -17272,6 +17346,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -17404,6 +17479,7 @@ unreserved_keyword:
 			| TEMPORARY
 			| TEXT_P
 			| TIES
+			| TIMEOUT
 			| TRANSACTION
 			| TRANSFORM
 			| TRIGGER
@@ -17430,6 +17506,7 @@ unreserved_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -17862,6 +17939,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18024,6 +18102,7 @@ bare_label_keyword:
 			| THEN
 			| TIES
 			| TIME
+			| TIMEOUT
 			| TIMESTAMP
 			| TRAILING
 			| TRANSACTION
@@ -18061,6 +18140,7 @@ bare_label_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e5119ed55d..06d165feea 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -26,6 +26,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -149,6 +150,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, AsyncShmemSize());
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -351,6 +353,11 @@ CreateOrAttachShmemStructs(void)
 	AsyncShmemInit();
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
+
+	/*
+	 * Init array of Latches in shared memory for WAIT
+	 */
+	WaitShmemInit();
 }
 
 /*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 8de821f960..e884828011 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2847,6 +2865,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3495,6 +3517,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b3181f34ae..67ef9eb8ad 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3525,6 +3525,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node	   *wait;			/* WAIT clause: list of events to wait for */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4076,4 +4077,26 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		WAIT FOR Statement + WAIT FOR clause of BEGIN statement
+ *		TODO: if we only pick one, remove the other
+ * ----------------------
+ */
+
+typedef enum WaitForStrategy
+{
+	WAIT_FOR_ANY = 0,
+	WAIT_FOR_ALL
+} WaitForStrategy;
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	WaitForStrategy strategy;
+	List	   *events;		/* used as a pointer to the next WAIT event */
+	char	   *lsn;		/* WAIT FOR LSN */
+	int			delay;		/* WAIT FOR TIMEOUT */
+	Node	   *time;		/* WAIT FOR TIMESTAMP or TIME */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 2331acac09..ae7b65526b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -260,6 +260,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -433,6 +434,7 @@ PG_KEYWORD("text", TEXT_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("then", THEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("ties", TIES, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("time", TIME, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("timeout", TIMEOUT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("timestamp", TIMESTAMP, COL_NAME_KEYWORD, BARE_LABEL)
 PG_KEYWORD("to", TO, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("trailing", TRAILING, RESERVED_KEYWORD, BARE_LABEL)
@@ -473,6 +475,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd9..295cd6ff3a 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT FOR", false, false, false)
diff --git a/src/test/recovery/t/040_begin_wait.pl b/src/test/recovery/t/040_begin_wait.pl
new file mode 100644
index 0000000000..f1e5b5b23d
--- /dev/null
+++ b/src/test/recovery/t/040_begin_wait.pl
@@ -0,0 +1,146 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "WAIT FOR" command
+#===============================================================================
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ WAIT FOR LSN '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/041_wait.pl b/src/test/recovery/t/041_wait.pl
new file mode 100644
index 0000000000..6f9d549416
--- /dev/null
+++ b/src/test/recovery/t/041_wait.pl
@@ -0,0 +1,145 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#22Peter Smith
smithpb2250@gmail.com
In reply to: Kartyshov Ivan (#21)

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1]https://commitfest.postgresql.org/46/4221/, but it seems
there was a CFbot test failure last time it was run [2]https://cirrus-ci.com/task/5618308515364864. Please have a
look and post an updated version if necessary.

======
[1]: https://commitfest.postgresql.org/46/4221/
[2]: https://cirrus-ci.com/task/5618308515364864

Kind Regards,
Peter Smith.

#23Dilip Kumar
dilipbalaut@gmail.com
In reply to: Kartyshov Ivan (#21)

On Wed, Jan 17, 2024 at 1:46 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Add some fixes and rebase.

While quickly looking into the patch, I understood the idea of what we
are trying to achieve here and I feel that a useful feature. But
while looking at both the patches I could not quickly differentiate
between these two approaches. I believe, internally at the core both
are implementing similar wait logic but providing different syntaxes.
So if we want to keep both these approaches open for the sake of
discussion then better first to create a patch that implements the
core approach i.e. the waiting logic and the other common part and
then add top-up patches with 2 different approaches that would be easy
for review. I also see in v4 that there is no documentation for the
syntax part so it makes it even harder to understand.

I think this thread is implementing a useful feature so my suggestion
is to add some documentation in v4 and also make it more readable
w.r.t. What are the clear differences between these two approaches,
maybe adding commit message will also help.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#24Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Kartyshov Ivan (#16)
2 attachment(s)
Fwd: Re: [HACKERS] make async slave to wait for lsn to be replayed

Updated, rebased, fixed Ci and added documentation.
We left two different solutions. Help me please to choose the best.

1) Classic (wait_classic_v6.patch)
/messages/by-id/3cc883048264c2e9af022033925ff8db@postgrespro.ru
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp

2) After style: Kyotaro and Freund (wait_after_within_v5.patch)
/messages/by-id/d3ff2e363af60b345f82396992595a03@postgrespro.ru
==========
advantages: no new words in grammar
disadvantages: a little harder to understand, fewer events to wait

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]

Regards

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

wait_after_within_v5.patchtext/x-diff; name=wait_after_within_v5.patchDownload
diff --git a/doc/src/sgml/ref/after.sgml b/doc/src/sgml/ref/after.sgml
new file mode 100644
index 0000000000..def0ec6be3
--- /dev/null
+++ b/doc/src/sgml/ref/after.sgml
@@ -0,0 +1,118 @@
+<!--
+doc/src/sgml/ref/after.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait">
+ <indexterm zone="sql-wait">
+  <primary>AFTER</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>AFTER</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>AFTER</refname>
+  <refpurpose>AFTER the target <acronym>LSN</acronym> to be replayed and for specified time to timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+AFTER <replaceable class="parameter">LSN</replaceable> [WITHIN timeout_in_milliseconds]
+
+AFTER LSN '<replaceable class="parameter">lsn_number</replaceable>'
+AFTER LSN '<replaceable class="parameter">lsn_number</replaceable>' WITHIN <replaceable class="parameter">wait_timeout</replaceable>
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>AFTER</command> provides a simple interprocess communication
+   mechanism to wait for the target log sequence number 
+   (<acronym>LSN</acronym>) on standby in <productname>PostgreSQL</productname>
+   databases with master-standby asynchronous replication. <command>AFTER</command>
+   command waits for the specified <acronym>LSN</acronym> to be replayed.
+   If no timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using <literal>Ctrl+C</literal>, or
+   by shutting down the <literal>postgres</literal> server.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">LSN</replaceable></term>
+    <listitem>
+     <para>
+      Specify the target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>TIMEOUT <replaceable class="parameter">wait_timeout</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time interval to AFTER the LSN to be replayed.
+      The specified <replaceable>wait_timeout</replaceable> must be an integer
+      and is measured in milliseconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Run <literal>AFTER</literal> from <application>psql</application>,
+   limiting wait time to 10000 milliseconds:
+
+<screen>
+AFTER '0/3F07A6B1' WITHIN 10000;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-------------
+ f
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Wait until the specified <acronym>LSN</acronym> is replayed:
+<screen>
+AFTER '0/3F07A611';
+LSN reached
+-------------
+ t
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Limit <acronym>LSN</acronym> wait time to 500000 milliseconds,
+   and then cancel the command if <acronym>LSN</acronym> was not reached:
+<screen>
+AFTER LSN '0/3F0FF791' WITHIN 500000;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-------------
+ f
+(1 row)
+</screen>
+</para>
+</refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..ff2e6e0f3f 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -6,6 +6,7 @@ Complete list of usable sgml source files in this directory.
 
 <!-- SQL commands -->
 <!ENTITY abort              SYSTEM "abort.sgml">
+<!ENTITY abort              SYSTEM "after.sgml">
 <!ENTITY alterAggregate     SYSTEM "alter_aggregate.sgml">
 <!ENTITY alterCollation     SYSTEM "alter_collation.sgml">
 <!ENTITY alterConversion    SYSTEM "alter_conversion.sgml">
@@ -188,6 +189,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..7c94cf9de1 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -34,6 +34,7 @@
   </partintro>
 
    &abort;
+   &after;
    &alterAggregate;
    &alterCollation;
    &alterConversion;
@@ -216,6 +217,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bb472da27..4d39a24c61 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1810,6 +1811,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			* If we replayed an LSN that someone was waiting for,
+			* set latches in shared memory array to notify the waiter.
+			*/
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+			{
+				 WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..3f06dc5341 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..da16d72d62
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,298 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < pg_atomic_read_u64(&state->min_lsn))
+		pg_atomic_write_u64(&state->min_lsn,lsn_to_wait);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (pg_atomic_read_u64(&state->min_lsn) == lsn_to_delete)
+	{
+		pg_atomic_write_u64(&state->min_lsn,PG_UINT64_MAX);
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < pg_atomic_read_u64(&state->min_lsn))
+				pg_atomic_write_u64(&state->min_lsn,state->waited_lsn[i]);
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		pg_atomic_init_u64(&state->min_lsn,PG_UINT64_MAX);
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return state ? pg_atomic_read_u64(&state->min_lsn) : PG_UINT64_MAX;
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint32		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int		res = 0;
+
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												 CStringGetDatum(stmt->lsn)));
+	res = WaitUtility(lsn, stmt->delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index dbdf6bf896..a25b7cd2cf 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -405,7 +405,6 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
 		default:
 
 			/*
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 130f7fc7c3..37c425dbe4 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -645,6 +645,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <ival>		wait_time
+%type <node>		wait_for
 
 %type <node>	json_format_clause
 				json_format_clause_opt
@@ -1103,6 +1105,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10934,12 +10937,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11038,12 +11042,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15875,6 +15880,37 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				AFTER LSN_value [WITHIN delay timestamp]
+ *
+ *****************************************************************************/
+WaitStmt:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+		;
+wait_for:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; }
+		;
+
+wait_time:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 
 /*
  * Aggregate decoration clauses
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7084e18861..9df6197df0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -26,6 +26,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -153,6 +154,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -245,6 +247,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 8de821f960..e884828011 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2847,6 +2865,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3495,6 +3517,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 476d55dd24..753f3c4fb3 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3527,6 +3527,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node		*wait;			/* wait lsn clause */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4078,4 +4079,16 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		AFTER Statement + AFTER clause of BEGIN statement
+ * ----------------------
+ */
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn;		/* LSN */
+	int			delay;		/* TIMEOUT */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd9..567139963a 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "AFTER", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index bf087ac2a9..47e36ab6f2 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -46,6 +46,8 @@ tests += {
       't/038_save_logical_slots_shutdown.pl',
       't/039_end_of_wal.pl',
       't/040_standby_failover_slots_sync.pl',
+      't/041_begin_after.pl',
+      't/042_after.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/040_begin_after.pl b/src/test/recovery/t/040_begin_after.pl
new file mode 100644
index 0000000000..b22e63603c
--- /dev/null
+++ b/src/test/recovery/t/040_begin_after.pl
@@ -0,0 +1,86 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "AFTER" command
+#===============================================================================
+# We need to check that AFTER works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ AFTER '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/041_after.pl b/src/test/recovery/t/041_after.pl
new file mode 100644
index 0000000000..480ba6b33b
--- /dev/null
+++ b/src/test/recovery/t/041_after.pl
@@ -0,0 +1,99 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 2;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary AFTER");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"AFTER '$lsn2' WITHIN 1");
+ok($reached_lsn eq "f", "AFTER doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	AFTER '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
wait_classic_v6.patchtext/x-diff; name=wait_classic_v6.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..657a217e27 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..b3af16c09f 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..1b54ed2084 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 0000000000..26cae3ad85
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,146 @@
+<!--
+doc/src/sgml/ref/wait.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait">
+ <indexterm zone="sql-wait">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed or for specified time to pass</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>where <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN value
+    TIMEOUT number_of_milliseconds
+    timestamp
+
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>'
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ALL LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>, TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ANY LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>WAIT FOR</command> provides a simple interprocess communication
+   mechanism to wait for timestamp, timeout or the target log sequence number
+   (<acronym>LSN</acronym>) on standby in <productname>PostgreSQL</productname>
+   databases with master-standby asynchronous replication. When run with the
+   <replaceable>LSN</replaceable> option, the <command>WAIT FOR</command>
+   command waits for the specified <acronym>LSN</acronym> to be replayed.
+   If no timestamp or timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using <literal>Ctrl+C</literal>, or
+   by shutting down the <literal>postgres</literal> server.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">LSN</replaceable></term>
+    <listitem>
+     <para>
+      Specify the target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>TIMEOUT <replaceable class="parameter">wait_timeout</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time interval to wait for the LSN to be replayed.
+      The specified <replaceable>wait_timeout</replaceable> must be an integer
+      and is measured in milliseconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>UNTIL TIMESTAMP <replaceable class="parameter">wait_time</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time to wait for the LSN to be replayed.
+      The specified <replaceable>wait_time</replaceable> must be timestamp.
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Run <literal>WAIT FOR</literal> from <application>psql</application>,
+   limiting wait time to 10000 milliseconds:
+
+<screen>
+WAIT FOR LSN '0/3F07A6B1' TIMEOUT 10000;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-------------
+ f
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Wait until the specified <acronym>LSN</acronym> is replayed:
+<screen>
+WAIT FOR LSN '0/3F07A611';
+LSN reached
+-------------
+ t
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Limit <acronym>LSN</acronym> wait time to 500000 milliseconds,
+   and then cancel the command if <acronym>LSN</acronym> was not reached:
+<screen>
+WAIT FOR LSN '0/3F0FF791' TIMEOUT 500000;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-------------
+ f
+(1 row)
+</screen>
+</para>
+ </refsect1>
+
+ <refsect1>
+  <title>Compatibility</title>
+
+  <para>
+   There is no <command>WAIT FOR</command> statement in the SQL
+   standard.
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..04e17620dd 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0bb472da27..9415ec1f1e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1810,6 +1811,13 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..3f06dc5341 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..e74eaeb2bc
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,403 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2020, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/backendid.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->waited_lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < pg_atomic_read_u64(&state->min_lsn))
+		pg_atomic_write_u64(&state->min_lsn,lsn_to_wait);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyBackendId];
+
+	state->waited_lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (pg_atomic_read_u64(&state->min_lsn) == lsn_to_delete)
+	{
+		pg_atomic_write_u64(&state->min_lsn,PG_UINT64_MAX);
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < pg_atomic_read_u64(&state->min_lsn))
+				pg_atomic_write_u64(&state->min_lsn,state->waited_lsn[i]);
+	}
+
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < (MaxBackends + 1); i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		pg_atomic_init_u64(&state->min_lsn,PG_UINT64_MAX);
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 2; i <= backend_maxid; i++)
+	{
+		backend = BackendIdGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return state ? pg_atomic_read_u64(&state->min_lsn) : PG_UINT64_MAX;
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint32		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	ListCell   *events;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	float8		delay = 0;
+	float8		final_delay = 0;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	XLogRecPtr	final_lsn = InvalidXLogRecPtr;
+	bool		has_lsn = false;
+	bool		wait_forever = true;
+	int			res = 0;
+
+	if (stmt->strategy == WAIT_FOR_ANY)
+	{
+		/* Prepare to find minimum LSN and delay */
+		final_delay = DBL_MAX;
+		final_lsn = PG_UINT64_MAX;
+	}
+
+	/* Extract options from the statement node tree */
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+
+		/* LSN to wait for */
+		if (event->lsn)
+		{
+			has_lsn = true;
+			lsn = DatumGetLSN(
+						DirectFunctionCall1(pg_lsn_in,
+							CStringGetDatum(event->lsn)));
+
+			/*
+			 * When waiting for ALL, select max LSN to wait for.
+			 * When waiting for ANY, select min LSN to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_lsn <= lsn) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_lsn > lsn))
+			{
+				final_lsn = lsn;
+			}
+		}
+
+		/* Time delay to wait for */
+		if (event->time || event->delay)
+		{
+			wait_forever = false;
+
+			if (event->delay)
+				delay = event->delay / 1000.0;
+
+			if (event->time)
+			{
+				Const *time = (Const *) event->time;
+				delay = WaitTimeResolve(time);
+			}
+
+			/*
+			 * When waiting for ALL, select max delay to wait for.
+			 * When waiting for ANY, select min delay to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_delay <= delay) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_delay > delay))
+			{
+				final_delay = delay;
+			}
+		}
+	}
+	if (wait_forever)
+		final_delay = 0;
+	if (!has_lsn)
+		final_lsn = InvalidXLogRecPtr;
+
+	res = WaitUtility(final_lsn, final_delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index dbdf6bf896..a3b3047018 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -85,6 +85,7 @@ static Query *transformCreateTableAsStmt(ParseState *pstate,
 										 CreateTableAsStmt *stmt);
 static Query *transformCallStmt(ParseState *pstate,
 								CallStmt *stmt);
+static void transformWaitForStmt(ParseState *pstate, WaitStmt *stmt);
 static void transformLockingClause(ParseState *pstate, Query *qry,
 								   LockingClause *lc, bool pushedDown);
 #ifdef RAW_EXPRESSION_COVERAGE_TEST
@@ -405,7 +406,21 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
+		case T_WaitStmt:
+			transformWaitForStmt(pstate, (WaitStmt *) parseTree);
+			result = makeNode(Query);
+			result->commandType = CMD_UTILITY;
+			result->utilityStmt = (Node *) parseTree;
+			break;
+		case T_TransactionStmt:
+			{
+				TransactionStmt *stmt = (TransactionStmt *) parseTree;
+				if ((stmt->kind == TRANS_STMT_BEGIN ||
+						stmt->kind == TRANS_STMT_START) && stmt->wait)
+					transformWaitForStmt(pstate, (WaitStmt *) stmt->wait);
+			}
+			/* no break here - we want to fall through to the default */
+			/* FALLTHROUGH */
 		default:
 
 			/*
@@ -3562,6 +3577,23 @@ applyLockingClause(Query *qry, Index rtindex,
 	qry->rowMarks = lappend(qry->rowMarks, rc);
 }
 
+/*
+ * transformWaitForStmt -
+ *	transform the WAIT FOR clause of the BEGIN statement
+ *	transform the WAIT FOR statement (TODO: remove this line if we don't keep it)
+ */
+static void
+transformWaitForStmt(ParseState *pstate, WaitStmt *stmt)
+{
+	ListCell   *events;
+
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+		event->time = transformExpr(pstate, event->time, EXPR_KIND_OTHER);
+	}
+}
+
 /*
  * Coverage testing for raw_expression_tree_walker().
  *
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 130f7fc7c3..dfdf6d2d78 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -537,7 +537,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	case_expr case_arg when_clause case_default
 %type <list>	when_clause_list
 %type <node>	opt_search_clause opt_cycle_clause
-%type <ival>	sub_type opt_materialized
+%type <ival>	sub_type wait_strategy opt_materialized
 %type <node>	NumericOnly
 %type <list>	NumericOnly_list
 %type <alias>	alias_clause opt_alias_clause opt_alias_clause_for_join_using
@@ -645,6 +645,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <list>		wait_list
+%type <node>		WaitEvent wait_for
 
 %type <node>	json_format_clause
 				json_format_clause_opt
@@ -730,7 +732,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -764,7 +766,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	SUBSCRIPTION SUBSTRING SUPPORT SYMMETRIC SYSID SYSTEM_P SYSTEM_USER
 
 	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN
-	TIES TIME TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
+	TIES TIME TIMEOUT TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
@@ -774,7 +776,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW
+	WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1103,6 +1106,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10934,12 +10938,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11038,12 +11043,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15875,6 +15881,74 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				WAIT FOR <event> [, <event> ...]
+ *				event is one of:
+ *					LSN value
+ *					TIMEOUT delay
+ *					timestamp
+ *
+ *****************************************************************************/
+WaitStmt:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			;
+wait_for:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; };
+
+wait_strategy:
+			ALL					{ $$ = WAIT_FOR_ALL; }
+			| ANY				{ $$ = WAIT_FOR_ANY; }
+			| /* EMPTY */		{ $$ = WAIT_FOR_ALL; }
+		;
+
+wait_list:
+			WaitEvent					{ $$ = list_make1($1); }
+			| wait_list ',' WaitEvent	{ $$ = lappend($1, $3); }
+			| wait_list WaitEvent		{ $$ = lappend($1, $2); }
+		;
+
+WaitEvent:
+			LSN Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = 0;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| TIMEOUT Iconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = $2;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| ConstDatetime Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = 0;
+					n->time = makeStringConstCast($2, @2, $1);
+					$$ = (Node *)n;
+				}
+			;
+
 
 /*
  * Aggregate decoration clauses
@@ -17279,6 +17353,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -17411,6 +17486,7 @@ unreserved_keyword:
 			| TEMPORARY
 			| TEXT_P
 			| TIES
+			| TIMEOUT
 			| TRANSACTION
 			| TRANSFORM
 			| TRIGGER
@@ -17437,6 +17513,7 @@ unreserved_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -17869,6 +17946,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18031,6 +18109,7 @@ bare_label_keyword:
 			| THEN
 			| TIES
 			| TIME
+			| TIMEOUT
 			| TIMESTAMP
 			| TRAILING
 			| TRANSACTION
@@ -18068,6 +18147,7 @@ bare_label_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7084e18861..91130709b3 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -26,6 +26,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -153,6 +154,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, StatsShmemSize());
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -357,6 +359,11 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
 	InjectionPointShmemInit();
+
+	/*
+	 * Init array of Latches in shared memory for WAIT
+	 */
+	WaitShmemInit();
 }
 
 /*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 8de821f960..e884828011 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -59,6 +60,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -72,6 +74,9 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -272,6 +277,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -612,6 +618,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1069,6 +1080,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2847,6 +2865,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3495,6 +3517,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..0270160d44
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 476d55dd24..fd246365ef 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3527,6 +3527,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node	   *wait;			/* WAIT clause: list of events to wait for */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4078,4 +4079,26 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		WAIT FOR Statement + WAIT FOR clause of BEGIN statement
+ *		TODO: if we only pick one, remove the other
+ * ----------------------
+ */
+
+typedef enum WaitForStrategy
+{
+	WAIT_FOR_ANY = 0,
+	WAIT_FOR_ALL
+} WaitForStrategy;
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	WaitForStrategy strategy;
+	List	   *events;		/* used as a pointer to the next WAIT event */
+	char	   *lsn;		/* WAIT FOR LSN */
+	int			delay;		/* WAIT FOR TIMEOUT */
+	Node	   *time;		/* WAIT FOR TIMESTAMP or TIME */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 2331acac09..ae7b65526b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -260,6 +260,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -433,6 +434,7 @@ PG_KEYWORD("text", TEXT_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("then", THEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("ties", TIES, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("time", TIME, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("timeout", TIMEOUT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("timestamp", TIMESTAMP, COL_NAME_KEYWORD, BARE_LABEL)
 PG_KEYWORD("to", TO, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("trailing", TRAILING, RESERVED_KEYWORD, BARE_LABEL)
@@ -473,6 +475,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd9..295cd6ff3a 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT FOR", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index bf087ac2a9..6f5f2603a1 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -46,6 +46,8 @@ tests += {
       't/038_save_logical_slots_shutdown.pl',
       't/039_end_of_wal.pl',
       't/040_standby_failover_slots_sync.pl',
+      't/041_begin_wait.pl',
+      't/042_wait.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/040_begin_wait.pl b/src/test/recovery/t/040_begin_wait.pl
new file mode 100644
index 0000000000..f1e5b5b23d
--- /dev/null
+++ b/src/test/recovery/t/040_begin_wait.pl
@@ -0,0 +1,146 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "WAIT FOR" command
+#===============================================================================
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ WAIT FOR LSN '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/041_wait.pl b/src/test/recovery/t/041_wait.pl
new file mode 100644
index 0000000000..6f9d549416
--- /dev/null
+++ b/src/test/recovery/t/041_wait.pl
@@ -0,0 +1,145 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq 0, "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq 0,
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#25Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Kartyshov Ivan (#24)
2 attachment(s)

Intro
==========
The main purpose of the feature is to achieve
read-your-writes-consistency, while using async replica for reads and
primary for writes. In that case lsn of last modification is stored
inside application. We cannot store this lsn inside database, since
reads are distributed across all replicas and primary.

Two implementations of one feature
==========
We left two different solutions. Help me please to choose the best.

1) Classic (wait_classic_v7.patch)
/messages/by-id/3cc883048264c2e9af022033925ff8db@postgrespro.ru
Synopsis
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp

2) After style: Kyotaro and Freund (wait_after_within_v6.patch)
/messages/by-id/d3ff2e363af60b345f82396992595a03@postgrespro.ru
Synopsis
==========
advantages: no new words in grammar
disadvantages: a little harder to understand, fewer events to wait

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]

Examples
==========

primary standby
------- --------
postgresql.conf
recovery_min_apply_delay = 10s

CREATE TABLE tbl AS SELECT generate_series(1,10) AS a;
INSERT INTO tbl VALUES (generate_series(11, 20));
SELECT pg_current_wal_lsn();

BEGIN WAIT FOR LSN '0/3002AE8';
SELECT * FROM tbl; // read fresh insertions
COMMIT;

Rebased and ready for review.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

wait_classic_v7.patchtext/x-diff; name=wait_classic_v7.patchDownload
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..657a217e27 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..b3af16c09f 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..1b54ed2084 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,21 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_for_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_for_event</replaceable> is:</phrase>
+    WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>and <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN lsn_value
+    TIMEOUT number_of_milliseconds
+    timestamp
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 0000000000..26cae3ad85
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,146 @@
+<!--
+doc/src/sgml/ref/wait.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait">
+ <indexterm zone="sql-wait">
+  <primary>WAIT FOR</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>WAIT FOR</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>WAIT FOR</refname>
+  <refpurpose>wait for the target <acronym>LSN</acronym> to be replayed or for specified time to pass</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+WAIT FOR [ANY | ALL] <replaceable class="parameter">event</replaceable> [, ...]
+
+<phrase>where <replaceable class="parameter">event</replaceable> is one of:</phrase>
+    LSN value
+    TIMEOUT number_of_milliseconds
+    timestamp
+
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>'
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>
+WAIT FOR LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ALL LSN '<replaceable class="parameter">lsn_number</replaceable>' TIMEOUT <replaceable class="parameter">wait_timeout</replaceable>, TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+WAIT FOR ANY LSN '<replaceable class="parameter">lsn_number</replaceable>', TIMESTAMP <replaceable class="parameter">wait_time</replaceable>
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>WAIT FOR</command> provides a simple interprocess communication
+   mechanism to wait for timestamp, timeout or the target log sequence number
+   (<acronym>LSN</acronym>) on standby in <productname>PostgreSQL</productname>
+   databases with master-standby asynchronous replication. When run with the
+   <replaceable>LSN</replaceable> option, the <command>WAIT FOR</command>
+   command waits for the specified <acronym>LSN</acronym> to be replayed.
+   If no timestamp or timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using <literal>Ctrl+C</literal>, or
+   by shutting down the <literal>postgres</literal> server.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">LSN</replaceable></term>
+    <listitem>
+     <para>
+      Specify the target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>TIMEOUT <replaceable class="parameter">wait_timeout</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time interval to wait for the LSN to be replayed.
+      The specified <replaceable>wait_timeout</replaceable> must be an integer
+      and is measured in milliseconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>UNTIL TIMESTAMP <replaceable class="parameter">wait_time</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time to wait for the LSN to be replayed.
+      The specified <replaceable>wait_time</replaceable> must be timestamp.
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Run <literal>WAIT FOR</literal> from <application>psql</application>,
+   limiting wait time to 10000 milliseconds:
+
+<screen>
+WAIT FOR LSN '0/3F07A6B1' TIMEOUT 10000;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-------------
+ f
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Wait until the specified <acronym>LSN</acronym> is replayed:
+<screen>
+WAIT FOR LSN '0/3F07A611';
+LSN reached
+-------------
+ t
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Limit <acronym>LSN</acronym> wait time to 500000 milliseconds,
+   and then cancel the command if <acronym>LSN</acronym> was not reached:
+<screen>
+WAIT FOR LSN '0/3F0FF791' TIMEOUT 500000;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-------------
+ f
+(1 row)
+</screen>
+</para>
+ </refsect1>
+
+ <refsect1>
+  <title>Compatibility</title>
+
+  <para>
+   There is no <command>WAIT FOR</command> statement in the SQL
+   standard.
+  </para>
+ </refsect1>
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..04e17620dd 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -216,6 +216,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 853b540945..02906c586d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1825,6 +1826,13 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+				WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..3f06dc5341 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..8256240fa9
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,403 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2024, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyProcNumber)
+		state->backend_maxid = MyProcNumber;
+
+	state->waited_lsn[MyProcNumber] = lsn_to_wait;
+
+	if (lsn_to_wait < pg_atomic_read_u64(&state->min_lsn))
+		pg_atomic_write_u64(&state->min_lsn,lsn_to_wait);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyProcNumber];
+
+	state->waited_lsn[MyProcNumber] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (pg_atomic_read_u64(&state->min_lsn) == lsn_to_delete)
+	{
+		pg_atomic_write_u64(&state->min_lsn,PG_UINT64_MAX);
+		for (i = 1; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < pg_atomic_read_u64(&state->min_lsn))
+				pg_atomic_write_u64(&state->min_lsn,state->waited_lsn[i]);
+	}
+
+	if (state->backend_maxid == MyProcNumber)
+		for (i = (MyProcNumber); i >= 1; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < MaxBackends; i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		pg_atomic_init_u64(&state->min_lsn,PG_UINT64_MAX);
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 1; i <= backend_maxid; i++)
+	{
+		backend = ProcNumberGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return state ? pg_atomic_read_u64(&state->min_lsn) : PG_UINT64_MAX;
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint32		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/*
+ * Get the amount of seconds left till the specified time.
+ */
+float8
+WaitTimeResolve(Const *time)
+{
+	int			ret;
+	float8		val;
+	Oid			types[] = { time->consttype };
+	Datum		values[] = { time->constvalue };
+	char		nulls[] = { " " };
+	Datum		result;
+	bool		isnull;
+
+	SPI_connect();
+
+	if (time->consttype == 1083)
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()::time))",
+									1, types, values, nulls, true, 0);
+	else if (time->consttype == 1266)
+		ret = SPI_execute_with_args("select extract (epoch from (timezone('UTC',$1)::time - timezone('UTC', now()::timetz)::time))",
+									1, types, values, nulls, true, 0);
+	else
+		ret = SPI_execute_with_args("select extract (epoch from ($1 - now()))",
+									1, types, values, nulls, true, 0);
+
+	Assert(ret >= 0);
+	result = SPI_getbinval(SPI_tuptable->vals[0],
+						   SPI_tuptable->tupdesc,
+						   1, &isnull);
+
+	Assert(!isnull);
+	val = DatumGetFloat8(result);
+
+	elog(INFO, "time: %f", val);
+
+	SPI_finish();
+	return val;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	ListCell   *events;
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	float8		delay = 0;
+	float8		final_delay = 0;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	XLogRecPtr	final_lsn = InvalidXLogRecPtr;
+	bool		has_lsn = false;
+	bool		wait_forever = true;
+	int			res = 0;
+
+	if (stmt->strategy == WAIT_FOR_ANY)
+	{
+		/* Prepare to find minimum LSN and delay */
+		final_delay = DBL_MAX;
+		final_lsn = PG_UINT64_MAX;
+	}
+
+	/* Extract options from the statement node tree */
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+
+		/* LSN to wait for */
+		if (event->lsn)
+		{
+			has_lsn = true;
+			lsn = DatumGetLSN(
+						DirectFunctionCall1(pg_lsn_in,
+							CStringGetDatum(event->lsn)));
+
+			/*
+			 * When waiting for ALL, select max LSN to wait for.
+			 * When waiting for ANY, select min LSN to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_lsn <= lsn) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_lsn > lsn))
+			{
+				final_lsn = lsn;
+			}
+		}
+
+		/* Time delay to wait for */
+		if (event->time || event->delay)
+		{
+			wait_forever = false;
+
+			if (event->delay)
+				delay = event->delay / 1000.0;
+
+			if (event->time)
+			{
+				Const *time = (Const *) event->time;
+				delay = WaitTimeResolve(time);
+			}
+
+			/*
+			 * When waiting for ALL, select max delay to wait for.
+			 * When waiting for ANY, select min delay to wait for.
+			 */
+			if ((stmt->strategy == WAIT_FOR_ALL && final_delay <= delay) ||
+				(stmt->strategy == WAIT_FOR_ANY && final_delay > delay))
+			{
+				final_delay = delay;
+			}
+		}
+	}
+	if (wait_forever)
+		final_delay = 0;
+	if (!has_lsn)
+		final_lsn = InvalidXLogRecPtr;
+
+	res = WaitUtility(final_lsn, final_delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c
index 2255314c51..18be92ddbf 100644
--- a/src/backend/parser/analyze.c
+++ b/src/backend/parser/analyze.c
@@ -83,6 +83,7 @@ static Query *transformCreateTableAsStmt(ParseState *pstate,
 										 CreateTableAsStmt *stmt);
 static Query *transformCallStmt(ParseState *pstate,
 								CallStmt *stmt);
+static void transformWaitForStmt(ParseState *pstate, WaitStmt *stmt);
 static void transformLockingClause(ParseState *pstate, Query *qry,
 								   LockingClause *lc, bool pushedDown);
 #ifdef RAW_EXPRESSION_COVERAGE_TEST
@@ -403,7 +404,21 @@ transformStmt(ParseState *pstate, Node *parseTree)
 			result = transformCallStmt(pstate,
 									   (CallStmt *) parseTree);
 			break;
-
+		case T_WaitStmt:
+			transformWaitForStmt(pstate, (WaitStmt *) parseTree);
+			result = makeNode(Query);
+			result->commandType = CMD_UTILITY;
+			result->utilityStmt = (Node *) parseTree;
+			break;
+		case T_TransactionStmt:
+			{
+				TransactionStmt *stmt = (TransactionStmt *) parseTree;
+				if ((stmt->kind == TRANS_STMT_BEGIN ||
+						stmt->kind == TRANS_STMT_START) && stmt->wait)
+					transformWaitForStmt(pstate, (WaitStmt *) stmt->wait);
+			}
+			/* no break here - we want to fall through to the default */
+			/* FALLTHROUGH */
 		default:
 
 			/*
@@ -3560,6 +3575,23 @@ applyLockingClause(Query *qry, Index rtindex,
 	qry->rowMarks = lappend(qry->rowMarks, rc);
 }
 
+/*
+ * transformWaitForStmt -
+ *	transform the WAIT FOR clause of the BEGIN statement
+ *	transform the WAIT FOR statement (TODO: remove this line if we don't keep it)
+ */
+static void
+transformWaitForStmt(ParseState *pstate, WaitStmt *stmt)
+{
+	ListCell   *events;
+
+	foreach(events, stmt->events)
+	{
+		WaitStmt   *event = (WaitStmt *) lfirst(events);
+		event->time = transformExpr(pstate, event->time, EXPR_KIND_OTHER);
+	}
+}
+
 /*
  * Coverage testing for raw_expression_tree_walker().
  *
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 130f7fc7c3..dfdf6d2d78 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -537,7 +537,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <node>	case_expr case_arg when_clause case_default
 %type <list>	when_clause_list
 %type <node>	opt_search_clause opt_cycle_clause
-%type <ival>	sub_type opt_materialized
+%type <ival>	sub_type wait_strategy opt_materialized
 %type <node>	NumericOnly
 %type <list>	NumericOnly_list
 %type <alias>	alias_clause opt_alias_clause opt_alias_clause_for_join_using
@@ -645,6 +645,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <list>		wait_list
+%type <node>		WaitEvent wait_for
 
 %type <node>	json_format_clause
 				json_format_clause_opt
@@ -730,7 +732,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 	LABEL LANGUAGE LARGE_P LAST_P LATERAL_P
 	LEADING LEAKPROOF LEAST LEFT LEVEL LIKE LIMIT LISTEN LOAD LOCAL
-	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED
+	LOCALTIME LOCALTIMESTAMP LOCATION LOCK_P LOCKED LOGGED LSN
 
 	MAPPING MATCH MATCHED MATERIALIZED MAXVALUE MERGE METHOD
 	MINUTE_P MINVALUE MODE MONTH_P MOVE
@@ -764,7 +766,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	SUBSCRIPTION SUBSTRING SUPPORT SYMMETRIC SYSID SYSTEM_P SYSTEM_USER
 
 	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN
-	TIES TIME TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
+	TIES TIME TIMEOUT TIMESTAMP TO TRAILING TRANSACTION TRANSFORM
 	TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
@@ -774,7 +776,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
 	VERBOSE VERSION_P VIEW VIEWS VOLATILE
 
-	WHEN WHERE WHITESPACE_P WINDOW WITH WITHIN WITHOUT WORK WRAPPER WRITE
+	WAIT WHEN WHERE WHITESPACE_P WINDOW
+	WITH WITHIN WITHOUT WORK WRAPPER WRITE
 
 	XML_P XMLATTRIBUTES XMLCONCAT XMLELEMENT XMLEXISTS XMLFOREST XMLNAMESPACES
 	XMLPARSE XMLPI XMLROOT XMLSERIALIZE XMLTABLE
@@ -1103,6 +1106,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10934,12 +10938,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11038,12 +11043,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15875,6 +15881,74 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				WAIT FOR <event> [, <event> ...]
+ *				event is one of:
+ *					LSN value
+ *					TIMEOUT delay
+ *					timestamp
+ *
+ *****************************************************************************/
+WaitStmt:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			;
+wait_for:
+			WAIT FOR wait_strategy wait_list
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->strategy = $3;
+					n->events = $4;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; };
+
+wait_strategy:
+			ALL					{ $$ = WAIT_FOR_ALL; }
+			| ANY				{ $$ = WAIT_FOR_ANY; }
+			| /* EMPTY */		{ $$ = WAIT_FOR_ALL; }
+		;
+
+wait_list:
+			WaitEvent					{ $$ = list_make1($1); }
+			| wait_list ',' WaitEvent	{ $$ = lappend($1, $3); }
+			| wait_list WaitEvent		{ $$ = lappend($1, $2); }
+		;
+
+WaitEvent:
+			LSN Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = 0;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| TIMEOUT Iconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = $2;
+					n->time = NULL;
+					$$ = (Node *)n;
+				}
+			| ConstDatetime Sconst
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = NULL;
+					n->delay = 0;
+					n->time = makeStringConstCast($2, @2, $1);
+					$$ = (Node *)n;
+				}
+			;
+
 
 /*
  * Aggregate decoration clauses
@@ -17279,6 +17353,7 @@ unreserved_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -17411,6 +17486,7 @@ unreserved_keyword:
 			| TEMPORARY
 			| TEXT_P
 			| TIES
+			| TIMEOUT
 			| TRANSACTION
 			| TRANSFORM
 			| TRIGGER
@@ -17437,6 +17513,7 @@ unreserved_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHITESPACE_P
 			| WITHIN
 			| WITHOUT
@@ -17869,6 +17946,7 @@ bare_label_keyword:
 			| LOCK_P
 			| LOCKED
 			| LOGGED
+			| LSN
 			| MAPPING
 			| MATCH
 			| MATCHED
@@ -18031,6 +18109,7 @@ bare_label_keyword:
 			| THEN
 			| TIES
 			| TIME
+			| TIMEOUT
 			| TIMESTAMP
 			| TRAILING
 			| TRANSACTION
@@ -18068,6 +18147,7 @@ bare_label_keyword:
 			| VIEW
 			| VIEWS
 			| VOLATILE
+			| WAIT
 			| WHEN
 			| WHITESPACE_P
 			| WORK
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..485b6b34b7 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -21,6 +21,7 @@
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
+#include "storage/procnumber.h"
 #include "utils/guc_hooks.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..e1cbc7e49c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -357,6 +359,11 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
 	InjectionPointShmemInit();
+
+	/*
+	 * Init array of Latches in shared memory for WAIT
+	 */
+	WaitShmemInit();
 }
 
 /*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 83f86a42f7..51e86d8425 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -15,6 +15,7 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
+#include <float.h>
 
 #include "access/reloptions.h"
 #include "access/twophase.h"
@@ -56,6 +57,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -265,6 +267,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -605,6 +608,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1062,6 +1070,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2840,6 +2855,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3488,6 +3507,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..ba022a5e84
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2024, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 24f5c06bb6..ba5128c116 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3523,6 +3523,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node	   *wait;			/* WAIT clause: list of events to wait for */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4074,4 +4075,26 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		WAIT FOR Statement + WAIT FOR clause of BEGIN statement
+ *		TODO: if we only pick one, remove the other
+ * ----------------------
+ */
+
+typedef enum WaitForStrategy
+{
+	WAIT_FOR_ANY = 0,
+	WAIT_FOR_ALL
+} WaitForStrategy;
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	WaitForStrategy strategy;
+	List	   *events;		/* used as a pointer to the next WAIT event */
+	char	   *lsn;		/* WAIT FOR LSN */
+	int			delay;		/* WAIT FOR TIMEOUT */
+	Node	   *time;		/* WAIT FOR TIMESTAMP or TIME */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 2331acac09..ae7b65526b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -260,6 +260,7 @@ PG_KEYWORD("location", LOCATION, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("lock", LOCK_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("locked", LOCKED, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("logged", LOGGED, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("lsn", LSN, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("mapping", MAPPING, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("match", MATCH, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("matched", MATCHED, UNRESERVED_KEYWORD, BARE_LABEL)
@@ -433,6 +434,7 @@ PG_KEYWORD("text", TEXT_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("then", THEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("ties", TIES, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("time", TIME, COL_NAME_KEYWORD, BARE_LABEL)
+PG_KEYWORD("timeout", TIMEOUT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("timestamp", TIMESTAMP, COL_NAME_KEYWORD, BARE_LABEL)
 PG_KEYWORD("to", TO, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("trailing", TRAILING, RESERVED_KEYWORD, BARE_LABEL)
@@ -473,6 +475,7 @@ PG_KEYWORD("version", VERSION_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("view", VIEW, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("views", VIEWS, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("volatile", VOLATILE, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("wait", WAIT, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("when", WHEN, RESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("where", WHERE, RESERVED_KEYWORD, AS_LABEL)
 PG_KEYWORD("whitespace", WHITESPACE_P, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd9..295cd6ff3a 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "WAIT FOR", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index c67249500e..bad8b38c87 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -50,6 +50,8 @@ tests += {
       't/039_end_of_wal.pl',
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
+      't/042_begin_wait.pl',
+      't/043_wait.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/042_begin_wait.pl b/src/test/recovery/t/042_begin_wait.pl
new file mode 100644
index 0000000000..d56ac8399a
--- /dev/null
+++ b/src/test/recovery/t/042_begin_wait.pl
@@ -0,0 +1,147 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN WAIT FOR LSN '$lsn1'");
+$node_standby->safe_psql('postgres', "COMMIT");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_ge('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq "t", "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "WAIT FOR" command
+#===============================================================================
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ WAIT FOR LSN '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"BEGIN WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_ge('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq "t",
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/043_wait.pl b/src/test/recovery/t/043_wait.pl
new file mode 100644
index 0000000000..80f3419be0
--- /dev/null
+++ b/src/test/recovery/t/043_wait.pl
@@ -0,0 +1,145 @@
+# Checks WAIT FOR
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that WAIT FOR LSN works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that WAIT is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "WAIT FOR LSN '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_ge('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq "t", "standby reached the same LSN as primary after WAIT");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+# While we're at it, also make sure that the syntax with commas works fine and
+# that by default we use WAIT FOR ALL strategy, which means waiting for max time
+$node_standby->safe_psql('postgres',
+	"WAIT FOR TIMEOUT $one_second, TIMESTAMP '$current_time'");
+my $time_waited = (time() - $start_time) * 1000; # convert to milliseconds
+ok($time_waited >= $one_second, "WAIT FOR TIMEOUT waits for enough time");
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"WAIT FOR LSN '$lsn2' TIMEOUT 1");
+ok($reached_lsn eq "f", "WAIT doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, wait for LSN which ensures a max value of 40.
+# Inside the transaction, wait for LSN that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	WAIT FOR LSN '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	WAIT FOR LSN '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "WAIT FOR LSN works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after WAIT.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to WAIT");
+
+
+
+# Get multiple LSNs for testing WAIT FOR ANY / WAIT FOR ALL
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(51, 60))");
+my $lsn5 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 70000))");
+my $lsn6 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(61, 800000))");
+my $lsn7 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Check that WAIT FOR ANY works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ANY LSN '$lsn5' LSN '$lsn6' LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn5'::pg_lsn)");
+ok($compare_lsns ge 0,
+	"WAIT FOR ANY makes us reach at least the minimum LSN from the list");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_cmp('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+# TODO: Could this somehow fail due to the machine being very fast at applying LSN?
+ok($compare_lsns lt 0,
+	"WAIT FOR ANY didn't make us reach the maximum LSN from the list");
+
+# Check that WAIT FOR ALL works fine
+$node_standby->safe_psql('postgres',
+	"WAIT FOR ALL LSN '$lsn5', LSN '$lsn6', LSN '$lsn7'");
+$lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+$compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_ge('$lsn_standby'::pg_lsn, '$lsn7'::pg_lsn)");
+ok($compare_lsns eq "t",
+	"WAIT FOR ALL makes us reach the maximum LSN from the list");
+
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
wait_after_within_v6.patchtext/x-diff; name=wait_after_within_v6.patchDownload
diff --git a/doc/src/sgml/ref/after.sgml b/doc/src/sgml/ref/after.sgml
new file mode 100644
index 0000000000..def0ec6be3
--- /dev/null
+++ b/doc/src/sgml/ref/after.sgml
@@ -0,0 +1,118 @@
+<!--
+doc/src/sgml/ref/after.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-wait">
+ <indexterm zone="sql-wait">
+  <primary>AFTER</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>AFTER</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>AFTER</refname>
+  <refpurpose>AFTER the target <acronym>LSN</acronym> to be replayed and for specified time to timeout</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+AFTER <replaceable class="parameter">LSN</replaceable> [WITHIN timeout_in_milliseconds]
+
+AFTER LSN '<replaceable class="parameter">lsn_number</replaceable>'
+AFTER LSN '<replaceable class="parameter">lsn_number</replaceable>' WITHIN <replaceable class="parameter">wait_timeout</replaceable>
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>AFTER</command> provides a simple interprocess communication
+   mechanism to wait for the target log sequence number 
+   (<acronym>LSN</acronym>) on standby in <productname>PostgreSQL</productname>
+   databases with master-standby asynchronous replication. <command>AFTER</command>
+   command waits for the specified <acronym>LSN</acronym> to be replayed.
+   If no timeout was specified, wait time is unlimited.
+   Waiting can be interrupted using <literal>Ctrl+C</literal>, or
+   by shutting down the <literal>postgres</literal> server.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">LSN</replaceable></term>
+    <listitem>
+     <para>
+      Specify the target log sequence number to wait for.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>TIMEOUT <replaceable class="parameter">wait_timeout</replaceable></term>
+    <listitem>
+     <para>
+      Limit the time interval to AFTER the LSN to be replayed.
+      The specified <replaceable>wait_timeout</replaceable> must be an integer
+      and is measured in milliseconds.
+     </para>
+    </listitem>
+   </varlistentry>
+
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Run <literal>AFTER</literal> from <application>psql</application>,
+   limiting wait time to 10000 milliseconds:
+
+<screen>
+AFTER '0/3F07A6B1' WITHIN 10000;
+NOTICE:  LSN is not reached. Try to increase wait time.
+LSN reached
+-------------
+ f
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Wait until the specified <acronym>LSN</acronym> is replayed:
+<screen>
+AFTER '0/3F07A611';
+LSN reached
+-------------
+ t
+(1 row)
+</screen>
+  </para>
+
+  <para>
+   Limit <acronym>LSN</acronym> wait time to 500000 milliseconds,
+   and then cancel the command if <acronym>LSN</acronym> was not reached:
+<screen>
+AFTER LSN '0/3F0FF791' WITHIN 500000;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to increase wait time.
+ERROR:  canceling statement due to user request
+ LSN reached
+-------------
+ f
+(1 row)
+</screen>
+</para>
+</refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..ff2e6e0f3f 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -6,6 +6,7 @@ Complete list of usable sgml source files in this directory.
 
 <!-- SQL commands -->
 <!ENTITY abort              SYSTEM "abort.sgml">
+<!ENTITY abort              SYSTEM "after.sgml">
 <!ENTITY alterAggregate     SYSTEM "alter_aggregate.sgml">
 <!ENTITY alterCollation     SYSTEM "alter_collation.sgml">
 <!ENTITY alterConversion    SYSTEM "alter_conversion.sgml">
@@ -188,6 +189,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY update             SYSTEM "update.sgml">
 <!ENTITY vacuum             SYSTEM "vacuum.sgml">
 <!ENTITY values             SYSTEM "values.sgml">
+<!ENTITY wait               SYSTEM "wait.sgml">
 
 <!-- applications and utilities -->
 <!ENTITY clusterdb          SYSTEM "clusterdb.sgml">
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..a2794763b1 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index aa94f6adf6..7c94cf9de1 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -34,6 +34,7 @@
   </partintro>
 
    &abort;
+   &after;
    &alterAggregate;
    &alterCollation;
    &alterConversion;
@@ -216,6 +217,7 @@
    &update;
    &vacuum;
    &values;
+   &wait;
 
  </reference>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 853b540945..736124208f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1825,6 +1826,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			* If we replayed an LSN that someone was waiting for,
+			* set latches in shared memory array to notify the waiter.
+			*/
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWait())
+			{
+				 WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..3f06dc5341 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 0000000000..6f6749118d
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,298 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2024, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void AddEvent(XLogRecPtr lsn_to_wait);
+static void DeleteEvent(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64	min_lsn;
+	slock_t		mutex;
+	XLogRecPtr	waited_lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static volatile WaitState *state;
+
+/* Add the event of the current backend to the shared memory array */
+static void
+AddEvent(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyProcNumber)
+		state->backend_maxid = MyProcNumber;
+
+	state->waited_lsn[MyProcNumber] = lsn_to_wait;
+
+	if (lsn_to_wait < pg_atomic_read_u64(&state->min_lsn))
+		pg_atomic_write_u64(&state->min_lsn,lsn_to_wait);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete event of the current backend from the shared memory array.
+ *
+ * TODO: Consider state cleanup on backend failure.
+ * Check:
+ * 1) nomal|smart|fast|immediate stop
+ * 2) SIGKILL and SIGTERM
+ */
+static void
+DeleteEvent(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete = state->waited_lsn[MyProcNumber];
+
+	state->waited_lsn[MyProcNumber] = InvalidXLogRecPtr;
+
+	SpinLockAcquire(&state->mutex);
+
+	/* If we need to choose the next min_lsn, update state->min_lsn */
+	if (pg_atomic_read_u64(&state->min_lsn) == lsn_to_delete)
+	{
+		pg_atomic_write_u64(&state->min_lsn,PG_UINT64_MAX);
+		for (i = 1; i <= state->backend_maxid; i++)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr &&
+				state->waited_lsn[i] < pg_atomic_read_u64(&state->min_lsn))
+				pg_atomic_write_u64(&state->min_lsn,state->waited_lsn[i]);
+	}
+
+	if (state->backend_maxid == MyProcNumber)
+		for (i = (MyProcNumber); i >= 1; i--)
+			if (state->waited_lsn[i] != InvalidXLogRecPtr)
+			{
+				state->backend_maxid = i;
+				break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, waited_lsn);
+	size = add_size(size, mul_size(MaxBackends, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/* Init array of events in shared memory */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	state = (WaitState *) ShmemInitStruct("pg_wait_lsn",
+										  WaitShmemSize(),
+										  &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+
+		for (i = 0; i < MaxBackends; i++)
+			state->waited_lsn[i] = InvalidXLogRecPtr;
+
+		state->backend_maxid = 0;
+		pg_atomic_init_u64(&state->min_lsn,PG_UINT64_MAX);
+	}
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitSetLatch(XLogRecPtr cur_lsn)
+{
+	uint32		i;
+	int 		backend_maxid;
+	PGPROC	   *backend;
+
+	SpinLockAcquire(&state->mutex);
+	backend_maxid = state->backend_maxid;
+	SpinLockRelease(&state->mutex);
+
+	for (i = 1; i <= backend_maxid; i++)
+	{
+		backend = ProcNumberGetProc(i);
+		if (state->waited_lsn[i] != 0)
+		{
+			if (backend && state->waited_lsn[i] <= cur_lsn)
+				SetLatch(&backend->procLatch);
+		}
+	}
+}
+
+/* Get minimal LSN that will be next */
+XLogRecPtr
+GetMinWait(void)
+{
+	return state ? pg_atomic_read_u64(&state->min_lsn) : PG_UINT64_MAX;
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed,
+ * postmaster dies or timeout happens.
+ */
+int
+WaitUtility(XLogRecPtr lsn, const float8 secs)
+{
+	XLogRecPtr	cur_lsn = GetXLogReplayRecPtr(NULL);
+	int			latch_events;
+	float8		endtime;
+	uint32		res = 0;
+
+	if (!RecoveryInProgress())
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("Work only in standby mode")));
+		return false;
+	}
+
+#define GetNowFloat()	((float8) GetCurrentTimestamp() / 1000000.0)
+	endtime = GetNowFloat() + secs;
+
+	latch_events = WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+	if (lsn != InvalidXLogRecPtr)
+	{
+		/* Just check if we reached */
+		if (lsn < cur_lsn || secs < 0)
+			return (lsn < cur_lsn);
+
+		latch_events |= WL_LATCH_SET;
+		AddEvent(lsn);
+	}
+	else if (!secs)
+		return 1;
+
+	for (;;)
+	{
+		int			rc;
+		float8		delay = 0;
+		long		delay_ms;
+
+		/* If LSN has been replayed */
+		if (lsn && lsn <= cur_lsn)
+			break;
+
+		if (secs > 0)
+			delay = endtime - GetNowFloat();
+		else if (secs == 0)
+			/*
+			* If we wait forever, then 1 minute timeout to check
+			* for Interupts.
+			*/
+			delay = 60;
+
+		if (delay > 0.0)
+			delay_ms = (long) ceil(delay * 1000.0);
+		else
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS,
+		 * then delete the current event from array.
+		 */
+		if (InterruptPending)
+		{
+			if (lsn != InvalidXLogRecPtr)
+				DeleteEvent();
+			ProcessInterrupts();
+		}
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		if (lsn && rc & WL_LATCH_SET)
+			cur_lsn = GetXLogReplayRecPtr(NULL);
+	}
+
+	if (lsn != InvalidXLogRecPtr)
+		DeleteEvent();
+
+	if (lsn != InvalidXLogRecPtr && lsn > cur_lsn)
+		elog(NOTICE,"LSN is not reached. Try to increase wait time.");
+	else
+		res = 1;
+
+	return res;
+}
+
+/* Implementation of WAIT FOR */
+int
+WaitMain(WaitStmt *stmt, DestReceiver *dest)
+{
+	TupleDesc	tupdesc;
+	TupOutputState *tstate;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
+	int		res = 0;
+
+	lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+												 CStringGetDatum(stmt->lsn)));
+	res = WaitUtility(lsn, stmt->delay);
+
+	/* Need a tuple descriptor representing a single TEXT column */
+	tupdesc = CreateTemplateTupleDesc(1);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "LSN reached", TEXTOID, -1, 0);
+	/* Prepare for projection of tuples */
+	tstate = begin_tup_output_tupdesc(dest, tupdesc, &TTSOpsMinimalTuple);
+
+	/* Send it */
+	do_text_output_oneline(tstate, res?"t":"f");
+	end_tup_output(tstate);
+	return res;
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 130f7fc7c3..37c425dbe4 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -312,7 +312,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
 		VariableResetStmt VariableSetStmt VariableShowStmt
-		ViewStmt CheckPointStmt CreateConversionStmt
+		ViewStmt WaitStmt CheckPointStmt CreateConversionStmt
 		DeallocateStmt PrepareStmt ExecuteStmt
 		DropOwnedStmt ReassignOwnedStmt
 		AlterTSConfigurationStmt AlterTSDictionaryStmt
@@ -645,6 +645,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <partboundspec> PartitionBoundSpec
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
+%type <ival>		wait_time
+%type <node>		wait_for
 
 %type <node>	json_format_clause
 				json_format_clause_opt
@@ -1103,6 +1105,7 @@ stmt:
 			| VariableSetStmt
 			| VariableShowStmt
 			| ViewStmt
+			| WaitStmt
 			| /*EMPTY*/
 				{ $$ = NULL; }
 		;
@@ -10934,12 +10937,13 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
-			| START TRANSACTION transaction_mode_list_or_empty
+			| START TRANSACTION transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_START;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -11038,12 +11042,13 @@ TransactionStmt:
 		;
 
 TransactionStmtLegacy:
-			BEGIN_P opt_transaction transaction_mode_list_or_empty
+			BEGIN_P opt_transaction transaction_mode_list_or_empty wait_for
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
 
 					n->kind = TRANS_STMT_BEGIN;
 					n->options = $3;
+					n->wait = $4;
 					n->location = -1;
 					$$ = (Node *) n;
 				}
@@ -15875,6 +15880,37 @@ xml_passing_mech:
 			| BY VALUE_P
 		;
 
+/*****************************************************************************
+ *
+ *		QUERY:
+ *				AFTER LSN_value [WITHIN delay timestamp]
+ *
+ *****************************************************************************/
+WaitStmt:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+		;
+wait_for:
+			AFTER Sconst wait_time
+				{
+					WaitStmt *n = makeNode(WaitStmt);
+					n->lsn = $2;
+					n->delay = $3;
+					$$ = (Node *)n;
+				}
+			| /* EMPTY */		{ $$ = NULL; }
+		;
+
+wait_time:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 
 /*
  * Aggregate decoration clauses
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..9be97737b5 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 83f86a42f7..c7f01c0a17 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -265,6 +266,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_LoadStmt:
 		case T_PrepareStmt:
 		case T_UnlistenStmt:
+		case T_WaitStmt:
 		case T_VariableSetStmt:
 			{
 				/*
@@ -605,6 +607,11 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 					case TRANS_STMT_START:
 						{
 							ListCell   *lc;
+							WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+
+							/* If needed to WAIT FOR something but failed */
+							if (stmt->wait && WaitMain(waitstmt, dest) == 0)
+								break;
 
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
@@ -1062,6 +1069,13 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 				break;
 			}
 
+		case T_WaitStmt:
+			{
+				WaitStmt   *stmt = (WaitStmt *) parsetree;
+				WaitMain(stmt, dest);
+				break;
+			}
+
 		default:
 			/* All other statement types have event trigger support */
 			ProcessUtilitySlow(pstate, pstmt, queryString,
@@ -2840,6 +2854,10 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_NOTIFY;
 			break;
 
+		case T_WaitStmt:
+			tag = CMDTAG_WAIT;
+			break;
+
 		case T_ListenStmt:
 			tag = CMDTAG_LISTEN;
 			break;
@@ -3488,6 +3506,10 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
+		case T_WaitStmt:
+			lev = LOGSTMT_ALL;
+			break;
+
 		case T_ListenStmt:
 			lev = LOGSTMT_ALL;
 			break;
diff --git a/src/include/commands/wait.h b/src/include/commands/wait.h
new file mode 100644
index 0000000000..ba022a5e84
--- /dev/null
+++ b/src/include/commands/wait.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * wait.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2024, Regents of PostgresPRO
+ *
+ * src/include/commands/wait.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_H
+#define WAIT_H
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern int WaitUtility(XLogRecPtr lsn, const float8 delay);
+extern Size WaitShmemSize(void);
+extern void WaitShmemInit(void);
+extern void WaitSetLatch(XLogRecPtr cur_lsn);
+extern XLogRecPtr GetMinWait(void);
+extern float8 WaitTimeResolve(Const *time);
+extern int WaitMain(WaitStmt *stmt, DestReceiver *dest);
+
+#endif   /* WAIT_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 24f5c06bb6..ed6ac96660 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3523,6 +3523,7 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	Node		*wait;			/* wait lsn clause */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
@@ -4074,4 +4075,16 @@ typedef struct DropSubscriptionStmt
 	DropBehavior behavior;		/* RESTRICT or CASCADE behavior */
 } DropSubscriptionStmt;
 
+/* ----------------------
+ *		AFTER Statement + AFTER clause of BEGIN statement
+ * ----------------------
+ */
+
+typedef struct WaitStmt
+{
+	NodeTag		type;
+	char	   *lsn;		/* LSN */
+	int			delay;		/* TIMEOUT */
+} WaitStmt;
+
 #endif							/* PARSENODES_H */
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 7fdcec6dd9..567139963a 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -217,3 +217,4 @@ PG_CMDTAG(CMDTAG_TRUNCATE_TABLE, "TRUNCATE TABLE", false, false, false)
 PG_CMDTAG(CMDTAG_UNLISTEN, "UNLISTEN", false, false, false)
 PG_CMDTAG(CMDTAG_UPDATE, "UPDATE", false, false, true)
 PG_CMDTAG(CMDTAG_VACUUM, "VACUUM", false, false, false)
+PG_CMDTAG(CMDTAG_WAIT, "AFTER", false, false, false)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index c67249500e..8dd8dc3f80 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -50,6 +50,8 @@ tests += {
       't/039_end_of_wal.pl',
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
+      't/042_begin_after.pl',
+      't/043_after.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/042_begin_after.pl b/src/test/recovery/t/042_begin_after.pl
new file mode 100644
index 0000000000..bf81cf14bc
--- /dev/null
+++ b/src/test/recovery/t/042_begin_after.pl
@@ -0,0 +1,87 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1'");
+$node_standby->safe_psql('postgres', "ROLLBACK");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_ge('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq "t", "standby reached the same LSN as primary AFTER");
+
+
+
+#===============================================================================
+# TODO: remove this test if we remove the standalone "AFTER" command
+#===============================================================================
+# We need to check that AFTER works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ AFTER '$lsn3';
+	SELECT max(a) FROM wait_test;
+	BEGIN AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/043_after.pl b/src/test/recovery/t/043_after.pl
new file mode 100644
index 0000000000..afe26d2122
--- /dev/null
+++ b/src/test/recovery/t/043_after.pl
@@ -0,0 +1,99 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 2;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_standby->safe_psql('postgres', "AFTER '$lsn1'");
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+my $lsn_standby = $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn()");
+my $compare_lsns = $node_standby->safe_psql('postgres',
+	"SELECT pg_lsn_ge('$lsn_standby'::pg_lsn, '$lsn1'::pg_lsn)");
+ok($compare_lsns eq "t", "standby reached the same LSN as primary AFTER");
+
+
+
+# Check that timeouts work on their own and let us wait for specified time (1s)
+my $current_time = $node_standby->safe_psql('postgres', "SELECT now()");
+my $one_second = 1000; # in milliseconds
+my $start_time = time();
+
+# Now, check that timeouts work as expected when waiting for LSN
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $reached_lsn = $node_standby->safe_psql('postgres',
+	"AFTER '$lsn2' WITHIN 1");
+ok($reached_lsn eq "f", "AFTER doesn't reach LSN if given too little wait time");
+
+
+
+# We need to check that WAIT works fine inside transactions. For that, let's
+# get two LSNs that will correspond to two different max values in our table.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+my $lsn3 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+my $lsn4 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+
+# Before starting transaction, AFTER which ensures a max value of 40.
+# Inside the transaction, AFTER that ensures a max value of 50.
+# Due to ISOLATION LEVEL REPEATABLE READ, we should NOT see the new max value.
+my $standby_results = $node_standby->safe_psql(
+	'postgres', qq[
+	AFTER '$lsn3';
+	BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+	SELECT max(a) FROM wait_test;
+	AFTER '$lsn4';
+	SELECT pg_last_wal_replay_lsn();
+	SELECT max(a) FROM wait_test;
+	COMMIT;
+]);
+
+# Make sure that we indeed reach primary's last LSN inside the transaction.
+# For that, check that calling pg_last_wal_replay_lsn returned that LSN.
+my $last_lsn_reached = $standby_results =~ /$lsn4/;
+ok($last_lsn_reached, "AFTER works inside a transaction");
+
+# Check that transaction doesn't break and show us the new max value after AFTER.
+# For that, make sure that the older max value is repeated twice in the results.
+my $count = () = $standby_results =~ /40/g;
+ok($count eq 2, "transaction isolation level doesn't get broken due to AFTER");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Kartyshov Ivan (#25)

On Thu, Mar 7, 2024 at 5:14 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Intro
==========
The main purpose of the feature is to achieve
read-your-writes-consistency, while using async replica for reads and
primary for writes. In that case lsn of last modification is stored
inside application. We cannot store this lsn inside database, since
reads are distributed across all replicas and primary.

Two implementations of one feature
==========
We left two different solutions. Help me please to choose the best.

1) Classic (wait_classic_v7.patch)
/messages/by-id/3cc883048264c2e9af022033925ff8db@postgrespro.ru
Synopsis
==========
advantages: multiple wait events, separate WAIT FOR statement
disadvantages: new words in grammar

WAIT FOR [ANY | ALL] event [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ WAIT FOR [ANY | ALL] event [, ...]]
event:
LSN value
TIMEOUT number_of_milliseconds
timestamp

2) After style: Kyotaro and Freund (wait_after_within_v6.patch)
/messages/by-id/d3ff2e363af60b345f82396992595a03@postgrespro.ru
Synopsis
==========
advantages: no new words in grammar
disadvantages: a little harder to understand, fewer events to wait

AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...]
BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]
START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
[ AFTER lsn_event [ WITHIN delay_milliseconds ]]

+1 for the second one not only because it avoids new words in grammar
but also sounds to convey the meaning. I think you can explain in docs
how this feature can be used basically how will one get the correct
LSN value to specify.

As suggested previously also pick one of the approaches (I would
advocate the second one) and keep an option for the second one by
mentioning it in the commit message. I hope to see more
reviews/discussions or usage like how will users get the LSN value to
be specified on the core logic of the feature at this stage. IF
possible, state, how real-world applications could leverage this
feature.

--
With Regards,
Amit Kapila.

#27Alexander Korotkov
aekorotkov@gmail.com
In reply to: Amit Kapila (#26)
1 attachment(s)

Hi!

I've decided to put my hands on this patch.

On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

+1 for the second one not only because it avoids new words in grammar
but also sounds to convey the meaning. I think you can explain in docs
how this feature can be used basically how will one get the correct
LSN value to specify.

I picked the second option and left only the AFTER clause for the
BEGIN statement. I think this should be enough for the beginning.

As suggested previously also pick one of the approaches (I would
advocate the second one) and keep an option for the second one by
mentioning it in the commit message. I hope to see more
reviews/discussions or usage like how will users get the LSN value to
be specified on the core logic of the feature at this stage. IF
possible, state, how real-world applications could leverage this
feature.

I've added a paragraph to the docs about the usage. After you made
some changes on primary, you run pg_current_wal_insert_lsn(). Then
connect to replica and run 'BEGIN AFTER lsn' with the just obtained
LSN. Now you're guaranteed to see the changes made to the primary.

Also, I've significantly reworked other aspects of the patch. The
most significant changes are:
1) Waiters are now stored in the array sorted by LSN. This saves us
from scanning of wholeper-backend array.
2) Waiters are removed from the array immediately once their LSNs are
replayed. Otherwise, the WAL replayer will keep scanning the shared
memory array till waiters wake up.
3) To clean up after errors, we now call WaitLSNCleanup() on backend
shmem exit. I think this is preferable over the previous approach to
remove from the queue before ProcessInterrupts().
4) There is now condition to recheck if LSN is replayed after adding
to the shared memory array. This should save from the race
conditions.
5) I've renamed too generic names for functions and files.

------
Regards,
Alexander Korotkov

Attachments:

v8-0001-Implement-AFTER-clause-for-BEGIN-command.patchapplication/octet-stream; name=v8-0001-Implement-AFTER-clause-for-BEGIN-command.patchDownload
From 74393fe7b38c519c99a405b666c238880dd241a1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 5 Mar 2024 13:35:03 +0200
Subject: [PATCH v8] Implement AFTER clause for BEGIN command

The new clause is to be used on standby and specifies waiting for the specific
WAL location to be replayed before starting the transaction.   This option is
useful when the user makes some data changes on primary and needs a guarantee
to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
---
 doc/src/sgml/ref/begin.sgml               |  43 +++-
 doc/src/sgml/ref/start_transaction.sgml   |   5 +-
 src/backend/access/transam/xlogrecovery.c |   7 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/waitlsn.c            | 297 ++++++++++++++++++++++
 src/backend/parser/gram.y                 |  32 +++
 src/backend/storage/ipc/ipci.c            |   7 +
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/utility.c                |  12 +
 src/include/commands/waitlsn.h            |  25 ++
 src/include/nodes/parsenodes.h            |   2 +
 src/test/recovery/meson.build             |   1 +
 src/test/recovery/t/042_begin_after.pl    |  55 ++++
 src/tools/pgindent/typedefs.list          |   2 +
 15 files changed, 495 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/042_begin_after.pl

diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b0214874..363ed23c668 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN <replaceable class="parameter">number_of_milliseconds</replaceable> ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -78,6 +81,35 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term><literal>AFTER</literal> <replaceable class="parameter">lsn_value</replaceable></term>
+    <listitem>
+     <para>
+       <command>AFTER</command> clause is used on standby in
+       <link linkend="streaming-replication">physical streaming replication</link>
+       and specifies waiting for the specific WAL location (<acronym>LSN</acronym>)
+       to be replayed before starting the transaction.
+     </para>
+     <para>
+       This option is useful when the user makes some data changes on primary
+       and needs a guarantee to see these changes on standby.  The LSN to wait
+       could be obtained on the primary using
+       <link linkend="functions-admin-backup"><function>pg_current_wal_insert_lsn</function></link>
+       function after committing the relevant changes.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITHIN</literal> <replaceable class="parameter">number_of_milliseconds</replaceable></term>
+    <listitem>
+     <para>
+       Provides the timeout for <command>AFTER</command> clause.  Especially
+       helpful to prevent freezing on streaming replication connection failures.
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>
@@ -123,6 +155,15 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
 <programlisting>
 BEGIN;
 </programlisting></para>
+
+  <para>
+   To begin a transaction block after replaying the given <acronym>LSN</acronym>.
+   The command will be cancelled if the given <acronym>LSN</acronym> is not
+   reached within timeout of one second.
+<programlisting>
+BEGIN AFTER '0/3F0FF791' WITHIN 1000;
+</programlisting></para>
+
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e3456..46a3bcf1a80 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..c211230f6b5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,12 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..13030ca3a27
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,297 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn_to_wait);
+static void deleteLSNWaiter(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+static WaitLSNState *state = NULL;
+static volatile bool haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	state = (WaitLSNState *) ShmemInitStruct("pg_wait_lsn",
+											 WaitLSNShmemSize(),
+											 &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+		state->numWaitedProcs = 0;
+		pg_atomic_init_u64(&state->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&state->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		if (state->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = state->procInfos[i];
+			state->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	state->procInfos[i] = cur;
+	state->numWaitedProcs++;
+
+	pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&state->mutex);
+
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		if (state->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < state->numWaitedProcs - 1)
+		{
+			state->procInfos[i] = state->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&state->mutex);
+		return;
+	}
+	state->numWaitedProcs--;
+
+	if (state->numWaitedProcs != 0)
+		pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&state->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&state->mutex);
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	if (!state)
+		return;
+
+	if (pg_atomic_read_u64(&state->minLSN) > curLSN)
+		return;
+
+	SpinLockAcquire(&state->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (state->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(state->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise
+	 * shmem array items will be here will corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < state->numWaitedProcs - numWakeUpProcs; i++)
+		state->procInfos[i] = state->procInfos[i + numWakeUpProcs];
+	state->numWaitedProcs -= numWakeUpProcs;
+
+	if (state->numWaitedProcs != 0)
+		pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&state->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed, postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, int millisecs)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + millisecs * 1000;
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (millisecs > 0)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+
+			/*
+			 * If we wait forever, then 1 minute timeout to check for
+			 * Interupts.
+			 */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c6e2f679fd5..be97f5a8700 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -647,6 +647,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
 
+%type <ival>		after_lsn_timeout
+
 %type <node>	json_format_clause
 				json_format_clause_opt
 				json_value_expr
@@ -3193,6 +3195,14 @@ hash_partbound:
 			}
 		;
 
+/*
+ * WITHIN timeout optional clause for BEGIN AFTER lsn
+ */
+after_lsn_timeout:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 /*****************************************************************************
  *
  *	ALTER TYPE
@@ -10949,6 +10959,17 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| START TRANSACTION transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_START;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| COMMIT opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
@@ -11053,6 +11074,17 @@ TransactionStmtLegacy:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| BEGIN_P opt_transaction transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_BEGIN;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| END_P opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f3e20038f42..cebcf79667e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 83f86a42f79..bf9a685a9ae 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -63,8 +64,10 @@
 #include "storage/fd.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
+#include "utils/fmgrprotos.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -606,6 +609,15 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 						{
 							ListCell   *lc;
 
+							if (stmt->afterLsn)
+							{
+								XLogRecPtr	lsn = InvalidXLogRecPtr;
+
+								lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+																	  CStringGetDatum(stmt->afterLsn)));
+								WaitForLSN(lsn, stmt->afterLsnTimeout);
+							}
+
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
 							{
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..94db0c191e8
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern void WaitForLSN(XLogRecPtr lsn, int millisecs);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr cur_lsn);
+extern void WaitLSNCleanup(void);
+
+#endif   /* WAIT_LSN_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 2380821600a..82fb548ce3f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3526,6 +3526,8 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	char	   *afterLsn;		/* target LSN to wait */
+	int			afterLsnTimeout; /* LSN waiting timeout */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index c67249500e6..676f879166f 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -50,6 +50,7 @@ tests += {
       't/039_end_of_wal.pl',
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
+      't/042_begin_after.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/042_begin_after.pl b/src/test/recovery/t/042_begin_after.pl
new file mode 100644
index 00000000000..aadbaf229ee
--- /dev/null
+++ b/src/test/recovery/t/042_begin_after.pl
@@ -0,0 +1,55 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay        = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql('postgres', qq[
+	BEGIN AFTER '${lsn1}';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok( $output eq 0, "standby reached the same LSN as primary AFTER");
+
+my $lsn2 = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1' WITHIN 1");
+$node_standby->psql('postgres', "BEGIN AFTER '$lsn2' WITHIN 1",
+					stderr => \$stderr);
+ok($stderr =~ /canceling waiting for LSN due to timeout/, "get timeout on waiting for unreachable LSN");
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d3a7f75b080..7bf3aaa9de5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3033,6 +3033,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#28Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#27)
1 attachment(s)

On Mon, Mar 11, 2024 at 12:44 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:

I've decided to put my hands on this patch.

On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

+1 for the second one not only because it avoids new words in grammar
but also sounds to convey the meaning. I think you can explain in docs
how this feature can be used basically how will one get the correct
LSN value to specify.

I picked the second option and left only the AFTER clause for the
BEGIN statement. I think this should be enough for the beginning.

As suggested previously also pick one of the approaches (I would
advocate the second one) and keep an option for the second one by
mentioning it in the commit message. I hope to see more
reviews/discussions or usage like how will users get the LSN value to
be specified on the core logic of the feature at this stage. IF
possible, state, how real-world applications could leverage this
feature.

I've added a paragraph to the docs about the usage. After you made
some changes on primary, you run pg_current_wal_insert_lsn(). Then
connect to replica and run 'BEGIN AFTER lsn' with the just obtained
LSN. Now you're guaranteed to see the changes made to the primary.

Also, I've significantly reworked other aspects of the patch. The
most significant changes are:
1) Waiters are now stored in the array sorted by LSN. This saves us
from scanning of wholeper-backend array.
2) Waiters are removed from the array immediately once their LSNs are
replayed. Otherwise, the WAL replayer will keep scanning the shared
memory array till waiters wake up.
3) To clean up after errors, we now call WaitLSNCleanup() on backend
shmem exit. I think this is preferable over the previous approach to
remove from the queue before ProcessInterrupts().
4) There is now condition to recheck if LSN is replayed after adding
to the shared memory array. This should save from the race
conditions.
5) I've renamed too generic names for functions and files.

I went through this patch another time, and made some minor
adjustments. Now it looks good, I'm going to push it if no
objections.

------
Regards,
Alexander Korotkov

Attachments:

v9-0001-Implement-AFTER-clause-for-BEGIN-command.patchapplication/octet-stream; name=v9-0001-Implement-AFTER-clause-for-BEGIN-command.patchDownload
From f9d41b7158509d6cab21276f220dbebd85ca67e1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 5 Mar 2024 13:35:03 +0200
Subject: [PATCH v9] Implement AFTER clause for BEGIN command

The new clause is to be used on standby and specifies waiting for the specific
WAL location to be replayed before starting the transaction.   This option is
useful when the user makes some data changes on primary and needs a guarantee
to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
---
 doc/src/sgml/ref/begin.sgml               |  44 +++-
 doc/src/sgml/ref/start_transaction.sgml   |   5 +-
 src/backend/access/transam/xlogrecovery.c |   7 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/waitlsn.c            | 305 ++++++++++++++++++++++
 src/backend/parser/gram.y                 |  32 +++
 src/backend/storage/ipc/ipci.c            |   7 +
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/utility.c                |  12 +
 src/include/commands/waitlsn.h            |  25 ++
 src/include/nodes/parsenodes.h            |   2 +
 src/test/recovery/meson.build             |   1 +
 src/test/recovery/t/043_begin_after.pl    |  62 +++++
 src/tools/pgindent/typedefs.list          |   2 +
 15 files changed, 511 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_begin_after.pl

diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b0214874..4ed7002683e 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN <replaceable class="parameter">number_of_milliseconds</replaceable> ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -78,6 +81,36 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term><literal>AFTER</literal> <replaceable class="parameter">lsn_value</replaceable></term>
+    <listitem>
+     <para>
+       <command>AFTER</command> clause is used on standby in
+       <link linkend="streaming-replication">physical streaming replication</link>
+       and specifies waiting for the specific WAL location (<acronym>LSN</acronym>)
+       to be replayed before starting the transaction.
+     </para>
+     <para>
+       This option is useful when the user makes some data changes on primary
+       and needs a guarantee to see these changes on standby.  The LSN to wait
+       could be obtained on the primary using
+       <link linkend="functions-admin-backup"><function>pg_current_wal_insert_lsn</function></link>
+       function after committing the relevant changes.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITHIN</literal> <replaceable class="parameter">number_of_milliseconds</replaceable></term>
+    <listitem>
+     <para>
+       Provides the timeout for the <command>AFTER</command> clause.
+       Especially helpful to prevent freezing on streaming replication
+       connection failures.
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>
@@ -123,6 +156,15 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
 <programlisting>
 BEGIN;
 </programlisting></para>
+
+  <para>
+   To begin a transaction block after replaying the given <acronym>LSN</acronym>.
+   The command will be canceled if the given <acronym>LSN</acronym> is not
+   reached within the timeout of one second.
+<programlisting>
+BEGIN AFTER '0/3F0FF791' WITHIN 1000;
+</programlisting></para>
+
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e3456..46a3bcf1a80 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..c211230f6b5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,12 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..fd7c8f44ee7
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,305 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn_to_wait);
+static void deleteLSNWaiter(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+static WaitLSNState *state = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	state = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											 WaitLSNShmemSize(),
+											 &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+		state->numWaitedProcs = 0;
+		pg_atomic_init_u64(&state->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&state->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		if (state->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = state->procInfos[i];
+			state->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	state->procInfos[i] = cur;
+	state->numWaitedProcs++;
+
+	pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&state->mutex);
+
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		if (state->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < state->numWaitedProcs - 1)
+		{
+			state->procInfos[i] = state->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&state->mutex);
+		return;
+	}
+	state->numWaitedProcs--;
+
+	if (state->numWaitedProcs != 0)
+		pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&state->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&state->mutex);
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	if (!state)
+		return;
+
+	/*
+	 * Fast check if we reached the LSN at least one process is waiting for.
+	 * This is important because saves us from acquiring the spinlock for
+	 * every WAL record replayed.
+	 */
+	if (pg_atomic_read_u64(&state->minLSN) > curLSN)
+		return;
+
+	SpinLockAcquire(&state->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (state->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(state->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < state->numWaitedProcs - numWakeUpProcs; i++)
+		state->procInfos[i] = state->procInfos[i + numWakeUpProcs];
+	state->numWaitedProcs -= numWakeUpProcs;
+
+	if (state->numWaitedProcs != 0)
+		pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&state->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed, postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, int millisecs)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(state);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + millisecs * 1000;
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (millisecs > 0)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+
+			/* If no timeout is set then wake up in 1 minute for interupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c6e2f679fd5..be97f5a8700 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -647,6 +647,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
 
+%type <ival>		after_lsn_timeout
+
 %type <node>	json_format_clause
 				json_format_clause_opt
 				json_value_expr
@@ -3193,6 +3195,14 @@ hash_partbound:
 			}
 		;
 
+/*
+ * WITHIN timeout optional clause for BEGIN AFTER lsn
+ */
+after_lsn_timeout:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 /*****************************************************************************
  *
  *	ALTER TYPE
@@ -10949,6 +10959,17 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| START TRANSACTION transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_START;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| COMMIT opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
@@ -11053,6 +11074,17 @@ TransactionStmtLegacy:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| BEGIN_P opt_transaction transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_BEGIN;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| END_P opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 83f86a42f79..bf9a685a9ae 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -63,8 +64,10 @@
 #include "storage/fd.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
+#include "utils/fmgrprotos.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -606,6 +609,15 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 						{
 							ListCell   *lc;
 
+							if (stmt->afterLsn)
+							{
+								XLogRecPtr	lsn = InvalidXLogRecPtr;
+
+								lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+																	  CStringGetDatum(stmt->afterLsn)));
+								WaitForLSN(lsn, stmt->afterLsnTimeout);
+							}
+
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
 							{
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..bfc90a4102d
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern void WaitForLSN(XLogRecPtr lsn, int millisecs);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr cur_lsn);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index aadaf67f574..c2077b5252e 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3527,6 +3527,8 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	char	   *afterLsn;		/* target LSN to wait */
+	int			afterLsnTimeout;	/* LSN waiting timeout */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..a1c2a0b13d2 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_begin_after.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_begin_after.pl b/src/test/recovery/t/043_begin_after.pl
new file mode 100644
index 00000000000..f7b43f5db1e
--- /dev/null
+++ b/src/test/recovery/t/043_begin_after.pl
@@ -0,0 +1,62 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN AFTER '${lsn1}';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary AFTER");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1' WITHIN 1");
+$node_standby->psql(
+	'postgres',
+	"BEGIN AFTER '$lsn2' WITHIN 1",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index aa7a25b8f8c..7a2ac178bc3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3035,6 +3035,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#29Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#28)
1 attachment(s)

On Fri, Mar 15, 2024 at 4:20 PM Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Mon, Mar 11, 2024 at 12:44 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:

I've decided to put my hands on this patch.

On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila <amit.kapila16@gmail.com>

wrote:

+1 for the second one not only because it avoids new words in grammar
but also sounds to convey the meaning. I think you can explain in docs
how this feature can be used basically how will one get the correct
LSN value to specify.

I picked the second option and left only the AFTER clause for the
BEGIN statement. I think this should be enough for the beginning.

As suggested previously also pick one of the approaches (I would
advocate the second one) and keep an option for the second one by
mentioning it in the commit message. I hope to see more
reviews/discussions or usage like how will users get the LSN value to
be specified on the core logic of the feature at this stage. IF
possible, state, how real-world applications could leverage this
feature.

I've added a paragraph to the docs about the usage. After you made
some changes on primary, you run pg_current_wal_insert_lsn(). Then
connect to replica and run 'BEGIN AFTER lsn' with the just obtained
LSN. Now you're guaranteed to see the changes made to the primary.

Also, I've significantly reworked other aspects of the patch. The
most significant changes are:
1) Waiters are now stored in the array sorted by LSN. This saves us
from scanning of wholeper-backend array.
2) Waiters are removed from the array immediately once their LSNs are
replayed. Otherwise, the WAL replayer will keep scanning the shared
memory array till waiters wake up.
3) To clean up after errors, we now call WaitLSNCleanup() on backend
shmem exit. I think this is preferable over the previous approach to
remove from the queue before ProcessInterrupts().
4) There is now condition to recheck if LSN is replayed after adding
to the shared memory array. This should save from the race
conditions.
5) I've renamed too generic names for functions and files.

I went through this patch another time, and made some minor
adjustments. Now it looks good, I'm going to push it if no
objections.

The revised patch version with cosmetic fixes proposed by Alexander Lakhin.

------
Regards,
Alexander Korotkov

Attachments:

v10-0001-Implement-AFTER-clause-for-BEGIN-command.patchapplication/octet-stream; name=v10-0001-Implement-AFTER-clause-for-BEGIN-command.patchDownload
From d623832d4c7c4d31ebc080416c79b2e2d2cfd1cc Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 5 Mar 2024 13:35:03 +0200
Subject: [PATCH v10] Implement AFTER clause for BEGIN command

The new clause is to be used on standby and specifies waiting for the specific
WAL location to be replayed before starting the transaction.   This option is
useful when the user makes some data changes on primary and needs a guarantee
to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin
---
 doc/src/sgml/ref/begin.sgml               |  44 +++-
 doc/src/sgml/ref/start_transaction.sgml   |   5 +-
 src/backend/access/transam/xlogrecovery.c |   7 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/waitlsn.c            | 305 ++++++++++++++++++++++
 src/backend/parser/gram.y                 |  32 +++
 src/backend/storage/ipc/ipci.c            |   7 +
 src/backend/storage/lmgr/proc.c           |   6 +
 src/backend/tcop/utility.c                |  12 +
 src/include/commands/waitlsn.h            |  24 ++
 src/include/nodes/parsenodes.h            |   2 +
 src/test/recovery/meson.build             |   1 +
 src/test/recovery/t/043_begin_after.pl    |  62 +++++
 src/tools/pgindent/typedefs.list          |   2 +
 15 files changed, 510 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_begin_after.pl

diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b0214874..4ed7002683e 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN <replaceable class="parameter">number_of_milliseconds</replaceable> ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -78,6 +81,36 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term><literal>AFTER</literal> <replaceable class="parameter">lsn_value</replaceable></term>
+    <listitem>
+     <para>
+       <command>AFTER</command> clause is used on standby in
+       <link linkend="streaming-replication">physical streaming replication</link>
+       and specifies waiting for the specific WAL location (<acronym>LSN</acronym>)
+       to be replayed before starting the transaction.
+     </para>
+     <para>
+       This option is useful when the user makes some data changes on primary
+       and needs a guarantee to see these changes on standby.  The LSN to wait
+       could be obtained on the primary using
+       <link linkend="functions-admin-backup"><function>pg_current_wal_insert_lsn</function></link>
+       function after committing the relevant changes.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITHIN</literal> <replaceable class="parameter">number_of_milliseconds</replaceable></term>
+    <listitem>
+     <para>
+       Provides the timeout for the <command>AFTER</command> clause.
+       Especially helpful to prevent freezing on streaming replication
+       connection failures.
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>
@@ -123,6 +156,15 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
 <programlisting>
 BEGIN;
 </programlisting></para>
+
+  <para>
+   To begin a transaction block after replaying the given <acronym>LSN</acronym>.
+   The command will be canceled if the given <acronym>LSN</acronym> is not
+   reached within the timeout of one second.
+<programlisting>
+BEGIN AFTER '0/3F0FF791' WITHIN 1000;
+</programlisting></para>
+
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e3456..46a3bcf1a80 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..c211230f6b5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,12 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..f2d641c953d
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,305 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  BEGIN AFTER ... [ WITHIN ... ] clause.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+static WaitLSNState *state = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	state = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											 WaitLSNShmemSize(),
+											 &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+		state->numWaitedProcs = 0;
+		pg_atomic_init_u64(&state->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&state->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		if (state->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = state->procInfos[i];
+			state->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	state->procInfos[i] = cur;
+	state->numWaitedProcs++;
+
+	pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&state->mutex);
+
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		if (state->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < state->numWaitedProcs - 1)
+		{
+			state->procInfos[i] = state->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&state->mutex);
+		return;
+	}
+	state->numWaitedProcs--;
+
+	if (state->numWaitedProcs != 0)
+		pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&state->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&state->mutex);
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	if (!state)
+		return;
+
+	/*
+	 * Fast check if we reached the LSN at least one process is waiting for.
+	 * This is important because saves us from acquiring the spinlock for
+	 * every WAL record replayed.
+	 */
+	if (pg_atomic_read_u64(&state->minLSN) > curLSN)
+		return;
+
+	SpinLockAcquire(&state->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (state->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(state->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < state->numWaitedProcs - numWakeUpProcs; i++)
+		state->procInfos[i] = state->procInfos[i + numWakeUpProcs];
+	state->numWaitedProcs -= numWakeUpProcs;
+
+	if (state->numWaitedProcs != 0)
+		pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&state->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, int millisecs)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(state);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + millisecs * 1000;
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (millisecs > 0)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c6e2f679fd5..be97f5a8700 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -647,6 +647,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
 
+%type <ival>		after_lsn_timeout
+
 %type <node>	json_format_clause
 				json_format_clause_opt
 				json_value_expr
@@ -3193,6 +3195,14 @@ hash_partbound:
 			}
 		;
 
+/*
+ * WITHIN timeout optional clause for BEGIN AFTER lsn
+ */
+after_lsn_timeout:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 /*****************************************************************************
  *
  *	ALTER TYPE
@@ -10949,6 +10959,17 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| START TRANSACTION transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_START;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| COMMIT opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
@@ -11053,6 +11074,17 @@ TransactionStmtLegacy:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| BEGIN_P opt_transaction transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_BEGIN;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| END_P opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 83f86a42f79..bf9a685a9ae 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -63,8 +64,10 @@
 #include "storage/fd.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
+#include "utils/fmgrprotos.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -606,6 +609,15 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 						{
 							ListCell   *lc;
 
+							if (stmt->afterLsn)
+							{
+								XLogRecPtr	lsn = InvalidXLogRecPtr;
+
+								lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+																	  CStringGetDatum(stmt->afterLsn)));
+								WaitForLSN(lsn, stmt->afterLsnTimeout);
+							}
+
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
 							{
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..97f50bc4c77
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern void WaitForLSN(XLogRecPtr lsn, int millisecs);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index aadaf67f574..c2077b5252e 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3527,6 +3527,8 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	char	   *afterLsn;		/* target LSN to wait */
+	int			afterLsnTimeout;	/* LSN waiting timeout */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..a1c2a0b13d2 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_begin_after.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_begin_after.pl b/src/test/recovery/t/043_begin_after.pl
new file mode 100644
index 00000000000..f7b43f5db1e
--- /dev/null
+++ b/src/test/recovery/t/043_begin_after.pl
@@ -0,0 +1,62 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN AFTER '${lsn1}';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary AFTER");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1' WITHIN 1");
+$node_standby->psql(
+	'postgres',
+	"BEGIN AFTER '$lsn2' WITHIN 1",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index aa7a25b8f8c..7a2ac178bc3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3035,6 +3035,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#30Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#27)
1 attachment(s)

On 2024-03-11 13:44, Alexander Korotkov wrote:

I picked the second option and left only the AFTER clause for the
BEGIN statement. I think this should be enough for the beginning.

Thank you for your rework on your patch, here I made some fixes:
0) autocomplete
1) less jumps
2) more description and add cases in doc

I think, it will be useful to have stand-alone statement.
Why you would like to see only AFTER clause for the BEGIN statement?

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v10_after_within.patchtext/x-diff; name=v10_after_within.patchDownload
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..759e46ec24 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN <replaceable class="parameter">number_of_milliseconds</replaceable> ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -78,6 +81,50 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term><literal>AFTER</literal> <replaceable class="parameter">lsn_value</replaceable></term>
+    <listitem>
+     <para>
+       <command>AFTER</command> clause is used on standby in
+       <link linkend="streaming-replication">physical streaming replication</link>
+       and specifies waiting for the specific WAL location (<acronym>LSN</acronym>)
+       to be replayed before starting the transaction.
+     </para>
+     <para>
+       This option is useful when the user makes some data changes on primary
+       and needs a guarantee to see these changes on standby.  The LSN to wait
+       could be obtained on the primary using
+       <link linkend="functions-admin-backup"><function>pg_current_wal_insert_lsn</function></link>
+       function after committing the relevant changes.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITHIN</literal> <replaceable class="parameter">number_of_milliseconds</replaceable></term>
+    <listitem>
+     <para>
+       Provides the timeout for the <command>AFTER</command> clause.
+       Especially helpful to prevent freezing on streaming replication
+       connection failures.
+     </para>
+     <para>
+       On practice we  have three variants of waiting current LSN:
+          waiting forever - (finish on PM DEATH or SIGINT)
+             BEGIN AFTER lsn;
+                wait for changes to be replayed on standby and then start transaction
+          waiting on timeout -
+             BEGIN AFTER lsn WITHIN 60000;
+                wait changes for 60 seconds and then cancel to start transaction
+                to perevent freezing on streaming replication connection failures.
+          no wait, just check -
+             BEGIN AFTER lsn WITHIN 1;
+                some time it is useful just check if changes was replayed
+       <phrase>where <replaceable class="parameter">number_of_milliseconds</replaceable> can:</phrase>
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>
@@ -123,6 +170,33 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
 <programlisting>
 BEGIN;
 </programlisting></para>
+
+  <para>
+   To begin a transaction block after replaying the given <acronym>LSN</acronym>.
+   The command will be canceled if the given <acronym>LSN</acronym> is not
+   reached within the timeout of one second.
+<programlisting>
+BEGIN AFTER '0/3F05F791' WITHIN 1000;
+BEGIN
+</programlisting></para>
+
+  <para>
+   This way we just check (without waiting) if <acronym>LSN</acronym> was replayed,
+   and if it was then start transaction.
+<programlisting>
+BEGIN AFTER '0/3F0FF791' WITHIN 1;
+ERROR:  canceling waiting for LSN due to timeout
+</programlisting></para>
+
+  <para>
+   To begin a transaction block after replaying the given <acronym>LSN</acronym>.
+   The command will be canceled only on user request or postmaster death.
+<programlisting>
+BEGIN AFTER '0/3F0FF791';
+^CCancel request sent
+ERROR:  canceling statement due to user request
+</programlisting></para>
+
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..89b997ad47 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+					(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..a6987869b0
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements WAIT FOR, which allows waiting for events such as
+ *	  time passing or LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn_to_wait);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											 WaitLSNShmemSize(),
+											 &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * On WAIT use MyLatch to wait till LSN is replayed, postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, int millisecs)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + millisecs * 1000;
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (millisecs > 0)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+
+			/* If no timeout is set then wake up in 1 minute for interupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c6e2f679fd..be97f5a870 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -647,6 +647,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
 
+%type <ival>		after_lsn_timeout
+
 %type <node>	json_format_clause
 				json_format_clause_opt
 				json_value_expr
@@ -3193,6 +3195,14 @@ hash_partbound:
 			}
 		;
 
+/*
+ * WITHIN timeout optional clause for BEGIN AFTER lsn
+ */
+after_lsn_timeout:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 /*****************************************************************************
  *
  *	ALTER TYPE
@@ -10949,6 +10959,17 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| START TRANSACTION transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_START;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| COMMIT opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
@@ -11053,6 +11074,17 @@ TransactionStmtLegacy:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| BEGIN_P opt_transaction transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_BEGIN;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| END_P opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..5aed90c935 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919d..4b830dc3c8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 83f86a42f7..bf9a685a9a 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -63,8 +64,10 @@
 #include "storage/fd.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
+#include "utils/fmgrprotos.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -606,6 +609,15 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 						{
 							ListCell   *lc;
 
+							if (stmt->afterLsn)
+							{
+								XLogRecPtr	lsn = InvalidXLogRecPtr;
+
+								lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+																	  CStringGetDatum(stmt->afterLsn)));
+								WaitForLSN(lsn, stmt->afterLsnTimeout);
+							}
+
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
 							{
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 19db069ee9..28e22cb9d0 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2717,7 +2717,7 @@ psql_completion(const char *text, int start, int end)
 
 /* BEGIN */
 	else if (Matches("BEGIN"))
-		COMPLETE_WITH("WORK", "TRANSACTION", "ISOLATION LEVEL", "READ", "DEFERRABLE", "NOT DEFERRABLE");
+		COMPLETE_WITH("WORK", "TRANSACTION", "ISOLATION LEVEL", "READ", "DEFERRABLE", "NOT DEFERRABLE", "AFTER");
 /* END, ABORT */
 	else if (Matches("END|ABORT"))
 		COMPLETE_WITH("AND", "WORK", "TRANSACTION");
@@ -4610,7 +4610,7 @@ psql_completion(const char *text, int start, int end)
 
 /* START TRANSACTION */
 	else if (Matches("START"))
-		COMPLETE_WITH("TRANSACTION");
+		COMPLETE_WITH("TRANSACTION", "AFTER");
 
 /* TABLE, but not TABLE embedded in other commands */
 	else if (Matches("TABLE"))
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..bdda60eb53
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,44 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  prototypes for commands/wait.c
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2016, Regents of PostgresPRO
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+extern void WaitForLSN(XLogRecPtr lsn, int millisecs);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr cur_lsn);
+extern void WaitLSNCleanup(void);
+
+/* Shared memory structure */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index aadaf67f57..c2077b5252 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3527,6 +3527,8 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	char	   *afterLsn;		/* target LSN to wait */
+	int			afterLsnTimeout;	/* LSN waiting timeout */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..a1c2a0b13d 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_begin_after.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_begin_after.pl b/src/test/recovery/t/043_begin_after.pl
new file mode 100644
index 0000000000..e7859623b5
--- /dev/null
+++ b/src/test/recovery/t/043_begin_after.pl
@@ -0,0 +1,61 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN AFTER '${lsn1}';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary AFTER");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1' WITHIN 1");
+$node_standby->psql(
+	'postgres',
+	"BEGIN AFTER '$lsn2' WITHIN 1",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index aa7a25b8f8..7a2ac178bc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3035,6 +3035,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
#31Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Kartyshov Ivan (#30)
1 attachment(s)

On 2024-03-15 22:59, Kartyshov Ivan wrote:

On 2024-03-11 13:44, Alexander Korotkov wrote:

I picked the second option and left only the AFTER clause for the
BEGIN statement. I think this should be enough for the beginning.

Thank you for your rework on your patch, here I made some fixes:
0) autocomplete
1) less jumps
2) more description and add cases in doc

I think, it will be useful to have stand-alone statement.
Why you would like to see only AFTER clause for the BEGIN statement?

Rebase and update patch.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v11-0001-Implement-AFTER-clause-for-BEGIN-command.patchtext/x-diff; name=v11-0001-Implement-AFTER-clause-for-BEGIN-command.patchDownload
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..759e46ec24 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN <replaceable class="parameter">number_of_milliseconds</replaceable> ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -78,6 +81,50 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term><literal>AFTER</literal> <replaceable class="parameter">lsn_value</replaceable></term>
+    <listitem>
+     <para>
+       <command>AFTER</command> clause is used on standby in
+       <link linkend="streaming-replication">physical streaming replication</link>
+       and specifies waiting for the specific WAL location (<acronym>LSN</acronym>)
+       to be replayed before starting the transaction.
+     </para>
+     <para>
+       This option is useful when the user makes some data changes on primary
+       and needs a guarantee to see these changes on standby.  The LSN to wait
+       could be obtained on the primary using
+       <link linkend="functions-admin-backup"><function>pg_current_wal_insert_lsn</function></link>
+       function after committing the relevant changes.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITHIN</literal> <replaceable class="parameter">number_of_milliseconds</replaceable></term>
+    <listitem>
+     <para>
+       Provides the timeout for the <command>AFTER</command> clause.
+       Especially helpful to prevent freezing on streaming replication
+       connection failures.
+     </para>
+     <para>
+       On practice we  have three variants of waiting current LSN:
+          waiting forever - (finish on PM DEATH or SIGINT)
+             BEGIN AFTER lsn;
+                wait for changes to be replayed on standby and then start transaction
+          waiting on timeout -
+             BEGIN AFTER lsn WITHIN 60000;
+                wait changes for 60 seconds and then cancel to start transaction
+                to perevent freezing on streaming replication connection failures.
+          no wait, just check -
+             BEGIN AFTER lsn WITHIN 1;
+                some time it is useful just check if changes was replayed
+       <phrase>where <replaceable class="parameter">number_of_milliseconds</replaceable> can:</phrase>
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>
@@ -123,6 +170,33 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
 <programlisting>
 BEGIN;
 </programlisting></para>
+
+  <para>
+   To begin a transaction block after replaying the given <acronym>LSN</acronym>.
+   The command will be canceled if the given <acronym>LSN</acronym> is not
+   reached within the timeout of one second.
+<programlisting>
+BEGIN AFTER '0/3F05F791' WITHIN 1000;
+BEGIN
+</programlisting></para>
+
+  <para>
+   This way we just check (without waiting) if <acronym>LSN</acronym> was replayed,
+   and if it was then start transaction.
+<programlisting>
+BEGIN AFTER '0/3F0FF791' WITHIN 1;
+ERROR:  canceling waiting for LSN due to timeout
+</programlisting></para>
+
+  <para>
+   To begin a transaction block after replaying the given <acronym>LSN</acronym>.
+   The command will be canceled only on user request or postmaster death.
+<programlisting>
+BEGIN AFTER '0/3F0FF791';
+^CCancel request sent
+ERROR:  canceling statement due to user request
+</programlisting></para>
+
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..89b997ad47 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+					(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..ca15afa468
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  BEGIN AFTER ... [ WITHIN ... ] clause.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											 WaitLSNShmemSize(),
+											 &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, int millisecs)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + millisecs * 1000;
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (millisecs > 0)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c6e2f679fd..be97f5a870 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -647,6 +647,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>		hash_partbound
 %type <defelt>		hash_partbound_elem
 
+%type <ival>		after_lsn_timeout
+
 %type <node>	json_format_clause
 				json_format_clause_opt
 				json_value_expr
@@ -3193,6 +3195,14 @@ hash_partbound:
 			}
 		;
 
+/*
+ * WITHIN timeout optional clause for BEGIN AFTER lsn
+ */
+after_lsn_timeout:
+			WITHIN Iconst		{ $$ = $2; }
+			| /* EMPTY */           { $$ = 0; }
+		;
+
 /*****************************************************************************
  *
  *	ALTER TYPE
@@ -10949,6 +10959,17 @@ TransactionStmt:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| START TRANSACTION transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_START;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| COMMIT opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
@@ -11053,6 +11074,17 @@ TransactionStmtLegacy:
 					n->location = -1;
 					$$ = (Node *) n;
 				}
+			| BEGIN_P opt_transaction transaction_mode_list_or_empty AFTER Sconst after_lsn_timeout
+				{
+					TransactionStmt *n = makeNode(TransactionStmt);
+
+					n->kind = TRANS_STMT_BEGIN;
+					n->options = $3;
+					n->afterLsn = $5;
+					n->afterLsnTimeout = $6;
+					n->location = -1;
+					$$ = (Node *) n;
+				}
 			| END_P opt_transaction opt_transaction_chain
 				{
 					TransactionStmt *n = makeNode(TransactionStmt);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..5aed90c935 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919d..4b830dc3c8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 83f86a42f7..bf9a685a9a 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -56,6 +56,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/view.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "parser/parse_utilcmd.h"
 #include "postmaster/bgwriter.h"
@@ -63,8 +64,10 @@
 #include "storage/fd.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
+#include "utils/fmgrprotos.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
+#include "utils/pg_lsn.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -606,6 +609,15 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 						{
 							ListCell   *lc;
 
+							if (stmt->afterLsn)
+							{
+								XLogRecPtr	lsn = InvalidXLogRecPtr;
+
+								lsn = DatumGetLSN(DirectFunctionCall1(pg_lsn_in,
+																	  CStringGetDatum(stmt->afterLsn)));
+								WaitForLSN(lsn, stmt->afterLsnTimeout);
+							}
+
 							BeginTransactionBlock();
 							foreach(lc, stmt->options)
 							{
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 19db069ee9..28e22cb9d0 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2717,7 +2717,7 @@ psql_completion(const char *text, int start, int end)
 
 /* BEGIN */
 	else if (Matches("BEGIN"))
-		COMPLETE_WITH("WORK", "TRANSACTION", "ISOLATION LEVEL", "READ", "DEFERRABLE", "NOT DEFERRABLE");
+		COMPLETE_WITH("WORK", "TRANSACTION", "ISOLATION LEVEL", "READ", "DEFERRABLE", "NOT DEFERRABLE", "AFTER");
 /* END, ABORT */
 	else if (Matches("END|ABORT"))
 		COMPLETE_WITH("AND", "WORK", "TRANSACTION");
@@ -4610,7 +4610,7 @@ psql_completion(const char *text, int start, int end)
 
 /* START TRANSACTION */
 	else if (Matches("START"))
-		COMPLETE_WITH("TRANSACTION");
+		COMPLETE_WITH("TRANSACTION", "AFTER");
 
 /* TABLE, but not TABLE embedded in other commands */
 	else if (Matches("TABLE"))
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..ff4d63d5b0
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+extern void WaitForLSN(XLogRecPtr lsn, int millisecs);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+/* Shared memory structure */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index aadaf67f57..c2077b5252 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3527,6 +3527,8 @@ typedef struct TransactionStmt
 	/* for two-phase-commit related commands */
 	char	   *gid pg_node_attr(query_jumble_ignore);
 	bool		chain;			/* AND CHAIN option */
+	char	   *afterLsn;		/* target LSN to wait */
+	int			afterLsnTimeout;	/* LSN waiting timeout */
 	/* token location, or -1 if unknown */
 	int			location pg_node_attr(query_jumble_location);
 } TransactionStmt;
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..a1c2a0b13d 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_begin_after.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_begin_after.pl b/src/test/recovery/t/043_begin_after.pl
new file mode 100644
index 0000000000..e7859623b5
--- /dev/null
+++ b/src/test/recovery/t/043_begin_after.pl
@@ -0,0 +1,61 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	BEGIN AFTER '${lsn1}';
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary AFTER");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "BEGIN AFTER '$lsn1' WITHIN 1");
+$node_standby->psql(
+	'postgres',
+	"BEGIN AFTER '$lsn2' WITHIN 1",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index aa7a25b8f8..7a2ac178bc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3035,6 +3035,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
#32Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#31)

On Fri, Mar 15, 2024 at 10:32 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

On 2024-03-15 22:59, Kartyshov Ivan wrote:

On 2024-03-11 13:44, Alexander Korotkov wrote:

I picked the second option and left only the AFTER clause for the
BEGIN statement. I think this should be enough for the beginning.

Thank you for your rework on your patch, here I made some fixes:
0) autocomplete
1) less jumps
2) more description and add cases in doc

Thank you!

I think, it will be useful to have stand-alone statement.
Why you would like to see only AFTER clause for the BEGIN statement?

Yes, stand-alone statements might be also useful. But I think that
the best way for this feature to get into the core would be to commit
the minimal version first. The BEGIN clause has minimal invasiveness
for the syntax and I believe covers most typical use-cases. Once we
figure out it's OK and have positive feedback from users, we can do
more enchantments incrementally.

Rebase and update patch.

Cool, I was just about to ask you to do this.

------
Regards,
Alexander Korotkov

#33Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Alexander Korotkov (#32)

On Sat, Mar 16, 2024 at 2:04 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Rebase and update patch.

Thanks for working on this. I took a quick look at v11 patch. Here are
some comments:

1.
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"

Please place executor/spi.h in the alphabetical order. Also, look at
all other header files and place them in the order.

2. It seems like pgindent is not happy with
src/backend/access/transam/xlogrecovery.c and
src/backend/commands/waitlsn.c. Please run it to keep BF member koel
happy post commit.

3. This patch changes, SQL explicit transaction statement syntax, is
it (this deviation) okay from SQL standard perspective?

4. I think some more checks are needed for better input validations.

4.1 With invalid LSN succeeds, shouldn't it error out? Or at least,
add a fast path/quick exit to WaitForLSN()?
BEGIN AFTER '0/0';

4.2 With an unreasonably high future LSN, BEGIN command waits
unboundedly, shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() error out?
BEGIN AFTER '0/FFFFFFFF';
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn';

4.3 With an unreasonably high wait time, BEGIN command waits
unboundedly, shouldn't we restrict the wait time to some max value,
say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

5.
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"

postgres.h must be included at the first, and then the system header
files, and then all postgres header files, just like below. See a very
recent commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=97d85be365443eb4bf84373a7468624762382059.

+#include "postgres.h"

+#include <float.h>
+#include <math.h>

+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
6.
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{

I see this patch is waking up all the waiters in the recovery path
after applying every WAL record, which IMO is a hot path. Is the
impact of this change on recovery measured, perhaps using
https://github.com/macdice/redo-bench or similar tools?

7. In continuation to comment #6, why not use Conditional Variables
instead of proc latches to sleep and wait for all the waiters in
WaitLSNSetLatches?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#34Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Korotkov (#28)

On Fri, Mar 15, 2024 at 7:50 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I went through this patch another time, and made some minor
adjustments. Now it looks good, I'm going to push it if no
objections.

I have a question related to usability, if the regular reads (say a
Select statement or reads via function/procedure) need a similar
guarantee to see the changes on standby then do they also always need
to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"? Or
in other words, shouldn't we think of something for implicit
transactions?

In general, it seems this patch has been stuck for a long time on the
decision to choose an appropriate UI (syntax), and we thought of
moving it further so that the other parts of the patch can be
reviewed/discussed. So, I feel before pushing this we should see
comments from a few (at least two) other senior members who earlier
shared their opinion on the syntax. I know we don't have much time
left but OTOH pushing such a change (where we didn't have a consensus
on syntax) without much discussion at this point of time could lead to
discussions after commit.

--
With Regards,
Amit Kapila.

#35Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Amit Kapila (#34)

On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 15, 2024 at 7:50 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I went through this patch another time, and made some minor
adjustments. Now it looks good, I'm going to push it if no
objections.

I have a question related to usability, if the regular reads (say a
Select statement or reads via function/procedure) need a similar
guarantee to see the changes on standby then do they also always need
to first do something like "BEGIN AFTER '0/3F0FF791' WITHIN 1000;"? Or
in other words, shouldn't we think of something for implicit
transactions?

+1 to have support for implicit txns. A strawman solution I can think
of is to let primary send its current insert LSN to the standby every
time it sends a bunch of WAL, and the standby waits for that LSN to be
replayed on it at the start of every implicit txn automatically.

The new BEGIN syntax requires application code changes. This led me to
think how one can achieve read-after-write consistency today in a
primary - standby set up. All the logic of this patch, that is,
waiting for the standby to pass a given primary LSN needs to be done
in the application code (or in proxy or in load balancer?). I believe
there might be someone doing this already, it's good to hear from
them.

In general, it seems this patch has been stuck for a long time on the
decision to choose an appropriate UI (syntax), and we thought of
moving it further so that the other parts of the patch can be
reviewed/discussed. So, I feel before pushing this we should see
comments from a few (at least two) other senior members who earlier
shared their opinion on the syntax. I know we don't have much time
left but OTOH pushing such a change (where we didn't have a consensus
on syntax) without much discussion at this point of time could lead to
discussions after commit.

+1 to gain consensus first on the syntax changes. With this, we might
be violating the SQL standard for explicit txn commands (I stand for
correction about the SQL standard though).

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#36Alexander Korotkov
aekorotkov@gmail.com
In reply to: Bharath Rupireddy (#35)

Hi Amit,
Hi Bharath,

On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

In general, it seems this patch has been stuck for a long time on the
decision to choose an appropriate UI (syntax), and we thought of
moving it further so that the other parts of the patch can be
reviewed/discussed. So, I feel before pushing this we should see
comments from a few (at least two) other senior members who earlier
shared their opinion on the syntax. I know we don't have much time
left but OTOH pushing such a change (where we didn't have a consensus
on syntax) without much discussion at this point of time could lead to
discussions after commit.

+1 to gain consensus first on the syntax changes. With this, we might
be violating the SQL standard for explicit txn commands (I stand for
correction about the SQL standard though).

Thank you for your feedback. Generally, I agree it's correct to get
consensus on syntax first. And yes, this patch has been here since
2016. We didn't get consensus for syntax for 8 years. Frankly
speaking, I don't see a reason why this shouldn't take another 8
years. At the same time the ability to wait on standby given LSN is
replayed seems like pretty basic and simple functionality. Thus, it's
quite frustrating it already took that long and still unclear when/how
this could be finished.

My current attempt was to commit minimal implementation as less
invasive as possible. A new clause for BEGIN doesn't require
additional keywords and doesn't introduce additional statements. But
yes, this is still a new qual. And, yes, Amit you're right that even
if I had committed that, there was still a high risk of further
debates and revert.

Given my specsis about agreement over syntax, I'd like to check
another time if we could go without new syntax at all. There was an
attempt to implement waiting for lsn as a function. But function
holds a snapshot, which could prevent WAL records from being replayed.
Releasing a snapshot could break the parent query. But now we have
procedures, which need a dedicated statement for the call and can even
control transactions. Could we implement a waitlsn in a procedure
that:

1. First, check that it was called with non-atomic context (that is,
it's not called within a transaction). Trigger error if called with
atomic context.
2. Release a snapshot to be able to wait without risk of WAL replay
stuck. Procedure is still called within the snapshot. It's a bit of
a hack to release a snapshot, but Vacuum statements already do so.

Amit, Bharath, what do you think about this approach? Is this a way to go?

------
Regards,
Alexander Korotkov

#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Korotkov (#36)

On Sun, Mar 17, 2024 at 7:40 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

In general, it seems this patch has been stuck for a long time on the
decision to choose an appropriate UI (syntax), and we thought of
moving it further so that the other parts of the patch can be
reviewed/discussed. So, I feel before pushing this we should see
comments from a few (at least two) other senior members who earlier
shared their opinion on the syntax. I know we don't have much time
left but OTOH pushing such a change (where we didn't have a consensus
on syntax) without much discussion at this point of time could lead to
discussions after commit.

+1 to gain consensus first on the syntax changes. With this, we might
be violating the SQL standard for explicit txn commands (I stand for
correction about the SQL standard though).

Thank you for your feedback. Generally, I agree it's correct to get
consensus on syntax first. And yes, this patch has been here since
2016. We didn't get consensus for syntax for 8 years. Frankly
speaking, I don't see a reason why this shouldn't take another 8
years. At the same time the ability to wait on standby given LSN is
replayed seems like pretty basic and simple functionality. Thus, it's
quite frustrating it already took that long and still unclear when/how
this could be finished.

My current attempt was to commit minimal implementation as less
invasive as possible. A new clause for BEGIN doesn't require
additional keywords and doesn't introduce additional statements. But
yes, this is still a new qual. And, yes, Amit you're right that even
if I had committed that, there was still a high risk of further
debates and revert.

Given my specsis about agreement over syntax, I'd like to check
another time if we could go without new syntax at all. There was an
attempt to implement waiting for lsn as a function. But function
holds a snapshot, which could prevent WAL records from being replayed.
Releasing a snapshot could break the parent query. But now we have
procedures, which need a dedicated statement for the call and can even
control transactions. Could we implement a waitlsn in a procedure
that:

1. First, check that it was called with non-atomic context (that is,
it's not called within a transaction). Trigger error if called with
atomic context.
2. Release a snapshot to be able to wait without risk of WAL replay
stuck. Procedure is still called within the snapshot. It's a bit of
a hack to release a snapshot, but Vacuum statements already do so.

Can you please provide a bit more details with some example what is
the existing problem with functions and how using procedures will
resolve it? How will this this address the implicit transaction case
or do we have any other workaround for those cases?

--
With Regards,
Amit Kapila.

#38Alexander Korotkov
aekorotkov@gmail.com
In reply to: Amit Kapila (#37)
1 attachment(s)

On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Mar 17, 2024 at 7:40 PM Alexander Korotkov <aekorotkov@gmail.com>

wrote:

On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com>

wrote:

In general, it seems this patch has been stuck for a long time on

the

decision to choose an appropriate UI (syntax), and we thought of
moving it further so that the other parts of the patch can be
reviewed/discussed. So, I feel before pushing this we should see
comments from a few (at least two) other senior members who earlier
shared their opinion on the syntax. I know we don't have much time
left but OTOH pushing such a change (where we didn't have a

consensus

on syntax) without much discussion at this point of time could lead

to

discussions after commit.

+1 to gain consensus first on the syntax changes. With this, we might
be violating the SQL standard for explicit txn commands (I stand for
correction about the SQL standard though).

Thank you for your feedback. Generally, I agree it's correct to get
consensus on syntax first. And yes, this patch has been here since
2016. We didn't get consensus for syntax for 8 years. Frankly
speaking, I don't see a reason why this shouldn't take another 8
years. At the same time the ability to wait on standby given LSN is
replayed seems like pretty basic and simple functionality. Thus, it's
quite frustrating it already took that long and still unclear when/how
this could be finished.

My current attempt was to commit minimal implementation as less
invasive as possible. A new clause for BEGIN doesn't require
additional keywords and doesn't introduce additional statements. But
yes, this is still a new qual. And, yes, Amit you're right that even
if I had committed that, there was still a high risk of further
debates and revert.

Given my specsis about agreement over syntax, I'd like to check
another time if we could go without new syntax at all. There was an
attempt to implement waiting for lsn as a function. But function
holds a snapshot, which could prevent WAL records from being replayed.
Releasing a snapshot could break the parent query. But now we have
procedures, which need a dedicated statement for the call and can even
control transactions. Could we implement a waitlsn in a procedure
that:

1. First, check that it was called with non-atomic context (that is,
it's not called within a transaction). Trigger error if called with
atomic context.
2. Release a snapshot to be able to wait without risk of WAL replay
stuck. Procedure is still called within the snapshot. It's a bit of
a hack to release a snapshot, but Vacuum statements already do so.

Can you please provide a bit more details with some example what is
the existing problem with functions and how using procedures will
resolve it? How will this this address the implicit transaction case
or do we have any other workaround for those cases?

Please check [1] and [2] for the explanation of the problem with functions.

Also, please find a draft patch implementing the procedure. The issue with
the snapshot is addressed with the following lines.

We first ensure we're in a non-atomic context, then pop an active snapshot
(tricky, but ExecuteVacuum() does the same). Then we should have no active
snapshot and it's safe to wait for lsn replay.

if (context->atomic)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("pg_wait_lsn() must be only called in non-atomic
context")));

if (ActiveSnapshotSet())
PopActiveSnapshot();
Assert(!ActiveSnapshotSet());

The function call could be added either before the BEGIN statement or
before the implicit transaction.

CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;

Links
1.
/messages/by-id/CAPpHfduBSN8j5j5Ynn5x=ThD=8ypNd53D608VXGweBsPzxPvqA@mail.gmail.com
2.
/messages/by-id/CAPpHfdtiGgn0iS1KbW2HTam-1+oK+vhXZDAcnX9hKaA7Oe=F-A@mail.gmail.com

------
Regards,
Alexander Korotkov

Attachments:

v11-0001-Implement-AFTER-clause-for-BEGIN-command.patchapplication/octet-stream; name=v11-0001-Implement-AFTER-clause-for-BEGIN-command.patchDownload
From 462da94d1734a04cb55fc6f2792dfb87d8b8e9e7 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 5 Mar 2024 13:35:03 +0200
Subject: [PATCH v11] Implement AFTER clause for BEGIN command

The new clause is to be used on standby and specifies waiting for the specific
WAL location to be replayed before starting the transaction.   This option is
useful when the user makes some data changes on primary and needs a guarantee
to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin
---
 doc/src/sgml/ref/begin.sgml               |  44 ++-
 doc/src/sgml/ref/start_transaction.sgml   |   5 +-
 src/backend/access/transam/xlogrecovery.c |   7 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/waitlsn.c            | 327 ++++++++++++++++++++++
 src/backend/storage/ipc/ipci.c            |   7 +
 src/backend/storage/lmgr/proc.c           |   6 +
 src/include/catalog/pg_proc.dat           |   5 +
 src/include/commands/waitlsn.h            |  24 ++
 src/test/recovery/meson.build             |   1 +
 src/test/recovery/t/043_wait_lsn.pl       |  62 ++++
 src/tools/pgindent/typedefs.list          |   2 +
 13 files changed, 491 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wait_lsn.pl

diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b0214874..4ed7002683e 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN <replaceable class="parameter">number_of_milliseconds</replaceable> ]
 </synopsis>
  </refsynopsisdiv>
 
@@ -78,6 +81,36 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term><literal>AFTER</literal> <replaceable class="parameter">lsn_value</replaceable></term>
+    <listitem>
+     <para>
+       <command>AFTER</command> clause is used on standby in
+       <link linkend="streaming-replication">physical streaming replication</link>
+       and specifies waiting for the specific WAL location (<acronym>LSN</acronym>)
+       to be replayed before starting the transaction.
+     </para>
+     <para>
+       This option is useful when the user makes some data changes on primary
+       and needs a guarantee to see these changes on standby.  The LSN to wait
+       could be obtained on the primary using
+       <link linkend="functions-admin-backup"><function>pg_current_wal_insert_lsn</function></link>
+       function after committing the relevant changes.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>WITHIN</literal> <replaceable class="parameter">number_of_milliseconds</replaceable></term>
+    <listitem>
+     <para>
+       Provides the timeout for the <command>AFTER</command> clause.
+       Especially helpful to prevent freezing on streaming replication
+       connection failures.
+     </para>
+    </listitem>
+   </varlistentry>
   </variablelist>
 
   <para>
@@ -123,6 +156,15 @@ BEGIN [ WORK | TRANSACTION ] [ <replaceable class="parameter">transaction_mode</
 <programlisting>
 BEGIN;
 </programlisting></para>
+
+  <para>
+   To begin a transaction block after replaying the given <acronym>LSN</acronym>.
+   The command will be canceled if the given <acronym>LSN</acronym> is not
+   reached within the timeout of one second.
+<programlisting>
+BEGIN AFTER '0/3F0FF791' WITHIN 1000;
+</programlisting></para>
+
  </refsect1>
 
  <refsect1>
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e3456..46a3bcf1a80 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ]
+START TRANSACTION [ <replaceable class="parameter">transaction_mode</replaceable> [, ...] ] <replaceable class="parameter">wait_event</replaceable>
 
 <phrase>where <replaceable class="parameter">transaction_mode</replaceable> is one of:</phrase>
 
     ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
     READ WRITE | READ ONLY
     [ NOT ] DEFERRABLE
+
+<phrase>where <replaceable class="parameter">wait_event</replaceable> is:</phrase>
+    AFTER <replaceable class="parameter">lsn_value</replaceable> [ WITHIN number_of_milliseconds ]
 </synopsis>
  </refsynopsisdiv>
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..c211230f6b5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,12 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..691e96439cc
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,327 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  BEGIN AFTER ... [ WITHIN ... ] clause.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include <float.h>
+#include <math.h>
+#include "postgres.h"
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "executor/spi.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+static WaitLSNState *state = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	state = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											 WaitLSNShmemSize(),
+											 &found);
+	if (!found)
+	{
+		SpinLockInit(&state->mutex);
+		state->numWaitedProcs = 0;
+		pg_atomic_init_u64(&state->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&state->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		if (state->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = state->procInfos[i];
+			state->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	state->procInfos[i] = cur;
+	state->numWaitedProcs++;
+
+	pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&state->mutex);
+
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		if (state->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < state->numWaitedProcs - 1)
+		{
+			state->procInfos[i] = state->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&state->mutex);
+		return;
+	}
+	state->numWaitedProcs--;
+
+	if (state->numWaitedProcs != 0)
+		pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&state->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&state->mutex);
+}
+
+/* Set all latches in shared memory to signal that new LSN has been replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	if (!state)
+		return;
+
+	/*
+	 * Fast check if we reached the LSN at least one process is waiting for.
+	 * This is important because saves us from acquiring the spinlock for
+	 * every WAL record replayed.
+	 */
+	if (pg_atomic_read_u64(&state->minLSN) > curLSN)
+		return;
+
+	SpinLockAcquire(&state->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < state->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (state->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(state->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < state->numWaitedProcs - numWakeUpProcs; i++)
+		state->procInfos[i] = state->procInfos[i + numWakeUpProcs];
+	state->numWaitedProcs -= numWakeUpProcs;
+
+	if (state->numWaitedProcs != 0)
+		pg_atomic_write_u64(&state->minLSN, state->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&state->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, int millisecs)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(state);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + millisecs * 1000;
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (millisecs > 0)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_CLIENT_READ);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr		trg_lsn = PG_GETARG_LSN(0);
+	uint64_t		delay = PG_GETARG_INT32(1);
+	CallContext	   *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(trg_lsn, delay);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 700f7daf7b2..4cd85da1f9b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12128,6 +12128,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..97f50bc4c77
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "tcop/dest.h"
+
+extern void WaitForLSN(XLogRecPtr lsn, int millisecs);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..bc47c93902c 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 00000000000..04b45eefec4
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,62 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary AFTER");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "CALL pg_wait_lsn('${lsn1}', 1);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index aa7a25b8f8c..7a2ac178bc3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3035,6 +3035,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Alexander Korotkov (#38)

On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

1. First, check that it was called with non-atomic context (that is,
it's not called within a transaction). Trigger error if called with
atomic context.
2. Release a snapshot to be able to wait without risk of WAL replay
stuck. Procedure is still called within the snapshot. It's a bit of
a hack to release a snapshot, but Vacuum statements already do so.

Can you please provide a bit more details with some example what is
the existing problem with functions and how using procedures will
resolve it? How will this this address the implicit transaction case
or do we have any other workaround for those cases?

Please check [1] and [2] for the explanation of the problem with functions.

Also, please find a draft patch implementing the procedure. The issue with the snapshot is addressed with the following lines.

We first ensure we're in a non-atomic context, then pop an active snapshot (tricky, but ExecuteVacuum() does the same). Then we should have no active snapshot and it's safe to wait for lsn replay.

if (context->atomic)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("pg_wait_lsn() must be only called in non-atomic context")));

if (ActiveSnapshotSet())
PopActiveSnapshot();
Assert(!ActiveSnapshotSet());

The function call could be added either before the BEGIN statement or before the implicit transaction.

CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;

I haven't thought in detail about whether there are any other problems
with this idea but sounds like it should solve the problems you shared
with a function call approach. BTW, if the application has to anyway
know the LSN till where replica needs to wait, why can't they simply
monitor the pg_last_wal_replay_lsn() value?

--
With Regards,
Amit Kapila.

#40Alexander Korotkov
aekorotkov@gmail.com
In reply to: Amit Kapila (#39)

On Tue, Mar 19, 2024 at 1:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Mar 18, 2024 at 3:24 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Mar 18, 2024 at 5:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

1. First, check that it was called with non-atomic context (that is,
it's not called within a transaction). Trigger error if called with
atomic context.
2. Release a snapshot to be able to wait without risk of WAL replay
stuck. Procedure is still called within the snapshot. It's a bit of
a hack to release a snapshot, but Vacuum statements already do so.

Can you please provide a bit more details with some example what is
the existing problem with functions and how using procedures will
resolve it? How will this this address the implicit transaction case
or do we have any other workaround for those cases?

Please check [1] and [2] for the explanation of the problem with functions.

Also, please find a draft patch implementing the procedure. The issue with the snapshot is addressed with the following lines.

We first ensure we're in a non-atomic context, then pop an active snapshot (tricky, but ExecuteVacuum() does the same). Then we should have no active snapshot and it's safe to wait for lsn replay.

if (context->atomic)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("pg_wait_lsn() must be only called in non-atomic context")));

if (ActiveSnapshotSet())
PopActiveSnapshot();
Assert(!ActiveSnapshotSet());

The function call could be added either before the BEGIN statement or before the implicit transaction.

CALL pg_wait_lsn('my_lsn', my_timeout); BEGIN;
CALL pg_wait_lsn('my_lsn', my_timeout); SELECT ...;

I haven't thought in detail about whether there are any other problems
with this idea but sounds like it should solve the problems you shared
with a function call approach. BTW, if the application has to anyway
know the LSN till where replica needs to wait, why can't they simply
monitor the pg_last_wal_replay_lsn() value?

Amit, thank you for your feedback.

Yes, the application can monitor pg_last_wal_replay_lsn() value,
that's our state of the art solution. But that's rather inconvenient
and takes extra latency and network traffic. And it can't be wrapped
into a server-side function in procedural language for the reasons we
can't implement it as a built-in function.

------
Regards,
Alexander Korotkov

#41Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#40)
1 attachment(s)

Intro
==========
The main purpose of the feature is to achieve
read-your-writes-consistency, while using async replica for reads and
primary for writes. In that case lsn of last modification is stored
inside application. We cannot store this lsn inside database, since
reads are distributed across all replicas and primary.

Procedure style implementation
==========
/messages/by-id/27171.1586439221@sss.pgh.pa.us
/messages/by-id/20210121.173009.235021120161403875.horikyota.ntt@gmail.com

CALL pg_wait_lsn(‘LSN’, timeout);

Examples
==========

primary standby
------- --------
postgresql.conf
recovery_min_apply_delay = 10s

CREATE TABLE tbl AS SELECT generate_series(1,10) AS a;
INSERT INTO tbl VALUES (generate_series(11, 20));
SELECT pg_current_wal_lsn();

CALL pg_wait_lsn('0/3002AE8', 10000);
BEGIN;
SELECT * FROM tbl; // read fresh insertions
COMMIT;

Fixed and ready to review.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v12_0001-Procedure-wait-lsn.patchtext/x-diff; name=v12_0001-Procedure-wait-lsn.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5b225ccf4f..cb1c175c62 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27861,6 +27861,19 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         extension.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wait_lsn</primary>
+        </indexterm>
+        <function>pg_wait_lsn</function> (wait_lsn pg_lsn, timeout bigint)
+       </para>
+       <para>
+        If <parameter>timeout</parameter> <= 0 then timeout is off.
+        Returns ERROR if target <parameter>wait_lsn</parameter> was not replayed.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0b783dc733 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..f9b701c946
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,309 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_lsn(wait_lsn pg_lsn, timeout int).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, uint64 millisecs)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	if (millisecs > 86400000)
+	{
+		millisecs = 86400000
+		elog(WARNING,"Timeout for pg_wait_lsn() restricted to 1 day");
+	}
+
+	endtime = GetCurrentTimestamp() + millisecs * 1000;
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (millisecs > 0)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	trg_lsn = PG_GETARG_LSN(0);
+	uint64_t	delay = PG_GETARG_INT32(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(trg_lsn, delay);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..5aed90c935 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919d..4b830dc3c8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 177d81a891..21222bf18c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12135,6 +12135,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..192e434d87
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structure */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+}			WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, uint64 millisecs);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..bc47c93902 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 0000000000..04b45eefec
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,62 @@
+# Checks waiting for lsn on standby AFTER
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that AFTER works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that AFTER is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary AFTER");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "CALL pg_wait_lsn('${lsn1}', 1);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#42Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Bharath Rupireddy (#33)
1 attachment(s)

Bharath Rupireddy, thank you for you review.
But here is some points.

On 2024-03-16 10:02, Bharath Rupireddy wrote:

4.1 With invalid LSN succeeds, shouldn't it error out? Or at least,
add a fast path/quick exit to WaitForLSN()?
BEGIN AFTER '0/0';

In postgresql '0/0' is Valid pg_lsn, but it is always reached.

4.2 With an unreasonably high future LSN, BEGIN command waits
unboundedly, shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() error out?
BEGIN AFTER '0/FFFFFFFF';
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn';

This case will give ERROR cause '0/FFFFFFFF' + 1 is invalid pg_lsn

4.3 With an unreasonably high wait time, BEGIN command waits
unboundedly, shouldn't we restrict the wait time to some max value,
say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

Good idea, I put it 1 day. But this limit we should to discuss.

6.
+/* Set all latches in shared memory to signal that new LSN has been 
replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{

I see this patch is waking up all the waiters in the recovery path
after applying every WAL record, which IMO is a hot path. Is the
impact of this change on recovery measured, perhaps using
https://github.com/macdice/redo-bench or similar tools?

7. In continuation to comment #6, why not use Conditional Variables
instead of proc latches to sleep and wait for all the waiters in
WaitLSNSetLatches?

Waiters are stored in the array sorted by LSN. This help us to wake
only PIDs with replayed LSN. This saves us from scanning of whole
array. So it`s not so hot path.

Add some fixes

1) make waiting timeont more simple (as pg_terminate_backend())
2) removed the 1 minute wait because INTERRUPTS don’t arrive for a
long time, changed it to 0.5 seconds
3) add more tests
4) added and expanded sections in the documentation
5) add default variant of timeout
pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)
6) now big timeout will be restricted to 1 day (86400000ms)
CALL pg_wait_lsn('0/34FB5A1',10000000000);
WARNING: Timeout for pg_wait_lsn() restricted to 1 day

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v13_0001-Procedure-wait-lsn.patchtext/x-diff; name=v13_0001-Procedure-wait-lsn.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5b225ccf4f..106bca9b73 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27861,6 +27861,19 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         extension.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wait_lsn</primary>
+        </indexterm>
+        <function>pg_wait_lsn</function> (trg_lsn pg_lsn, delay int8 DEFAULT 0)
+       </para>
+       <para>
+        If <parameter>timeout</parameter> <= 0 then timeout is off.
+        Returns ERROR if target <parameter>wait_lsn</parameter> was not replayed.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0b783dc733 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46..ff82dffac0 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wait_lsn';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..b07d756f7a
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,313 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_lsn(wait_lsn pg_lsn, timeout int).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, uint64 millisecs)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	int64		waittime = 500;
+	int64		remainingtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* Bharat Rupireddy suggested limiting time-out to one day */
+	if (millisecs > 86400000)
+	{
+		millisecs = 86400000;
+		elog(WARNING,"Timeout for pg_wait_lsn() restricted to 1 day");
+	}
+
+	remainingtime = millisecs;
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		/* If no timeout is set then wake up in 1 minute for interrupts */
+		if (millisecs <= 0)
+			remainingtime = 60000;
+
+		if (remainingtime < waittime)
+			waittime = remainingtime;
+
+		if (waittime <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, waittime,
+					   WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+
+		remainingtime -= waittime;
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	trg_lsn = PG_GETARG_LSN(0);
+	uint64_t	delay = PG_GETARG_INT32(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(trg_lsn, delay);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..5aed90c935 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919d..4b830dc3c8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 177d81a891..21222bf18c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12135,6 +12135,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..192e434d87
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structure */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+}			WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, uint64 millisecs);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..bc47c93902 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 0000000000..7110159756
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,77 @@
+# Checks waiting for lsn on standby pg_wait_lsn()
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary pg_wait_lsn()");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "CALL pg_wait_lsn('${lsn1}', 1);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if current LSN replayed.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn3}');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn3}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#43Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#42)

On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan <i.kartyshov@postgrespro.ru>
wrote:

4.2 With an unreasonably high future LSN, BEGIN command waits
unboundedly, shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() error out?

I think limiting wait lsn by current received lsn would destroy the whole
value of this feature. The value is to wait till given LSN is replayed,
whether it's already received or not.

BEGIN AFTER '0/FFFFFFFF';
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn';

This case will give ERROR cause '0/FFFFFFFF' + 1 is invalid pg_lsn

FWIW,

# SELECT '0/FFFFFFFF'::pg_lsn + 1;
?column?
----------
1/0
(1 row)

But I don't see a problem here. On the replica, it's out of our control to
check which lsn is good and which is not. We can't check whether the lsn,
which is in future for the replica, is already issued by primary.

For the case of wrong lsn, which could cause potentially infinite wait,
there is the timeout and the manual query cancel.

4.3 With an unreasonably high wait time, BEGIN command waits
unboundedly, shouldn't we restrict the wait time to some max value,
say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

Good idea, I put it 1 day. But this limit we should to discuss.

Do you think that specifying timeout in milliseconds is suitable? I would
prefer to switch to seconds (with ability to specify fraction of second).
This was expressed before by Alexander Lakhin.

6.
+/* Set all latches in shared memory to signal that new LSN has been
replayed */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{

I see this patch is waking up all the waiters in the recovery path
after applying every WAL record, which IMO is a hot path. Is the
impact of this change on recovery measured, perhaps using
https://github.com/macdice/redo-bench or similar tools?

Ivan, could you do this?

7. In continuation to comment #6, why not use Conditional Variables
instead of proc latches to sleep and wait for all the waiters in
WaitLSNSetLatches?

Waiters are stored in the array sorted by LSN. This help us to wake
only PIDs with replayed LSN. This saves us from scanning of whole
array. So it`s not so hot path.

+1
This saves us from ConditionVariableBroadcast() every time we replay the
WAL record.

Add some fixes

1) make waiting timeont more simple (as pg_terminate_backend())
2) removed the 1 minute wait because INTERRUPTS don’t arrive for a
long time, changed it to 0.5 seconds

I don't see this change in the patch. Normally if a process gets a signal,
that causes WaitLatch() to exit immediately. It also exists immediately on
query cancel. IIRC, this 1 minute timeout is needed to handle some extreme
cases when an interrupt is missing. Other places have it equal to 1
minute. I don't see why we should have it different.

3) add more tests
4) added and expanded sections in the documentation

I don't see this in the patch. I see only a short description
in func.sgml, which is definitely not sufficient. We need at least
everything we have in the docs before to be adjusted with the current
approach of procedure.

5) add default variant of timeout
pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)

Does zero here mean no timeout? I think this should be documented. Also,
I would prefer to see the timeout by default. Probably one minute would be
good for default.

6) now big timeout will be restricted to 1 day (86400000ms)
CALL pg_wait_lsn('0/34FB5A1',10000000000);
WARNING: Timeout for pg_wait_lsn() restricted to 1 day

I don't think we need to mention individuals, who made proposals, in the
source code comments. Otherwise, our source code would be a crazy mess of
names. Also, if this is the restriction, it has to be an error. And it
should be a proper full ereport().

------
Regards,
Alexander Korotkov

#44Peter Eisentraut
peter@eisentraut.org
In reply to: Kartyshov Ivan (#41)

On 19.03.24 18:38, Kartyshov Ivan wrote:

             CALL pg_wait_lsn('0/3002AE8', 10000);
             BEGIN;
             SELECT * FROM tbl; // read fresh insertions
             COMMIT;

I'm not endorsing this or any other approach, but I think the timeout
parameter should be of type interval, not an integer with a unit that is
hidden in the documentation.

#45Peter Eisentraut
peter@eisentraut.org
In reply to: Alexander Korotkov (#36)

On 17.03.24 15:09, Alexander Korotkov wrote:

My current attempt was to commit minimal implementation as less
invasive as possible. A new clause for BEGIN doesn't require
additional keywords and doesn't introduce additional statements. But
yes, this is still a new qual. And, yes, Amit you're right that even
if I had committed that, there was still a high risk of further
debates and revert.

I had written in [0]/messages/by-id/8b5b172f-0ae7-d644-8358-e2851dded43b@enterprisedb.com about my questions related to using this with
connection poolers. I don't think this was addressed at all. I haven't
seen any discussion about how to make this kind of facility usable in a
full system. You have to manually query and send LSNs; that seems
pretty cumbersome. Sure, this is part of something that could be
useful, but how would an actual user with actual application code get to
use this?

[0]: /messages/by-id/8b5b172f-0ae7-d644-8358-e2851dded43b@enterprisedb.com
/messages/by-id/8b5b172f-0ae7-d644-8358-e2851dded43b@enterprisedb.com

#46Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Peter Eisentraut (#45)
1 attachment(s)

On 2024-03-20 12:11, Alexander Korotkov wrote:

On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

4.2 With an unreasonably high future LSN, BEGIN command waits
unboundedly, shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() error out?

I think limiting wait lsn by current received lsn would destroy the
whole value of this feature. The value is to wait till given LSN is
replayed, whether it's already received or not.

Ok sounds reasonable, I`ll rollback the changes.

But I don't see a problem here. On the replica, it's out of our
control to check which lsn is good and which is not. We can't check
whether the lsn, which is in future for the replica, is already issued
by primary.

For the case of wrong lsn, which could cause potentially infinite
wait, there is the timeout and the manual query cancel.

Fully agree with this take.

4.3 With an unreasonably high wait time, BEGIN command waits
unboundedly, shouldn't we restrict the wait time to some max

value,

say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

Good idea, I put it 1 day. But this limit we should to discuss.

Do you think that specifying timeout in milliseconds is suitable? I
would prefer to switch to seconds (with ability to specify fraction of
second). This was expressed before by Alexander Lakhin.

It sounds like an interesting idea. Please review the result.

https://github.com/macdice/redo-bench or similar tools?

Ivan, could you do this?

Yes, test redo-bench/crash-recovery.sh
This patch on master
91.327, 1.973
105.907, 3.338
98.412, 4.579
95.818, 4.19

REL_13-STABLE
116.645, 3.005
113.212, 2.568
117.644, 3.183
111.411, 2.782

master
124.712, 2.047
117.012, 1.736
116.328, 2.035
115.662, 1.797

Strange behavior, patched version is faster then REL_13-STABLE and
master.

I don't see this change in the patch. Normally if a process gets a
signal, that causes WaitLatch() to exit immediately. It also exists
immediately on query cancel. IIRC, this 1 minute timeout is needed to
handle some extreme cases when an interrupt is missing. Other places
have it equal to 1 minute. I don't see why we should have it
different.

Ok, I`ll rollback my changes.

4) added and expanded sections in the documentation

I don't see this in the patch. I see only a short description in
func.sgml, which is definitely not sufficient. We need at least
everything we have in the docs before to be adjusted with the current
approach of procedure.

I didn't find another section where to add the description of
pg_wait_lsn().
So I extend description on the bottom of the table.

5) add default variant of timeout
pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)

Does zero here mean no timeout? I think this should be documented.
Also, I would prefer to see the timeout by default. Probably one
minute would be good for default.

Lets discuss this point. Loop in function WaitForLSN is made that way,
if we choose delay=0, only then we can wait infinitely to wait LSN
without timeout. So default must be 0.

Please take one more look on the patch.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v14_0001-Procedure-wait-lsn.patchtext/x-diff; name=v14_0001-Procedure-wait-lsn.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5b225ccf4f..6e1f69962d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27861,10 +27861,54 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         extension.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wait_lsn</primary>
+        </indexterm>
+        <function>pg_wait_lsn</function> (trg_lsn pg_lsn, delay int8 DEFAULT 0)
+       </para>
+       <para>
+        Returns ERROR if target LSN was not replayed on standby.
+        Parameter <parameter>delay</parameter> sets seconds, the time to wait to the LSN.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
 
+   <para>
+    <function>pg_wait_lsn</function> waits till <parameter>wait_lsn</parameter>
+    to be replayed on standby to achieve read-your-writes-consistency, while
+    using async replica for reads and primary for writes. In that case lsn of
+    last modification is stored inside application.
+   </para>
+
+   <para>
+    You can use <function>pg_wait_lsn</function> to wait the <type>pg_lsn</type>
+    value. For example:
+   <programlisting>
+   Replica:
+   postgresql.conf
+     recovery_min_apply_delay = 10s;
+
+   Primary: Application update table and get lsn of changes
+   postgres=# UPDATE films SET kind = 'Dramatic' WHERE kind = 'Drama';
+   postgres=# SELECT pg_current_wal_lsn();
+   pg_current_wal_lsn
+   --------------------
+   0/306EE20
+   (1 row)
+
+   Replica: Wait and read updated data
+   postgres=# CALL pg_wait_lsn('0/306EE20', 100); // Timeout 100ms is insufficent
+   ERROR:  canceling waiting for LSN due to timeout
+   postgres=# CALL pg_wait_lsn('0/306EE20');
+   postgres=# SELECT * FROM films WHERE kind = 'Drama';
+   </programlisting>
+   </para>
+
    <para>
     The functions shown in <xref
     linkend="functions-recovery-control-table"/> control the progress of recovery.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0b783dc733 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46..5c4f9c78a1 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wait_lsn(trg_lsn pg_lsn, delay float8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wait_lsn';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..bd9ea6b7db
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_lsn(wait_lsn pg_lsn, timeout int).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, float8 sec)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + (int64)(sec * 1000 * 1000);
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (sec >= 1)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	trg_lsn = PG_GETARG_LSN(0);
+	float8		delay = PG_GETARG_FLOAT8(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(trg_lsn, delay);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..5aed90c935 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919d..4b830dc3c8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 177d81a891..0d9c012e52 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12135,6 +12135,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn float8', proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..53ab9b7b6b
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structure */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+}			WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, float8 sec);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..bc47c93902 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 0000000000..1875702c93
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,77 @@
+# Checks waiting for lsn on standby pg_wait_lsn()
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 3;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary pg_wait_lsn()");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "CALL pg_wait_lsn('${lsn1}', 1);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if current LSN replayed.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn3}');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn3}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#47Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Kartyshov Ivan (#46)
1 attachment(s)

Thank you for your feedback.

On 2024-03-20 12:11, Alexander Korotkov wrote:

On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

4.2 With an unreasonably high future LSN, BEGIN command waits
unboundedly, shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() error out?

I think limiting wait lsn by current received lsn would destroy the
whole value of this feature. The value is to wait till given LSN is
replayed, whether it's already received or not.

Ok sounds reasonable, I`ll rollback the changes.

But I don't see a problem here. On the replica, it's out of our
control to check which lsn is good and which is not. We can't check
whether the lsn, which is in future for the replica, is already issued
by primary.

For the case of wrong lsn, which could cause potentially infinite
wait, there is the timeout and the manual query cancel.

Fully agree with this take.

4.3 With an unreasonably high wait time, BEGIN command waits
unboundedly, shouldn't we restrict the wait time to some max

value,

say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

Good idea, I put it 1 day. But this limit we should to discuss.

Do you think that specifying timeout in milliseconds is suitable? I
would prefer to switch to seconds (with ability to specify fraction of
second). This was expressed before by Alexander Lakhin.

It sounds like an interesting idea. Please review the result.

https://github.com/macdice/redo-bench or similar tools?

Ivan, could you do this?

Yes, test redo-bench/crash-recovery.sh
This patch on master
91.327, 1.973
105.907, 3.338
98.412, 4.579
95.818, 4.19

REL_13-STABLE
116.645, 3.005
113.212, 2.568
117.644, 3.183
111.411, 2.782

master
124.712, 2.047
117.012, 1.736
116.328, 2.035
115.662, 1.797

Strange behavior, patched version is faster then REL_13-STABLE and
master.

I don't see this change in the patch. Normally if a process gets a
signal, that causes WaitLatch() to exit immediately. It also exists
immediately on query cancel. IIRC, this 1 minute timeout is needed to
handle some extreme cases when an interrupt is missing. Other places
have it equal to 1 minute. I don't see why we should have it
different.

Ok, I`ll rollback my changes.

4) added and expanded sections in the documentation

I don't see this in the patch. I see only a short description in
func.sgml, which is definitely not sufficient. We need at least
everything we have in the docs before to be adjusted with the current
approach of procedure.

I didn't find another section where to add the description of
pg_wait_lsn().
So I extend description on the bottom of the table.

5) add default variant of timeout
pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)

Does zero here mean no timeout? I think this should be documented.
Also, I would prefer to see the timeout by default. Probably one
minute would be good for default.

Lets discuss this point. Loop in function WaitForLSN is made that way,
if we choose delay=0, only then we can wait infinitely to wait LSN
without timeout. So default must be 0.

Please take one more look on the patch.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v14_0001-Procedure-wait-lsn.patchtext/x-diff; charset=us-ascii; name=v14_0001-Procedure-wait-lsn.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5b225ccf4f..6e1f69962d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27861,10 +27861,54 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         extension.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wait_lsn</primary>
+        </indexterm>
+        <function>pg_wait_lsn</function> (trg_lsn pg_lsn, delay int8 DEFAULT 0)
+       </para>
+       <para>
+        Returns ERROR if target LSN was not replayed on standby.
+        Parameter <parameter>delay</parameter> sets seconds, the time to wait to the LSN.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
 
+   <para>
+    <function>pg_wait_lsn</function> waits till <parameter>wait_lsn</parameter>
+    to be replayed on standby to achieve read-your-writes-consistency, while
+    using async replica for reads and primary for writes. In that case lsn of
+    last modification is stored inside application.
+   </para>
+
+   <para>
+    You can use <function>pg_wait_lsn</function> to wait the <type>pg_lsn</type>
+    value. For example:
+   <programlisting>
+   Replica:
+   postgresql.conf
+     recovery_min_apply_delay = 10s;
+
+   Primary: Application update table and get lsn of changes
+   postgres=# UPDATE films SET kind = 'Dramatic' WHERE kind = 'Drama';
+   postgres=# SELECT pg_current_wal_lsn();
+   pg_current_wal_lsn
+   --------------------
+   0/306EE20
+   (1 row)
+
+   Replica: Wait and read updated data
+   postgres=# CALL pg_wait_lsn('0/306EE20', 100); // Timeout 100ms is insufficent
+   ERROR:  canceling waiting for LSN due to timeout
+   postgres=# CALL pg_wait_lsn('0/306EE20');
+   postgres=# SELECT * FROM films WHERE kind = 'Drama';
+   </programlisting>
+   </para>
+
    <para>
     The functions shown in <xref
     linkend="functions-recovery-control-table"/> control the progress of recovery.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0b783dc733 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46..5c4f9c78a1 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wait_lsn(trg_lsn pg_lsn, delay float8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wait_lsn';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..bd9ea6b7db
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_lsn(wait_lsn pg_lsn, timeout int).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, float8 sec)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + (int64)(sec * 1000 * 1000);
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (sec >= 1)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	trg_lsn = PG_GETARG_LSN(0);
+	float8		delay = PG_GETARG_FLOAT8(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(trg_lsn, delay);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..5aed90c935 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919d..4b830dc3c8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 177d81a891..0d9c012e52 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12135,6 +12135,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn float8', proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..53ab9b7b6b
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structure */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+}			WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, float8 sec);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..bc47c93902 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 0000000000..1875702c93
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,77 @@
+# Checks waiting for lsn on standby pg_wait_lsn()
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 3;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary pg_wait_lsn()");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "CALL pg_wait_lsn('${lsn1}', 1);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if current LSN replayed.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn3}');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn3}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#48Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Peter Eisentraut (#45)
1 attachment(s)

Thank you for your feedback.

On 2024-03-20 12:11, Alexander Korotkov wrote:

On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

4.2 With an unreasonably high future LSN, BEGIN command waits
unboundedly, shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() error out?

I think limiting wait lsn by current received lsn would destroy the
whole value of this feature. The value is to wait till given LSN is
replayed, whether it's already received or not.

Ok sounds reasonable, I`ll rollback the changes.

But I don't see a problem here. On the replica, it's out of our
control to check which lsn is good and which is not. We can't check
whether the lsn, which is in future for the replica, is already issued
by primary.

For the case of wrong lsn, which could cause potentially infinite
wait, there is the timeout and the manual query cancel.

Fully agree with this take.

4.3 With an unreasonably high wait time, BEGIN command waits
unboundedly, shouldn't we restrict the wait time to some max

value,

say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

Good idea, I put it 1 day. But this limit we should to discuss.

Do you think that specifying timeout in milliseconds is suitable? I
would prefer to switch to seconds (with ability to specify fraction of
second). This was expressed before by Alexander Lakhin.

It sounds like an interesting idea. Please review the result.

https://github.com/macdice/redo-bench or similar tools?

Ivan, could you do this?

Yes, test redo-bench/crash-recovery.sh
This patch on master
91.327, 1.973
105.907, 3.338
98.412, 4.579
95.818, 4.19

REL_13-STABLE
116.645, 3.005
113.212, 2.568
117.644, 3.183
111.411, 2.782

master
124.712, 2.047
117.012, 1.736
116.328, 2.035
115.662, 1.797

Strange behavior, patched version is faster then REL_13-STABLE and
master.

I don't see this change in the patch. Normally if a process gets a
signal, that causes WaitLatch() to exit immediately. It also exists
immediately on query cancel. IIRC, this 1 minute timeout is needed to
handle some extreme cases when an interrupt is missing. Other places
have it equal to 1 minute. I don't see why we should have it
different.

Ok, I`ll rollback my changes.

4) added and expanded sections in the documentation

I don't see this in the patch. I see only a short description in
func.sgml, which is definitely not sufficient. We need at least
everything we have in the docs before to be adjusted with the current
approach of procedure.

I didn't find another section where to add the description of
pg_wait_lsn().
So I extend description on the bottom of the table.

5) add default variant of timeout
pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0)

Does zero here mean no timeout? I think this should be documented.
Also, I would prefer to see the timeout by default. Probably one
minute would be good for default.

Lets discuss this point. Loop in function WaitForLSN is made that way,
if we choose delay=0, only then we can wait infinitely to wait LSN
without timeout. So default must be 0.

Please take one more look on the patch.

PS sorry, the strange BUG throw my mails out of thread.
/messages/by-id/f2ff071aa9141405bb8efee67558a058@postgrespro.ru

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v14_0001-Procedure-wait-lsn.patchtext/x-diff; name=v14_0001-Procedure-wait-lsn.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5b225ccf4f..6e1f69962d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27861,10 +27861,54 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         extension.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wait_lsn</primary>
+        </indexterm>
+        <function>pg_wait_lsn</function> (trg_lsn pg_lsn, delay int8 DEFAULT 0)
+       </para>
+       <para>
+        Returns ERROR if target LSN was not replayed on standby.
+        Parameter <parameter>delay</parameter> sets seconds, the time to wait to the LSN.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
 
+   <para>
+    <function>pg_wait_lsn</function> waits till <parameter>wait_lsn</parameter>
+    to be replayed on standby to achieve read-your-writes-consistency, while
+    using async replica for reads and primary for writes. In that case lsn of
+    last modification is stored inside application.
+   </para>
+
+   <para>
+    You can use <function>pg_wait_lsn</function> to wait the <type>pg_lsn</type>
+    value. For example:
+   <programlisting>
+   Replica:
+   postgresql.conf
+     recovery_min_apply_delay = 10s;
+
+   Primary: Application update table and get lsn of changes
+   postgres=# UPDATE films SET kind = 'Dramatic' WHERE kind = 'Drama';
+   postgres=# SELECT pg_current_wal_lsn();
+   pg_current_wal_lsn
+   --------------------
+   0/306EE20
+   (1 row)
+
+   Replica: Wait and read updated data
+   postgres=# CALL pg_wait_lsn('0/306EE20', 100); // Timeout 100ms is insufficent
+   ERROR:  canceling waiting for LSN due to timeout
+   postgres=# CALL pg_wait_lsn('0/306EE20');
+   postgres=# SELECT * FROM films WHERE kind = 'Drama';
+   </programlisting>
+   </para>
+
    <para>
     The functions shown in <xref
     linkend="functions-recovery-control-table"/> control the progress of recovery.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0b783dc733 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46..5c4f9c78a1 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wait_lsn(trg_lsn pg_lsn, delay float8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wait_lsn';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..bd9ea6b7db
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_lsn(wait_lsn pg_lsn, timeout int).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, float8 sec)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + (int64)(sec * 1000 * 1000);
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (sec >= 1)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	trg_lsn = PG_GETARG_LSN(0);
+	float8		delay = PG_GETARG_FLOAT8(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(trg_lsn, delay);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..5aed90c935 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919d..4b830dc3c8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 177d81a891..0d9c012e52 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12135,6 +12135,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn float8', proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..53ab9b7b6b
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structure */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+}			WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, float8 sec);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..bc47c93902 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 0000000000..1875702c93
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,77 @@
+# Checks waiting for lsn on standby pg_wait_lsn()
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 3;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary pg_wait_lsn()");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "CALL pg_wait_lsn('${lsn1}', 1);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if current LSN replayed.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn3}');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn3}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#49Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Peter Eisentraut (#45)

On Fri, Mar 22, 2024 at 4:28 AM Peter Eisentraut <peter@eisentraut.org> wrote:

I had written in [0] about my questions related to using this with
connection poolers. I don't think this was addressed at all. I haven't
seen any discussion about how to make this kind of facility usable in a
full system. You have to manually query and send LSNs; that seems
pretty cumbersome. Sure, this is part of something that could be
useful, but how would an actual user with actual application code get to
use this?

[0]:
/messages/by-id/8b5b172f-0ae7-d644-8358-e2851dded43b@enterprisedb.com

I share the same concern as yours and had proposed something upthread
[1]: /messages/by-id/CALj2ACUfS7LH1PaWmSZ5KwH4BpQxO9izeMw4qC3a1DAwi6nfbQ@mail.gmail.com
beginning of txn/query (depending on isolation level), the same way
the standby can wait for the primary's current LSN as of the moment
(at the time of taking snapshot). And, primary keeps sending its
current LSN as part of regular WAL to standbys so that the standbys
doesn't have to make connections to the primary to know its current
LSN every time. Perhps, this may not even fully guarantee (considered
to be achieving) the read-after-write consistency on standbys unless
there's a way for the application to tell the wait LSN.

Thoughts?

[1]: /messages/by-id/CALj2ACUfS7LH1PaWmSZ5KwH4BpQxO9izeMw4qC3a1DAwi6nfbQ@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#50Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Bharath Rupireddy (#49)
1 attachment(s)

Thank you for your interest to the patch.
I understand you questions, but I fully support Alexander Korotkov idea
to commit the minimal required functionality. And then keep working on
other improvements.

On 2024-03-24 05:39, Bharath Rupireddy wrote:

On Fri, Mar 22, 2024 at 4:28 AM Peter Eisentraut <peter@eisentraut.org>
wrote:

I had written in [0] about my questions related to using this with
connection poolers. I don't think this was addressed at all. I
haven't
seen any discussion about how to make this kind of facility usable in
a
full system. You have to manually query and send LSNs; that seems
pretty cumbersome. Sure, this is part of something that could be
useful, but how would an actual user with actual application code get
to
use this?

[0]:
/messages/by-id/8b5b172f-0ae7-d644-8358-e2851dded43b@enterprisedb.com

But I wonder how a client is going to get the LSN. How would all of
this be used by a client? I can think of a scenarios where you have
an application that issues a bunch of SQL commands and you have some
kind of pooler in the middle that redirects those commands to
different hosts, and what you really want is to have it transparently
behave as if it's just a single host. Do we want to inject a bunch
of "SELECT pg_get_lsn()", "SELECT pg_wait_lsn()" calls into that?

As I understand your question, application make dml on the primary
server, get LSN of changes and send bunch SQL read-only commands to
pooler. Transparent behave we can get using #synchronous_commit, but
it is very slow.

I'm tempted to think this could be a protocol-layer facility. Every
query automatically returns the current LSN, and every query can also
send along an LSN to wait for, and the client library would just keep
track of the LSN for (what it thinks of as) the connection. So you
get some automatic serialization without having to modify your client
code.

Thank you, it is a good question for future versions.
You say about a protocol-layer facility, what you meen. May be we can
use signals, like hot_standby_feedback.

I share the same concern as yours and had proposed something upthread
[1]. The idea is something like how each query takes a snapshot at the
beginning of txn/query (depending on isolation level), the same way
the standby can wait for the primary's current LSN as of the moment
(at the time of taking snapshot). And, primary keeps sending its
current LSN as part of regular WAL to standbys so that the standbys
doesn't have to make connections to the primary to know its current
LSN every time. Perhps, this may not even fully guarantee (considered
to be achieving) the read-after-write consistency on standbys unless
there's a way for the application to tell the wait LSN.

Thoughts?

[1]
/messages/by-id/CALj2ACUfS7LH1PaWmSZ5KwH4BpQxO9izeMw4qC3a1DAwi6nfbQ@mail.gmail.com

+1 to have support for implicit txns. A strawman solution I can think
of is to let primary send its current insert LSN to the standby every
time it sends a bunch of WAL, and the standby waits for that LSN to be
replayed on it at the start of every implicit txn automatically.

And how standby will get lsn to wait for? All solutions I can think of
are very invasive and poorly scalable.

For example, every dml can send back LSN if dml is success. And
application could use it to wait actual changes.

The new BEGIN syntax requires application code changes. This led me to
think how one can achieve read-after-write consistency today in a
primary - standby set up. All the logic of this patch, that is, waiting
for the standby to pass a given primary LSN needs to be done in the
application code (or in proxy or in load balancer?). I believe there
might be someone doing this already, it's good to hear from them.

You may use #synchronous_commit mode but it slow. So my implementation
don`t make primary to wait all standby to sent its feedbacks.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v14_0002-Procedure-wait-lsn.patchtext/x-diff; name=v14_0002-Procedure-wait-lsn.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8ecc02f2b9..8042201bab 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28086,10 +28086,55 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         extension.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wait_lsn</primary>
+        </indexterm>
+        <function>pg_wait_lsn</function> (trg_lsn pg_lsn, delay int8 DEFAULT 0)
+       </para>
+       <para>
+        Returns ERROR if target LSN was not replayed on standby.
+        Parameter <parameter>delay</parameter> sets seconds, the time to wait to the LSN.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
 
+   <para>
+    <function>pg_wait_lsn</function> waits till <parameter>wait_lsn</parameter>
+    to be replayed on standby to achieve read-your-writes-consistency, while
+    using async replica for reads and primary for writes. In that case lsn of
+    last modification is stored inside application.
+    Note: pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0)
+   </para>
+
+   <para>
+    You can use <function>pg_wait_lsn</function> to wait the <type>pg_lsn</type>
+    value. For example:
+   <programlisting>
+   Replica:
+   postgresql.conf
+     recovery_min_apply_delay = 10s;
+
+   Primary: Application update table and get lsn of changes
+   postgres=# UPDATE films SET kind = 'Dramatic' WHERE kind = 'Drama';
+   postgres=# SELECT pg_current_wal_lsn();
+   pg_current_wal_lsn
+   --------------------
+   0/306EE20
+   (1 row)
+
+   Replica: Wait and read updated data
+   postgres=# CALL pg_wait_lsn('0/306EE20', 100); // Timeout 100ms is insufficent
+   ERROR:  canceling waiting for LSN due to timeout
+   postgres=# CALL pg_wait_lsn('0/306EE20');
+   postgres=# SELECT * FROM films WHERE kind = 'Drama';
+   </programlisting>
+   </para>
+
    <para>
     The functions shown in <xref
     linkend="functions-recovery-control-table"/> control the progress of recovery.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0b783dc733 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46..5c4f9c78a1 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wait_lsn(trg_lsn pg_lsn, delay float8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wait_lsn';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..bd9ea6b7db
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_lsn(wait_lsn pg_lsn, timeout int).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, float8 sec)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = GetCurrentTimestamp() + (int64)(sec * 1000 * 1000);
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (sec >= 1)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	trg_lsn = PG_GETARG_LSN(0);
+	float8		delay = PG_GETARG_FLOAT8(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(trg_lsn, delay);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..5aed90c935 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919d..4b830dc3c8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0d26e5b422..cb325b3be0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12138,6 +12138,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN until timeout',
+  proname => 'pg_wait_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn float8', proargnames => '{trg_lsn,delay}',
+  prosrc => 'pg_wait_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..53ab9b7b6b
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structure */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+}			WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+}			WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, float8 sec);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..bc47c93902 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 0000000000..1875702c93
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,77 @@
+# Checks waiting for lsn on standby pg_wait_lsn()
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 3;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if given no wait timeout.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary pg_wait_lsn()");
+
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "CALL pg_wait_lsn('${lsn1}', 1);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Make sure that pg_wait_lsn() works: add new content to primary and memorize
+# primary's new LSN, then wait for primary's LSN on standby. Prove that pg_wait_lsn() is
+# able to setup an infinite waiting loop and exit it if current LSN replayed.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_lsn('${lsn3}');
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn3}'::pg_lsn);
+]);
+
+# Get the current LSN on standby and make sure it's the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
#51Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Eisentraut (#44)

On Fri, Mar 22, 2024 at 12:50 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 19.03.24 18:38, Kartyshov Ivan wrote:

CALL pg_wait_lsn('0/3002AE8', 10000);
BEGIN;
SELECT * FROM tbl; // read fresh insertions
COMMIT;

I'm not endorsing this or any other approach, but I think the timeout
parameter should be of type interval, not an integer with a unit that is
hidden in the documentation.

I'm not sure a timeout needs to deal with complexity of our interval
datatype. At the same time, the integer number of milliseconds looks
a bit weird. Could the float8 number of seconds be an option?

------
Regards,
Alexander Korotkov

#52Alexander Korotkov
aekorotkov@gmail.com
In reply to: Peter Eisentraut (#45)

On Fri, Mar 22, 2024 at 12:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 17.03.24 15:09, Alexander Korotkov wrote:

My current attempt was to commit minimal implementation as less
invasive as possible. A new clause for BEGIN doesn't require
additional keywords and doesn't introduce additional statements. But
yes, this is still a new qual. And, yes, Amit you're right that even
if I had committed that, there was still a high risk of further
debates and revert.

I had written in [0] about my questions related to using this with
connection poolers. I don't think this was addressed at all. I haven't
seen any discussion about how to make this kind of facility usable in a
full system. You have to manually query and send LSNs; that seems
pretty cumbersome. Sure, this is part of something that could be
useful, but how would an actual user with actual application code get to
use this?

The current usage pattern of this functionality is the following.

1. Do the write transaction on primary
2. Query pg_current_wal_insert_lsn() on primary
3. Call pg_wait_lsn() with the value obtained on the previous step on replica
4. Do the read transaction of replica

This usage pattern could be implemented either on the application
level, or on the pooler level. For application level, it would
require a somewhat advanced level of database-aware programming, but
this is still a valid usage. Regarding poolers, if some poolers
manage to automatically distinguish reading and writing queries,
dealing with LSNs wouldn't be too complex for them.

Having this functionality on protocol level would be ideal, but let's
do this step-by-step. The built-in procedure isn't very invasive, but
that could give us some adoption and open the way forward.

------
Regards,
Alexander Korotkov

#53Alexander Korotkov
aekorotkov@gmail.com
In reply to: Bharath Rupireddy (#49)

On Sun, Mar 24, 2024 at 4:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I share the same concern as yours and had proposed something upthread
[1]. The idea is something like how each query takes a snapshot at the
beginning of txn/query (depending on isolation level), the same way
the standby can wait for the primary's current LSN as of the moment
(at the time of taking snapshot). And, primary keeps sending its
current LSN as part of regular WAL to standbys so that the standbys
doesn't have to make connections to the primary to know its current
LSN every time. Perhps, this may not even fully guarantee (considered
to be achieving) the read-after-write consistency on standbys unless
there's a way for the application to tell the wait LSN.

Oh, no. Please, check [1]. The idea is to wait for a particular
transaction to become visible. The one who made a change on primary
brings the lsn value from there to replica. For instance, an
application made a change on primary and then willing to run some
report on replica. And the report should be guaranteed to contain the
change just made. So, the application query the LSN from primary
after making a write transaction, then calls pg_wait_lsn() on
replicate before running the report.

This is quite simple server functionality, which could be used at
application-level, ORM-level or pooler-level. And it unlocks the way
forward for in-protocol implementation as proposed by Peter
Eisentraut.

Links.
1. /messages/by-id/CAPpHfdtny81end69PzEdRsROKnsybsj=Os8DUM-6HeKGKnCuQQ@mail.gmail.com

------
Regards,
Alexander Korotkov

#54Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#50)

On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Thank you for your interest to the patch.
I understand you questions, but I fully support Alexander Korotkov idea
to commit the minimal required functionality. And then keep working on
other improvements.

I did further improvements in the patch.

Notably, I decided to rename the procedure to
pg_wait_for_wal_replay_lsn(). This makes the name look consistent
with other WAL-related functions. Also it clearly states that we're
waiting for lsn to be replayed (not received, written or flushed).

Also, I did implements in the docs, commit message and some minor code fixes.

I'm continuing to look at this patch.

------
Regards,
Alexander Korotkov

#55Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#54)
1 attachment(s)

On Thu, Mar 28, 2024 at 2:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Mar 26, 2024 at 4:06 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Thank you for your interest to the patch.
I understand you questions, but I fully support Alexander Korotkov idea
to commit the minimal required functionality. And then keep working on
other improvements.

I did further improvements in the patch.

Notably, I decided to rename the procedure to
pg_wait_for_wal_replay_lsn(). This makes the name look consistent
with other WAL-related functions. Also it clearly states that we're
waiting for lsn to be replayed (not received, written or flushed).

Also, I did implements in the docs, commit message and some minor code fixes.

I'm continuing to look at this patch.

Sorry, I forgot the attachment.

------
Regards,
Alexander Korotkov

Attachments:

v12-0001-Implement-pg_wait_for_wal_replay_lsn-stored-proc.patchapplication/octet-stream; name=v12-0001-Implement-pg_wait_for_wal_replay_lsn-stored-proc.patchDownload
From 732109c42e2be45de154102e78c617904492c35c Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 28 Mar 2024 02:19:50 +0200
Subject: [PATCH v12] Implement pg_wait_for_wal_replay_lsn() stored procedure

pg_wait_for_wal_replay_lsn is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy
---
 doc/src/sgml/func.sgml                    |  96 +++++++
 src/backend/access/transam/xlogrecovery.c |  10 +
 src/backend/catalog/system_functions.sql  |   3 +
 src/backend/commands/Makefile             |   3 +-
 src/backend/commands/meson.build          |   1 +
 src/backend/commands/waitlsn.c            | 303 ++++++++++++++++++++++
 src/backend/storage/ipc/ipci.c            |   7 +
 src/backend/storage/lmgr/proc.c           |   6 +
 src/include/catalog/pg_proc.dat           |   5 +
 src/include/commands/waitlsn.h            |  43 +++
 src/test/recovery/meson.build             |   1 +
 src/test/recovery/t/043_wait_lsn.pl       |  77 ++++++
 src/tools/pgindent/typedefs.list          |   2 +
 13 files changed, 556 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wait_lsn.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8ecc02f2b90..69dc53f24c4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28227,6 +28227,102 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    There are also procedures to control the progress of recovery.
+    They are shown in <xref linkend="procedures-recovery-control-table"/>.
+    These procedures may be executed only during recovery.
+   </para>
+
+   <table id="procedures-recovery-control-table">
+    <title>Recovery Control Procedures</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wait_for_wal_replay_lsn</primary>
+        </indexterm>
+        <function>pg_wait_for_wal_replay_lsn</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>float8</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+        on standby within given timeout.  Parameter <parameter>timeout</parameter>
+        is the time in seconds to wait for the<parameter>target_lsn</parameter>  replay.
+        When <parameter>timeout</parameter> value equals to zero no timeout
+        is applied.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wait_for_wal_replay_lsn</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wait_for_wal_replay_lsn</function> to wait the <type>pg_lsn</type>
+    value.  For example, an application could update the
+    <literal>movie</literal> table and get the <acronym>lsn</acronym> of
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
+    could be set to <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wait_for_wal_replay_lsn</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wait_for_wal_replay_lsn('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wait_lsn('0/306EE20', 0.1);
+ERROR:  canceling waiting for LSN due to timeout
+   </programlisting>
+
+   </para>
+
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..509616c2749 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46d..6ca29c28fa1 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wait_for_wal_replay_lsn(target_lsn pg_lsn, timeout float8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wait_for_wal_replay_lsn';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..80ac9fc268c
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,303 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_lsn(wait_lsn pg_lsn, timeout int).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	uint32		i,
+				numWakeUpProcs;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Set latches for process, whose waited LSNs are already replayed.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		backend = GetPGProcByNumber(waitLSN->procInfos[i].procnum);
+		SetLatch(&backend->procLatch);
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch to wait till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, float8 timeout)
+{
+	XLogRecPtr	curLSN;
+	int			latch_events;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = TimestampTzPlusSeconds(GetCurrentTimestamp(), timeout);
+
+	latch_events = WL_TIMEOUT | WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (timeout > 0.0)
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+		else
+			/* If no timeout is set then wake up in 1 minute for interrupts */
+			delay_ms = 60000;
+
+		if (delay_ms <= 0)
+			break;
+
+		/*
+		 * If received an interruption from CHECK_FOR_INTERRUPTS, then delete
+		 * the current event from array.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_for_wal_replay_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	float8		timeout = PG_GETARG_FLOAT8(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2f7cfc02c6d..e611e21762b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12138,6 +12138,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN with timeout',
+  proname => 'pg_wait_for_wal_replay_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn float8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wait_for_wal_replay_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..8c1d8838348
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structures */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, float8 sec);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..bc47c93902c 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 00000000000..2b5b05b34ee
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,77 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wait_for_wal_replay_lsn() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 3;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wait_for_wal_replay_lsn() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby. 
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_for_wal_replay_lsn('${lsn1}', 1000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 0, "standby reached the same LSN as primary after pg_wait_for_wal_replay_lsn()");
+
+# Check that waiting for unreachable LSN triggers the timeout.
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres', "CALL pg_wait_for_wal_replay_lsn('${lsn1}', 0.01);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_for_wal_replay_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Check that new data is visible after calling pg_wait_for_wal_replay_lsn()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_for_wal_replay_lsn('${lsn3}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..f35ea6212b1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3053,6 +3053,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#56Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Korotkov (#55)

v12

Hi all,

I didn't review the patch but one thing jumped out: I don't think it's
OK to hold a spinlock while (1) looping over an array of backends and
(2) making system calls (SetLatch()).

#57Alexander Korotkov
aekorotkov@gmail.com
In reply to: Thomas Munro (#56)
1 attachment(s)

On Thu, Mar 28, 2024 at 9:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:

v12

Hi all,

I didn't review the patch but one thing jumped out: I don't think it's
OK to hold a spinlock while (1) looping over an array of backends and
(2) making system calls (SetLatch()).

Good catch, thank you.

Fixed along with other issues spotted by Alexander Lakhin.

------
Regards,
Alexander Korotkov

Attachments:

v13-0001-Implement-pg_wait_for_wal_replay_lsn-stored-proc.patchapplication/octet-stream; name=v13-0001-Implement-pg_wait_for_wal_replay_lsn-stored-proc.patchDownload
From 1af8810bc426a581ab144b943c4428feb991c25f Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 28 Mar 2024 02:19:50 +0200
Subject: [PATCH v13] Implement pg_wait_for_wal_replay_lsn() stored procedure

pg_wait_for_wal_replay_lsn is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy
---
 doc/src/sgml/func.sgml                        | 109 ++++++
 src/backend/access/transam/xlogrecovery.c     |  10 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 311 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   7 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/catalog/pg_proc.dat               |   5 +
 src/include/commands/waitlsn.h                |  43 +++
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wait_lsn.pl           |  82 +++++
 src/tools/pgindent/typedefs.list              |   2 +
 14 files changed, 583 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wait_lsn.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8ecc02f2b90..e94c0f44cac 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28227,6 +28227,115 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    There are also procedures to control the progress of recovery.
+    They are shown in <xref linkend="procedures-recovery-control-table"/>.
+    These procedures may be executed only during recovery.
+   </para>
+
+   <table id="procedures-recovery-control-table">
+    <title>Recovery Control Procedures</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wait_for_wal_replay_lsn</primary>
+        </indexterm>
+        <function>pg_wait_for_wal_replay_lsn</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>float8</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+        on standby within given timeout.  Parameter <parameter>timeout</parameter>
+        is the time in seconds to wait for the <parameter>target_lsn</parameter>
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wait_for_wal_replay_lsn</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wait_for_wal_replay_lsn</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
+    could be set to <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wait_for_wal_replay_lsn</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wait_for_wal_replay_lsn('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wait_for_wal_replay_lsn('0/306EE20', 0.1);
+ERROR:  canceling waiting for LSN due to timeout
+   </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wait_for_wal_replay_lsn</function> can't be used within
+     the transaction, another procedure or function.  Any of them holds a
+     snapshot, which could prevent the replay of WAL records.
+
+   <programlisting>
+postgres=# begin;
+BEGIN
+postgres=*# CALL pg_wait_for_wal_replay_lsn('0/306EE20');
+ERROR:  pg_wait_for_wal_replay_lsn() must be only called in non-atomic context
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..509616c2749 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46d..6ca29c28fa1 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wait_for_wal_replay_lsn(target_lsn pg_lsn, timeout float8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wait_for_wal_replay_lsn';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..c036b27e654
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,311 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wait_for_wal_replay_lsn(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	int			i;
+	int		   *wakeUpProcNums;
+	int			numWakeUpProcs;
+
+	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Remember processes, whose waited LSNs are already replayed.  We should
+	 * set their latches later after spinlock release.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		wakeUpProcNums[i] = waitLSN->procInfos[i].procnum;
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		backend = GetPGProcByNumber(wakeUpProcNums[i]);
+		SetLatch(&backend->procLatch);
+	}
+	pfree(wakeUpProcNums);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, float8 timeout)
+{
+	XLogRecPtr	curLSN;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = TimestampTzPlusSeconds(GetCurrentTimestamp(), timeout);
+
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		int			latch_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (timeout > 0.0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wait_for_wal_replay_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	float8		timeout = PG_GETARG_FLOAT8(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wait_for_wal_replay_lsn() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8d0571a03d1..79140d14116 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -79,6 +79,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2f7cfc02c6d..e611e21762b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12138,6 +12138,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN with timeout',
+  proname => 'pg_wait_for_wal_replay_lsn', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn float8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wait_for_wal_replay_lsn' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..8c1d8838348
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structures */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, float8 sec);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..bc47c93902c 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 00000000000..b77f6cb37cf
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,82 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wait_for_wal_replay_lsn() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 3;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wait_for_wal_replay_lsn() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_for_wal_replay_lsn('${lsn1}', 1000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok( $output >= 0,
+	"standby reached the same LSN as primary after pg_wait_for_wal_replay_lsn()"
+);
+
+# Check that waiting for unreachable LSN triggers the timeout.
+my $lsn2 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"CALL pg_wait_for_wal_replay_lsn('${lsn1}', 0.01);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wait_for_wal_replay_lsn('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Check that new data is visible after calling pg_wait_for_wal_replay_lsn()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wait_for_wal_replay_lsn('${lsn3}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..f35ea6212b1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3053,6 +3053,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#58Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#46)

On Fri, Mar 22, 2024 at 3:45 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

On 2024-03-20 12:11, Alexander Korotkov wrote:

On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

4.2 With an unreasonably high future LSN, BEGIN command waits
unboundedly, shouldn't we check if the specified LSN is more than
pg_last_wal_receive_lsn() error out?

I think limiting wait lsn by current received lsn would destroy the
whole value of this feature. The value is to wait till given LSN is
replayed, whether it's already received or not.

Ok sounds reasonable, I`ll rollback the changes.

But I don't see a problem here. On the replica, it's out of our
control to check which lsn is good and which is not. We can't check
whether the lsn, which is in future for the replica, is already issued
by primary.

For the case of wrong lsn, which could cause potentially infinite
wait, there is the timeout and the manual query cancel.

Fully agree with this take.

4.3 With an unreasonably high wait time, BEGIN command waits
unboundedly, shouldn't we restrict the wait time to some max

value,

say a day or so?
SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset
BEGIN AFTER :'future_receive_lsn' WITHIN 100000;

Good idea, I put it 1 day. But this limit we should to discuss.

Do you think that specifying timeout in milliseconds is suitable? I
would prefer to switch to seconds (with ability to specify fraction of
second). This was expressed before by Alexander Lakhin.

It sounds like an interesting idea. Please review the result.

https://github.com/macdice/redo-bench or similar tools?

Ivan, could you do this?

Yes, test redo-bench/crash-recovery.sh
This patch on master
91.327, 1.973
105.907, 3.338
98.412, 4.579
95.818, 4.19

REL_13-STABLE
116.645, 3.005
113.212, 2.568
117.644, 3.183
111.411, 2.782

master
124.712, 2.047
117.012, 1.736
116.328, 2.035
115.662, 1.797

Strange behavior, patched version is faster then REL_13-STABLE and
master.

I've run this test on my machine with v13 of the path.

patched
53.663, 0.466
53.884, 0.402
54.102, 0.441

master
55.216, 0.441
54.52, 0.464
51.479, 0.438

It seems that difference is less than variance.

------
Regards,
Alexander Korotkov

#59Euler Taveira
euler@eulerto.com
In reply to: Alexander Korotkov (#57)

On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:

Fixed along with other issues spotted by Alexander Lakhin.

[I didn't read the whole thread. I'm sorry if I missed something ...]

You renamed the function in a previous version but let me suggest another one:
pg_wal_replay_wait. It uses the same pattern as the other recovery control
functions [1]https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-RECOVERY-CONTROL. I think "for" doesn't add much for the function name and "lsn" is
used in functions that return an LSN (that's not the case here).

postgres=# \df pg_wal_replay*
List of functions
-[ RECORD 1 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_pause
Result data type | void
Argument data types |
Type | func
-[ RECORD 2 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_resume
Result data type | void
Argument data types |
Type | func

Regarding the arguments, I think the timeout should be bigint. There is at least
another function that implements a timeout that uses bigint.

postgres=# \df pg_terminate_backend
List of functions
-[ RECORD 1 ]-------+--------------------------------------
Schema | pg_catalog
Name | pg_terminate_backend
Result data type | boolean
Argument data types | pid integer, timeout bigint DEFAULT 0
Type | func

I also suggests that the timeout unit should be milliseconds, hence, using
bigint is perfectly fine for the timeout argument.

+       <para>
+        Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+        on standby within given timeout.  Parameter <parameter>timeout</parameter>
+        is the time in seconds to wait for the <parameter>target_lsn</parameter>
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.
+       </para></entry>

[1]: https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-RECOVERY-CONTROL

--
Euler Taveira
EDB https://www.enterprisedb.com/

#60Alexander Korotkov
aekorotkov@gmail.com
In reply to: Euler Taveira (#59)
1 attachment(s)

Hi, Euler!

On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira <euler@eulerto.com> wrote:

On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:

Fixed along with other issues spotted by Alexander Lakhin.

[I didn't read the whole thread. I'm sorry if I missed something ...]

You renamed the function in a previous version but let me suggest another one:
pg_wal_replay_wait. It uses the same pattern as the other recovery control
functions [1]. I think "for" doesn't add much for the function name and "lsn" is
used in functions that return an LSN (that's not the case here).

postgres=# \df pg_wal_replay*
List of functions
-[ RECORD 1 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_pause
Result data type | void
Argument data types |
Type | func
-[ RECORD 2 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_resume
Result data type | void
Argument data types |
Type | func

Makes sense to me. I tried to make a new procedure name consistent
with functions acquiring various WAL positions. But you're right,
it's better to be consistent with other functions controlling wal
replay.

Regarding the arguments, I think the timeout should be bigint. There is at least
another function that implements a timeout that uses bigint.

postgres=# \df pg_terminate_backend
List of functions
-[ RECORD 1 ]-------+--------------------------------------
Schema | pg_catalog
Name | pg_terminate_backend
Result data type | boolean
Argument data types | pid integer, timeout bigint DEFAULT 0
Type | func

I also suggests that the timeout unit should be milliseconds, hence, using
bigint is perfectly fine for the timeout argument.

+       <para>
+        Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+        on standby within given timeout.  Parameter <parameter>timeout</parameter>
+        is the time in seconds to wait for the <parameter>target_lsn</parameter>
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.
+       </para></entry>

This generally makes sense, but I'm not sure about this. The
milliseconds timeout was used initially but received critics in [1].

Links.
1. /messages/by-id/b45ff979-9d12-4828-a22a-e4cb327e115c@eisentraut.org

------
Regards,
Alexander Korotkov

Attachments:

v14-0001-Implement-pg_wal_replay_wait-stored-procedure.patchapplication/octet-stream; name=v14-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
From d598ae4897c96f84860fb722f5fc47d8db69f832 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 28 Mar 2024 02:19:50 +0200
Subject: [PATCH v14] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy
---
 doc/src/sgml/func.sgml                        | 109 ++++++
 src/backend/access/transam/xlogrecovery.c     |  10 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 311 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   7 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/catalog/pg_proc.dat               |   5 +
 src/include/commands/waitlsn.h                |  43 +++
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wait_lsn.pl           |  81 +++++
 src/tools/pgindent/typedefs.list              |   2 +
 14 files changed, 582 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wait_lsn.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 93b0bc2bc6e..f1e889cf30a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28260,6 +28260,115 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    There are also procedures to control the progress of recovery.
+    They are shown in <xref linkend="procedures-recovery-control-table"/>.
+    These procedures may be executed only during recovery.
+   </para>
+
+   <table id="procedures-recovery-control-table">
+    <title>Recovery Control Procedures</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>float8</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+        on standby within given timeout.  Parameter <parameter>timeout</parameter>
+        is the time in seconds to wait for the <parameter>target_lsn</parameter>
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
+    could be set to <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 0.1);
+ERROR:  canceling waiting for LSN due to timeout
+   </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can't be used within
+     the transaction, another procedure or function.  Any of them holds a
+     snapshot, which could prevent the replay of WAL records.
+
+   <programlisting>
+postgres=# begin;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+ERROR:  pg_wal_replay_wait() must be only called in non-atomic context
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..509616c2749 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46d..5656ce0f0fb 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout float8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..493d526ed8c
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,311 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	int			i;
+	int		   *wakeUpProcNums;
+	int			numWakeUpProcs;
+
+	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Remember processes, whose waited LSNs are already replayed.  We should
+	 * set their latches later after spinlock release.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		wakeUpProcNums[i] = waitLSN->procInfos[i].procnum;
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		backend = GetPGProcByNumber(wakeUpProcNums[i]);
+		SetLatch(&backend->procLatch);
+	}
+	pfree(wakeUpProcNums);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, float8 timeout)
+{
+	XLogRecPtr	curLSN;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = TimestampTzPlusSeconds(GetCurrentTimestamp(), timeout);
+
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		int			latch_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (timeout > 0.0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* If postmaster dies, finish immediately */
+		if (!PostmasterIsAlive())
+			break;
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	float8		timeout = PG_GETARG_FLOAT8(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8d0571a03d1..79140d14116 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -79,6 +79,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 07023ee61d9..e9c61d37aeb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12150,6 +12150,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN with timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn float8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..8c1d8838348
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structures */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, float8 sec);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..bc47c93902c 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 00000000000..d6c49446511
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,81 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 3;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# Check that waiting for unreachable LSN triggers the timeout.
+my $lsn2 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 1");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn1}', 0.01);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 1);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn3}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..f35ea6212b1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3053,6 +3053,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#61Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#60)

Hi, hackers!

On Fri, 29 Mar 2024 at 16:45, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Hi, Euler!

On Fri, Mar 29, 2024 at 1:38 AM Euler Taveira <euler@eulerto.com> wrote:

On Thu, Mar 28, 2024, at 9:39 AM, Alexander Korotkov wrote:

Fixed along with other issues spotted by Alexander Lakhin.

[I didn't read the whole thread. I'm sorry if I missed something ...]

You renamed the function in a previous version but let me suggest

another one:

pg_wal_replay_wait. It uses the same pattern as the other recovery

control

functions [1]. I think "for" doesn't add much for the function name and

"lsn" is

used in functions that return an LSN (that's not the case here).

postgres=# \df pg_wal_replay*
List of functions
-[ RECORD 1 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_pause
Result data type | void
Argument data types |
Type | func
-[ RECORD 2 ]-------+---------------------
Schema | pg_catalog
Name | pg_wal_replay_resume
Result data type | void
Argument data types |
Type | func

Makes sense to me. I tried to make a new procedure name consistent
with functions acquiring various WAL positions. But you're right,
it's better to be consistent with other functions controlling wal
replay.

Regarding the arguments, I think the timeout should be bigint. There is

at least

another function that implements a timeout that uses bigint.

postgres=# \df pg_terminate_backend
List of functions
-[ RECORD 1 ]-------+--------------------------------------
Schema | pg_catalog
Name | pg_terminate_backend
Result data type | boolean
Argument data types | pid integer, timeout bigint DEFAULT 0
Type | func

I also suggests that the timeout unit should be milliseconds, hence,

using

bigint is perfectly fine for the timeout argument.

+       <para>
+        Throws an ERROR if the target <acronym>lsn</acronym> was not

replayed

+ on standby within given timeout. Parameter

<parameter>timeout</parameter>

+ is the time in seconds to wait for the

<parameter>target_lsn</parameter>

+ replay. When <parameter>timeout</parameter> value equals to

zero no

+ timeout is applied.
+ </para></entry>

This generally makes sense, but I'm not sure about this. The
milliseconds timeout was used initially but received critics in [1].

I see in Postgres we already have different units for timeouts:

e.g in guc's:
wal_receiver_status_interval in seconds
tcp_keepalives_idle in seconds

commit_delay in microseconds

deadlock_timeout in milliseconds
max_standby_archive_delay in milliseconds
vacuum_cost_delay in milliseconds
autovacuum_vacuum_cost_delay in milliseconds
etc..

I haven't counted precisely, but I feel that milliseconds are the most
often used in both guc's and functions. So I'd propose using milliseconds
for the patch as it was proposed originally.

Regards,
Pavel Borisov
Supabase.

#62Euler Taveira
euler@eulerto.com
In reply to: Alexander Korotkov (#60)

On Fri, Mar 29, 2024, at 9:44 AM, Alexander Korotkov wrote:

This generally makes sense, but I'm not sure about this. The
milliseconds timeout was used initially but received critics in [1].

Alexander, I see why you changed the patch.

Peter suggested to use an interval but you proposed another data type:
float. The advantage of the interval data type is that you don't need to
carefully think about the unit, however, if you use the integer data
type you have to propose one. (If that's the case, milliseconds is a
good granularity for this feature.) I don't have a strong preference
between integer and interval data types but I don't like the float for
this case. The 2 main reasons are (a) that we treat time units (hours,
minutes, seconds, ...) as integers so it seems natural for a human being
to use a unit time as integer and (b) depending on the number of digits
after the decimal separator you still don't have an integer in the
internal unit, hence, you have to round it to integer.

We already have functions that use integer (such as pg_terminate_backend)
and interval (such as pg_sleep_for) and if i searched correctly it will
be the first timeout argument as float.

--
Euler Taveira
EDB https://www.enterprisedb.com/

#63Alexander Korotkov
aekorotkov@gmail.com
In reply to: Euler Taveira (#62)
1 attachment(s)

On Fri, Mar 29, 2024 at 6:50 PM Euler Taveira <euler@eulerto.com> wrote:

On Fri, Mar 29, 2024, at 9:44 AM, Alexander Korotkov wrote:

This generally makes sense, but I'm not sure about this. The
milliseconds timeout was used initially but received critics in [1].

Alexander, I see why you changed the patch.

Peter suggested to use an interval but you proposed another data type:
float. The advantage of the interval data type is that you don't need to
carefully think about the unit, however, if you use the integer data
type you have to propose one. (If that's the case, milliseconds is a
good granularity for this feature.) I don't have a strong preference
between integer and interval data types but I don't like the float for
this case. The 2 main reasons are (a) that we treat time units (hours,
minutes, seconds, ...) as integers so it seems natural for a human being
to use a unit time as integer and (b) depending on the number of digits
after the decimal separator you still don't have an integer in the
internal unit, hence, you have to round it to integer.

We already have functions that use integer (such as pg_terminate_backend)
and interval (such as pg_sleep_for) and if i searched correctly it will
be the first timeout argument as float.

Thank you for the detailed explanation. Float seconds are used in
pg_sleep() just similar to the interval in pg_sleep_for(). However,
that's a delay, not exactly a timeout. Given the precedent of
milliseconds timeout in pg_terminate_backend(), your and Pavel's
points, I've switched back to integer milliseconds timeout.

Some fixes spotted off-list by Alexander Lakhin.
1) We don't need an explicit check for the postmaster being alive as
soon as we pass WL_EXIT_ON_PM_DEATH to WaitLatch().
2) When testing for unreachable LSN, we need to select LSN well in
advance so that autovacuum couldn't affect that.

I'm going to push this if no objections.

------
Regards,
Alexander Korotkov

Attachments:

v15-0001-Implement-pg_wal_replay_wait-stored-procedure.patchapplication/octet-stream; name=v15-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
From c165d067c3fb42153f9cd3ac48a39559550e6090 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 28 Mar 2024 02:19:50 +0200
Subject: [PATCH v15] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
---
 doc/src/sgml/func.sgml                        | 110 +++++++
 src/backend/access/transam/xlogrecovery.c     |  10 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 307 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   7 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/catalog/pg_proc.dat               |   5 +
 src/include/commands/waitlsn.h                |  43 +++
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wait_lsn.pl           |  83 +++++
 src/tools/pgindent/typedefs.list              |   2 +
 14 files changed, 581 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wait_lsn.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 93b0bc2bc6e..e5ee7ef2e87 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28260,6 +28260,116 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    There are also procedures to control the progress of recovery.
+    They are shown in <xref linkend="procedures-recovery-control-table"/>.
+    These procedures may be executed only during recovery.
+   </para>
+
+   <table id="procedures-recovery-control-table">
+    <title>Recovery Control Procedures</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+        on standby within given timeout.  Parameter
+        <parameter>timeout</parameter> is the time in milliseconds to wait
+        for the <parameter>target_lsn</parameter>
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
+    could be set to <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  canceling waiting for LSN due to timeout
+   </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can't be used within
+     the transaction, another procedure or function.  Any of them holds a
+     snapshot, which could prevent the replay of WAL records.
+
+   <programlisting>
+postgres=# begin;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+ERROR:  pg_wal_replay_wait() must be only called in non-atomic context
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..509616c2749 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46d..a79bb80c951 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..093fcb4f84e
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,307 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set all latches in shared memory to signal that new LSN has been replayed
+*/
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	int			i;
+	int		   *wakeUpProcNums;
+	int			numWakeUpProcs;
+
+	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Remember processes, whose waited LSNs are already replayed.  We should
+	 * set their latches later after spinlock release.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		wakeUpProcNums[i] = waitLSN->procInfos[i].procnum;
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		backend = GetPGProcByNumber(wakeUpProcNums[i]);
+		SetLatch(&backend->procLatch);
+	}
+	pfree(wakeUpProcNums);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr lsn, int64 timeout)
+{
+	XLogRecPtr	curLSN;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+	addLSNWaiter(lsn);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		int			latch_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (lsn <= curLSN)
+			break;
+
+		if (timeout > 0.0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (lsn > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8d0571a03d1..79140d14116 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -79,6 +79,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 07023ee61d9..91533b30b87 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12150,6 +12150,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN with timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..368d9a8883c
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structures */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr lsn, int64 timeout);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..bc47c93902c 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wait_lsn.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wait_lsn.pl b/src/test/recovery/t/043_wait_lsn.pl
new file mode 100644
index 00000000000..ed67f6cdb37
--- /dev/null
+++ b/src/test/recovery/t/043_wait_lsn.pl
@@ -0,0 +1,83 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 3;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn2 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn1}', 10);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn3 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn3}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8d7bed411f..f3f6dae1900 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3054,6 +3054,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#64Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#63)

Thank you Alexander for working on patch, may be we should change some
names:
1) test 043_wait_lsn.pl -> to 043_waitlsn.pl like waitlsn.c and
waitlsn.h

In waitlsn.c and waitlsn.h variables:
2) targret_lsn -> trgLSN like curLSN

3) lsn -> trgLSN like curLSN

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

#65Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#64)
1 attachment(s)

Hi!

On Sat, Mar 30, 2024 at 6:14 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Thank you Alexander for working on patch, may be we should change some
names:
1) test 043_wait_lsn.pl -> to 043_waitlsn.pl like waitlsn.c and
waitlsn.h

I renamed that to 043_wal_replay_wait.pl to match the name of SQL procedure.

In waitlsn.c and waitlsn.h variables:
2) targret_lsn -> trgLSN like curLSN

I prefer this to match the SQL procedure parameter name.

3) lsn -> trgLSN like curLSN

Done.

Also I implemented termination of wal replay waits on standby
promotion (with test).

In the test I change recovery_min_apply_delay to 1s in order to make
the test pass faster.

------
Regards,
Alexander Korotkov

Attachments:

v16-0001-Implement-pg_wal_replay_wait-stored-procedure.patchapplication/octet-stream; name=v16-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
From 33fa509b8685f52ff88c5a0b0bc2b56dcc4b19dd Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 28 Mar 2024 02:19:50 +0200
Subject: [PATCH v16] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
---
 doc/src/sgml/func.sgml                        | 110 ++++++
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  10 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 320 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   7 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/catalog/pg_proc.dat               |   5 +
 src/include/commands/waitlsn.h                |  43 +++
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wal_replay_wait.pl    |  97 ++++++
 src/tools/pgindent/typedefs.list              |   2 +
 15 files changed, 615 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wal_replay_wait.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b694b2883c3..aba94741c6c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28284,6 +28284,116 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    There are also procedures to control the progress of recovery.
+    They are shown in <xref linkend="procedures-recovery-control-table"/>.
+    These procedures may be executed only during recovery.
+   </para>
+
+   <table id="procedures-recovery-control-table">
+    <title>Recovery Control Procedures</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Throws an ERROR if the target <acronym>lsn</acronym> was not replayed
+        on standby within given timeout.  Parameter
+        <parameter>timeout</parameter> is the time in milliseconds to wait
+        for the <parameter>target_lsn</parameter>
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    to get the <acronym>lsn</acronym> given that <varname>synchronous_commit</varname>
+    could be set to <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  canceling waiting for LSN due to timeout
+   </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can't be used within
+     the transaction, another procedure or function.  Any of them holds a
+     snapshot, which could prevent the replay of WAL records.
+
+   <programlisting>
+postgres=# begin;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+ERROR:  pg_wal_replay_wait() must be only called in non-atomic context
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 20a5f862090..1446639ea09 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "commands/waitlsn.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
 #include "executor/instrument.h"
@@ -6040,6 +6041,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before achieving the target LSN.
+	 */
+	WaitLSNSetLatches(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..0b3d586ccda 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,15 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk over
+			 * the shared memory array and set latches to notify the waiters.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46d..a79bb80c951 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..cf372dabb15
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,320 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
+ * LSN waiters when InvalidXLogRecPtr is given.
+ */
+void
+WaitLSNSetLatches(XLogRecPtr curLSN)
+{
+	int			i;
+	int		   *wakeUpProcNums;
+	int			numWakeUpProcs;
+
+	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Remember processes, whose waited LSNs are already replayed.  We should
+	 * set their latches later after spinlock release.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (!XLogRecPtrIsInvalid(curLSN) &&
+			waitLSN->procInfos[i].waitLSN > curLSN)
+			break;
+
+		wakeUpProcNums[i] = waitLSN->procInfos[i].procnum;
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		backend = GetPGProcByNumber(wakeUpProcNums[i]);
+		SetLatch(&backend->procLatch);
+	}
+	pfree(wakeUpProcNums);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr trgLSN, int64 timeout)
+{
+	XLogRecPtr	curLSN;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* If target LSN is already replayed, exit immediately */
+	if (trgLSN <= GetXLogReplayRecPtr(NULL))
+		return;
+
+	endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+	addLSNWaiter(trgLSN);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		int			latch_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		curLSN = GetXLogReplayRecPtr(NULL);
+		if (trgLSN <= curLSN)
+			break;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("recovery is not in progress"),
+					errhint("Recovery ended before replaying the target LSN.")));
+
+		if (timeout > 0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (trgLSN > curLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("canceling waiting for LSN due to timeout")));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called in non-atomic context")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8d0571a03d1..79140d14116 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -79,6 +79,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 134e3b22fd8..153d816a053 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12153,6 +12153,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN with timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..1da9c9b0556
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structures */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr trgLSN, int64 timeout);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr curLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..712924c2fad 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wal_replay_wait.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
new file mode 100644
index 00000000000..e93846425a9
--- /dev/null
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -0,0 +1,97 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+# Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 10);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for unreachable LSN then promote.  Check the log for the relevant
+# error message.
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn3}');
+]);
+
+my $log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress',
+	$log_offset);
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8d7bed411f..f3f6dae1900 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3054,6 +3054,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#66Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Alexander Korotkov (#65)

On Sun, Mar 31, 2024 at 7:41 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi!

Thanks for the patch. I have a few comments on the v16 patch.

1. Is there any specific reason for pg_wal_replay_wait() being a
procedure rather than a function? I haven't read the full thread, but
I vaguely noticed the discussion on the new wait mechanism holding up
a snapshot or some other resource. Is that the reason to use a stored
procedure over a function? If yes, can we specify it somewhere in the
commit message and just before the procedure definition in
system_functions.sql?

2. Is the pg_wal_replay_wait first procedure that postgres provides
out of the box?

3. Defining a procedure for the first time in system_functions.sql
which is supposed to be for functions seems a bit unusual to me.

4.
+
+    endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+        if (timeout > 0)
+        {
+            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+            latch_events |= WL_TIMEOUT;
+            if (delay_ms <= 0)
+                break;
+        }

Why is endtime calculated even for timeout <= 0 only to just skip it
later? Can't we just do a fastpath exit if timeout = 0 and targetLSN <

5.
 Parameter
+        <parameter>timeout</parameter> is the time in milliseconds to wait
+        for the <parameter>target_lsn</parameter>
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.

It turns out to be "When timeout value equals to zero no timeout is
applied." I guess, we can just say something like the following which
I picked up from pg_terminate_backend timeout parameter description.

<para>
If <parameter>timeout</parameter> is not specified or zero, this
function returns if the WAL upto
<literal>target_lsn</literal> is replayed.
If the <parameter>timeout</parameter> is specified (in
milliseconds) and greater than zero, the function waits until the
server actually replays the WAL upto <literal>target_lsn</literal> or
until the given time has passed. On timeout, an error is emitted.
</para></entry>

6.
+        ereport(ERROR,
+                (errcode(ERRCODE_QUERY_CANCELED),
+                 errmsg("canceling waiting for LSN due to timeout")));

We can be a bit more informative here and say targetLSN and currentLSN
something like - "timed out while waiting for target LSN %X/%X to be
replayed; current LSN %X/%X"?

7.
+    if (context->atomic)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("pg_wal_replay_wait() must be only called in
non-atomic context")));
+

Can we say what a "non-atomic context '' is in a user understandable
way like explicit txn or whatever that might be? "non-atomic context'
might not sound great to the end -user.

8.
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses
<function>pg_current_wal_insert_lsn</function>
+    to get the <acronym>lsn</acronym> given that
<varname>synchronous_commit</varname>
+    could be set to <literal>off</literal>.

Can we just mention that run pg_current_wal_insert_lsn on the primary?

9. To me the following query blocks even though I didn't mention timeout.
CALL pg_wal_replay_wait('0/fffffff');

10. Can't we do some input validation on the timeout parameter and
emit an error for negative values just like pg_terminate_backend?
CALL pg_wal_replay_wait('0/ffffff', -100);

11.
+
+        if (timeout > 0)
+        {
+            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+            latch_events |= WL_TIMEOUT;
+            if (delay_ms <= 0)
+                break;
+        }
+

Can we avoid calling GetCurrentTimestamp in a for loop which can be
costly at times especially when pg_wal_replay_wait is called with
larger timeouts on multiple backends? Can't we reuse
pg_terminate_backend's timeout logic in
pg_wait_until_termination, perhaps reducing waittime to 1msec or so?

12. Why should we let every user run pg_wal_replay_wait procedure?
Can't we revoke execute from the public in system_functions.sql so
that one can decide who to run this function? Per comment #11, one can
easily cause a lot of activity by running this function on hundreds of
sessions.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#67Alexander Korotkov
aekorotkov@gmail.com
In reply to: Bharath Rupireddy (#66)

Hi Bharath,

Thank you for your feedback.

On Sun, Mar 31, 2024 at 8:44 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sun, Mar 31, 2024 at 7:41 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
Thanks for the patch. I have a few comments on the v16 patch.

1. Is there any specific reason for pg_wal_replay_wait() being a
procedure rather than a function? I haven't read the full thread, but
I vaguely noticed the discussion on the new wait mechanism holding up
a snapshot or some other resource. Is that the reason to use a stored
procedure over a function? If yes, can we specify it somewhere in the
commit message and just before the procedure definition in
system_functions.sql?

Surely, there is a reason. Function should be executed in a snapshot,
which can prevent WAL records from being replayed. See [1] for a
particular test scenario. In a procedure we may enforce non-atomic
context and release the snapshot.

I've mentioned that in the commit message and in the procedure code.
I don't think system_functions.sql is the place for this type of
comment. We only use system_functions.sql to push the default values.

2. Is the pg_wal_replay_wait first procedure that postgres provides
out of the box?

Yes, it appears first. I see nothing wrong about that.

3. Defining a procedure for the first time in system_functions.sql
which is supposed to be for functions seems a bit unusual to me.

From the scope of DDL and system catalogue procedure is just another
kind of function (prokind == 'p'). So, I don't feel wrong about that.

4.
+
+    endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+        if (timeout > 0)
+        {
+            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+            latch_events |= WL_TIMEOUT;
+            if (delay_ms <= 0)
+                break;
+        }

Why is endtime calculated even for timeout <= 0 only to just skip it
later? Can't we just do a fastpath exit if timeout = 0 and targetLSN <

OK, fixed.

5.
Parameter
+        <parameter>timeout</parameter> is the time in milliseconds to wait
+        for the <parameter>target_lsn</parameter>
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.
+        replay. When <parameter>timeout</parameter> value equals to zero no
+        timeout is applied.

It turns out to be "When timeout value equals to zero no timeout is
applied." I guess, we can just say something like the following which
I picked up from pg_terminate_backend timeout parameter description.

<para>
If <parameter>timeout</parameter> is not specified or zero, this
function returns if the WAL upto
<literal>target_lsn</literal> is replayed.
If the <parameter>timeout</parameter> is specified (in
milliseconds) and greater than zero, the function waits until the
server actually replays the WAL upto <literal>target_lsn</literal> or
until the given time has passed. On timeout, an error is emitted.
</para></entry>

Applied as you suggested with some edits from me.

6.
+        ereport(ERROR,
+                (errcode(ERRCODE_QUERY_CANCELED),
+                 errmsg("canceling waiting for LSN due to timeout")));

We can be a bit more informative here and say targetLSN and currentLSN
something like - "timed out while waiting for target LSN %X/%X to be
replayed; current LSN %X/%X"?

Done this way. Adjusted other ereport()'s as well.

7.
+    if (context->atomic)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("pg_wal_replay_wait() must be only called in
non-atomic context")));
+

Can we say what a "non-atomic context '' is in a user understandable
way like explicit txn or whatever that might be? "non-atomic context'
might not sound great to the end -user.

Added errdetail() to this ereport().

8.
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses
<function>pg_current_wal_insert_lsn</function>
+    to get the <acronym>lsn</acronym> given that
<varname>synchronous_commit</varname>
+    could be set to <literal>off</literal>.

Can we just mention that run pg_current_wal_insert_lsn on the primary?

The mention is added.

9. To me the following query blocks even though I didn't mention timeout.
CALL pg_wal_replay_wait('0/fffffff');

If your primary server is freshly initialized, you need to do quite
data modifications to reach this LSN.

10. Can't we do some input validation on the timeout parameter and
emit an error for negative values just like pg_terminate_backend?
CALL pg_wal_replay_wait('0/ffffff', -100);

Reasonable, added.

11.
+
+        if (timeout > 0)
+        {
+            delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+            latch_events |= WL_TIMEOUT;
+            if (delay_ms <= 0)
+                break;
+        }
+

Can we avoid calling GetCurrentTimestamp in a for loop which can be
costly at times especially when pg_wal_replay_wait is called with
larger timeouts on multiple backends? Can't we reuse
pg_terminate_backend's timeout logic in
pg_wait_until_termination, perhaps reducing waittime to 1msec or so?

Normally there shouldn't be many loops. It only happens on spurious
wakeups. For instance, some process was going to set our latch before
and for another reason, but due to kernel scheduling it does only now.
So, normally there is only one wakeup. pg_wait_until_termination()
may sacrifice timeout accuracy due to possible spurious wakeups and
time spent outside of WaitLatch(). I don't feel reasonable to repeat
this login in WaitForLSN() especially given that we don't need
frequent wakeups here.

12. Why should we let every user run pg_wal_replay_wait procedure?
Can't we revoke execute from the public in system_functions.sql so
that one can decide who to run this function? Per comment #11, one can
easily cause a lot of activity by running this function on hundreds of
sessions.

Generally, if a user can make many connections, then this user can
make them busy and can consume resources. Given my explanation above,
pg_wal_replay_wait() even wouldn't make the connection busy, it would
just wait on the latch. I don't see why pg_wal_replay_wait() could do
more harm than pg_sleep(). So, I would leave pg_wal_replay_wait()
public.

Links.
1. /messages/by-id/CAPpHfdtiGgn0iS1KbW2HTam-1+oK+vhXZDAcnX9hKaA7Oe=F-A@mail.gmail.com

------
Regards,
Alexander Korotkov

#68Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Alexander Korotkov (#67)

On Mon, Apr 1, 2024 at 5:54 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

9. To me the following query blocks even though I didn't mention timeout.
CALL pg_wal_replay_wait('0/fffffff');

If your primary server is freshly initialized, you need to do quite
data modifications to reach this LSN.

Right, but why pg_wal_replay_wait blocks without a timeout? It must
return an error saying it can't reach the target LSN, no?

Did you forget to attach the new patch?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#69Alexander Korotkov
aekorotkov@gmail.com
In reply to: Bharath Rupireddy (#68)
1 attachment(s)

On Mon, Apr 1, 2024 at 5:25 AM Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Apr 1, 2024 at 5:54 AM Alexander Korotkov <aekorotkov@gmail.com>
wrote:

9. To me the following query blocks even though I didn't mention

timeout.

CALL pg_wal_replay_wait('0/fffffff');

If your primary server is freshly initialized, you need to do quite
data modifications to reach this LSN.

Right, but why pg_wal_replay_wait blocks without a timeout? It must
return an error saying it can't reach the target LSN, no?

How can replica know this? It doesn't look feasible to distinguish this
situation from the situation when connection between primary and replica
became slow. I'd keep this simple. We have pg_sleep_for() which waits for
the specified time whatever it is. And we have pg_wal_replay_wait() waits
till replay lsn grows to the target whatever it is. It's up to the user to
specify the correct value.

Did you forget to attach the new patch?

Yes, here it is.

------
Regards,
Alexander Korotkov

Attachments:

v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patchapplication/octet-stream; name=v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
From faee69adf4e8882cec86aaedf87bd81da421a467 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 28 Mar 2024 02:19:50 +0200
Subject: [PATCH v17] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working in a non-atomic context,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
---
 doc/src/sgml/func.sgml                        | 113 ++++++
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 347 ++++++++++++++++++
 src/backend/storage/ipc/ipci.c                |   7 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/catalog/pg_proc.dat               |   5 +
 src/include/commands/waitlsn.h                |  43 +++
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wal_replay_wait.pl    |  97 +++++
 src/tools/pgindent/typedefs.list              |   2 +
 15 files changed, 646 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wal_replay_wait.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b694b2883c3..192959ebc11 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28284,6 +28284,119 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    There are also procedures to control the progress of recovery.
+    They are shown in <xref linkend="procedures-recovery-control-table"/>.
+    These procedures may be executed only during recovery.
+   </para>
+
+   <table id="procedures-recovery-control-table">
+    <title>Recovery Control Procedures</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        If <parameter>timeout</parameter> is not specified or zero, this
+        procedure returns once WAL is replayed upto
+        <literal>target_lsn</literal>.
+        If the <parameter>timeout</parameter> is specified (in
+        milliseconds) and greater than zero, the procedure waits until the
+        server actually replays the WAL upto <literal>target_lsn</literal> or
+        until the given time has passed. On timeout, an error is emitted.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+    </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can't be used within
+     the transaction, another procedure or function.  Any of them holds a
+     snapshot, which could prevent the replay of WAL records.
+
+   <programlisting>
+postgres=# BEGIN;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+ERROR:  pg_wal_replay_wait() must be only called in non-atomic context
+DETAIL:  Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 20a5f862090..1446639ea09 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "commands/waitlsn.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
 #include "executor/instrument.h"
@@ -6040,6 +6041,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before achieving the target LSN.
+	 */
+	WaitLSNSetLatches(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec0847..24ab1b2b213 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index fe2bb50f46d..a79bb80c951 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..2a66904b989
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,347 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "fmgr.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_type.h"
+#include "commands/waitlsn.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "utils/fmgrprotos.h"
+
+/* Add to / delete from shared memory array */
+static void addLSNWaiter(XLogRecPtr lsn);
+static void deleteLSNWaiter(void);
+
+struct WaitLSNState *waitLSN = NULL;
+static volatile sig_atomic_t haveShmemItem = false;
+
+/*
+ * Report the amount of shared memory space needed for WaitLSNState
+ */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		SpinLockInit(&waitLSN->mutex);
+		waitLSN->numWaitedProcs = 0;
+		pg_atomic_init_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+	}
+}
+
+/*
+ * Add the information about the LSN waiter backend to the shared memory
+ * array.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo cur;
+	int			i;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	cur.procnum = MyProcNumber;
+	cur.waitLSN = lsn;
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].waitLSN >= cur.waitLSN)
+		{
+			WaitLSNProcInfo tmp;
+
+			tmp = waitLSN->procInfos[i];
+			waitLSN->procInfos[i] = cur;
+			cur = tmp;
+		}
+	}
+	waitLSN->procInfos[i] = cur;
+	waitLSN->numWaitedProcs++;
+
+	pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Delete the information about the LSN waiter backend from the shared memory
+ * array.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	int			i;
+	bool		found = false;
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (waitLSN->procInfos[i].procnum == MyProcNumber)
+			found = true;
+
+		if (found && i < waitLSN->numWaitedProcs - 1)
+		{
+			waitLSN->procInfos[i] = waitLSN->procInfos[i + 1];
+		}
+	}
+
+	if (!found)
+	{
+		SpinLockRelease(&waitLSN->mutex);
+		return;
+	}
+	waitLSN->numWaitedProcs--;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+}
+
+/*
+ * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
+ * LSN waiters when InvalidXLogRecPtr is given.
+ */
+void
+WaitLSNSetLatches(XLogRecPtr currentLSN)
+{
+	int			i;
+	int		   *wakeUpProcNums;
+	int			numWakeUpProcs;
+
+	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
+
+	SpinLockAcquire(&waitLSN->mutex);
+
+	/*
+	 * Remember processes, whose waited LSNs are already replayed.  We should
+	 * set their latches later after spinlock release.
+	 */
+	for (i = 0; i < waitLSN->numWaitedProcs; i++)
+	{
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			waitLSN->procInfos[i].waitLSN > currentLSN)
+			break;
+
+		wakeUpProcNums[i] = waitLSN->procInfos[i].procnum;
+	}
+
+	/*
+	 * Immediately remove those processes from the shmem array.  Otherwise,
+	 * shmem array items will be here till corresponding processes wake up and
+	 * delete themselves.
+	 */
+	numWakeUpProcs = i;
+	for (i = 0; i < waitLSN->numWaitedProcs - numWakeUpProcs; i++)
+		waitLSN->procInfos[i] = waitLSN->procInfos[i + numWakeUpProcs];
+	waitLSN->numWaitedProcs -= numWakeUpProcs;
+
+	if (waitLSN->numWaitedProcs != 0)
+		pg_atomic_write_u64(&waitLSN->minLSN, waitLSN->procInfos[i].waitLSN);
+	else
+		pg_atomic_write_u64(&waitLSN->minLSN, PG_UINT64_MAX);
+
+	SpinLockRelease(&waitLSN->mutex);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		backend = GetPGProcByNumber(wakeUpProcNums[i]);
+		SetLatch(&backend->procLatch);
+	}
+	pfree(wakeUpProcNums);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	if (haveShmemItem)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* If target LSN is already replayed, exit immediately */
+	if (targetLSN <= GetXLogReplayRecPtr(NULL))
+		return;
+
+	if (timeout > 0)
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+	addLSNWaiter(targetLSN);
+	haveShmemItem = true;
+
+	for (;;)
+	{
+		int			rc;
+		int			latch_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		currentLSN = GetXLogReplayRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is not in progress"),
+					 errdetail("Recovery ended before replaying the target LSN %X/%X; last replay LSN %X/%X.",
+							   LSN_FORMAT_ARGS(targetLSN),
+							   LSN_FORMAT_ARGS(currentLSN))));
+
+		if (timeout > 0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	if (targetLSN > currentLSN)
+	{
+		deleteLSNWaiter();
+		haveShmemItem = false;
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+						LSN_FORMAT_ARGS(targetLSN),
+						LSN_FORMAT_ARGS(currentLSN))));
+	}
+	else
+	{
+		haveShmemItem = false;
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * pg_wal_replay_wait() is a procedure, not a function.
+	 *
+	 * At first, we check that pg_wal_replay_wait() is called in a non-atomic
+	 * context.  That is, a procedure call isn't wrapped into a transaction,
+	 * another procedure call, or a function call.
+	 *
+	 * Secondly, according to PlannedStmtRequiresSnapshot(), even in an atomic
+	 * context, CallStmt is processed with a snapshot.  Thankfully, we can pop
+	 * this snapshot, because PortalRunUtility() can tolerate this.
+	 */
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called in non-atomic context"),
+				 errdetail("Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..5aed90c9355 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -244,6 +246,11 @@ CreateSharedMemoryAndSemaphores(void)
 	/* Initialize subsystems */
 	CreateOrAttachShmemStructs();
 
+	/*
+	 * Init array of Latches in shared memory for wait lsn
+	 */
+	WaitLSNShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 162b1f919db..4b830dc3c85 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8d0571a03d1..79140d14116 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -79,6 +79,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 134e3b22fd8..153d816a053 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12153,6 +12153,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN with timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..10ef63f0c09
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,43 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/* Shared memory structures */
+typedef struct WaitLSNProcInfo
+{
+	int			procnum;
+	XLogRecPtr	waitLSN;
+} WaitLSNProcInfo;
+
+typedef struct WaitLSNState
+{
+	pg_atomic_uint64 minLSN;
+	slock_t		mutex;
+	int			numWaitedProcs;
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr targetLSN, int64 timeout);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..712924c2fad 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wal_replay_wait.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
new file mode 100644
index 00000000000..9d2f91f4b28
--- /dev/null
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -0,0 +1,97 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby->start;
+
+
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+# Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 10);");
+$node_standby->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /canceling waiting for LSN due to timeout/,
+	"get timeout on waiting for unreachable LSN");
+
+# Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for unreachable LSN then promote.  Check the log for the relevant
+# error message.
+my $psql_session = $node_standby->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn3}');
+]);
+
+my $log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+
+$node_standby->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8d7bed411f..f3f6dae1900 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3054,6 +3054,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#70Andy Fan
zhihuifan1213@163.com
In reply to: Alexander Korotkov (#69)

Alexander Korotkov <aekorotkov@gmail.com> writes:

Hello,

Did you forget to attach the new patch?

Yes, here it is.

------
Regards,
Alexander Korotkov

[4. text/x-diff; v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...

+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>

Should we return the millseconds of waiting time? I think this
information may be useful for customer if they want to know how long
time it waits for for minitor purpose.

--
Best Regards
Andy Fan

#71Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andy Fan (#70)

Hi, Andy!

On Tue, Apr 2, 2024 at 6:29 AM Andy Fan <zhihuifan1213@163.com> wrote:

Did you forget to attach the new patch?

Yes, here it is.

[4. text/x-diff; v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...

+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>

Should we return the millseconds of waiting time? I think this
information may be useful for customer if they want to know how long
time it waits for for minitor purpose.

Please, check it more carefully. In v17 timeout is in integer milliseconds.

------
Regards,
Alexander Korotkov

#72Andy Fan
zhihuifan1213@163.com
In reply to: Alexander Korotkov (#71)

Hi Alexander!

+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>

Should we return the millseconds of waiting time? I think this
information may be useful for customer if they want to know how long
time it waits for for minitor purpose.

Please, check it more carefully. In v17 timeout is in integer milliseconds.

I guess one of us misunderstand the other:( and I do didn't check the
code very carefully.

Acutally I meant the "return value" rather than function argument. IIUC
the current return value is void per below statement.

+ <returnvalue>void</returnvalue>

If so, when users call pg_wal_replay_wait, they can get informed when
the wal is replaied to the target_lsn, but they can't know how long time
it waits unless they check it in application side, I think such
information will be useful for monitor purpose sometimes.

select pg_wal_replay_wait(lsn, 1000); may just take 1ms in fact, in
this case, I want this function return 1.

--
Best Regards
Andy Fan

#73Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Andy Fan (#72)

On 2024-04-02 11:14, Andy Fan wrote:

If so, when users call pg_wal_replay_wait, they can get informed when
the wal is replaied to the target_lsn, but they can't know how long
time
it waits unless they check it in application side, I think such
information will be useful for monitor purpose sometimes.

select pg_wal_replay_wait(lsn, 1000); may just take 1ms in fact, in
this case, I want this function return 1.

Hi Andy, to get timing we can use \time in psql.
Here is an example.
postgres=# \timing
Timing is on.
postgres=# select 1;
?column?
----------
1
(1 row)

Time: 0.536 ms

<returnvalue>void</returnvalue>

And returning VOID is the best option, rather than returning TRUE|FALSE
or timing. It left the logic of the procedure very simple, we get an
error if LSN is not reached.

8 years, we tried to add this feature, and now we suggest the best way
for this feature is to commit the minimal version first.

Let's discuss further improvements in future versions.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

#74Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kartyshov Ivan (#73)

On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

8 years, we tried to add this feature, and now we suggest the best way
for this feature is to commit the minimal version first.

Just curious, do you or anyone else have an immediate use for this
function? If yes, how are they achieving read-after-write-consistency
on streaming standbys in their application right now without a
function like this?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#75Andy Fan
zhihuifan1213@163.com
In reply to: Bharath Rupireddy (#74)

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

8 years, we tried to add this feature, and now we suggest the best way
for this feature is to commit the minimal version first.

Just curious, do you or anyone else have an immediate use for this
function? If yes, how are they achieving read-after-write-consistency
on streaming standbys in their application right now without a
function like this?

The link [1]/messages/by-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC+A@mail.gmail.com may be helpful and I think the reason there is reasonable
to me.

Actually we also disucss how to make sure the "read your writes
consistency" internally, and the soluation here looks good to me.

Glad to know that this patch will be committed very soon.

[1]: /messages/by-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC+A@mail.gmail.com
/messages/by-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC+A@mail.gmail.com

--
Best Regards
Andy Fan

#76Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#73)

On Tue, Apr 2, 2024 at 1:11 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

On 2024-04-02 11:14, Andy Fan wrote:

If so, when users call pg_wal_replay_wait, they can get informed when
the wal is replaied to the target_lsn, but they can't know how long
time
it waits unless they check it in application side, I think such
information will be useful for monitor purpose sometimes.

select pg_wal_replay_wait(lsn, 1000); may just take 1ms in fact, in
this case, I want this function return 1.

Hi Andy, to get timing we can use \time in psql.
Here is an example.
postgres=# \timing
Timing is on.
postgres=# select 1;
?column?
----------
1
(1 row)

Time: 0.536 ms

<returnvalue>void</returnvalue>

And returning VOID is the best option, rather than returning TRUE|FALSE
or timing. It left the logic of the procedure very simple, we get an
error if LSN is not reached.

8 years, we tried to add this feature, and now we suggest the best way
for this feature is to commit the minimal version first.

Let's discuss further improvements in future versions.

+1,
It seems there was not yet a precedent of builtin PostgreSQL function
returning its duration. And I don't think we need to introduce such
precedent, at least now. This seems like we're placing the
responsibility on monitoring resources usage to an application.

I'd also like to note that pg_wal_replay_wait() comes with a dedicated
wait event. So, one could monitor the average duration of these waits
using sampling of wait events.

------
Regards,
Alexander Korotkov

#77Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andy Fan (#75)

On Tue, Apr 2, 2024 at 2:47 PM Andy Fan <zhihuifan1213@163.com> wrote:

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:

On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

8 years, we tried to add this feature, and now we suggest the best way
for this feature is to commit the minimal version first.

Just curious, do you or anyone else have an immediate use for this
function? If yes, how are they achieving read-after-write-consistency
on streaming standbys in their application right now without a
function like this?

The link [1] may be helpful and I think the reason there is reasonable
to me.

Actually we also disucss how to make sure the "read your writes
consistency" internally, and the soluation here looks good to me.

Glad to know that this patch will be committed very soon.

[1]
/messages/by-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC+A@mail.gmail.com

Thank you for your feedback.

I also can confirm that a lot of users would be very happy to have
"read your writes consistency" and ready to do something to achieve
this at an application level. However, they typically don't know what
exactly they need.

So, blogging about pg_wal_replay_wait() and spreading words about it
at conferences would be highly appreciated.

------
Regards,
Alexander Korotkov

#78Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Bharath Rupireddy (#74)

On 2024-04-02 13:15, Bharath Rupireddy wrote:

On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

8 years, we tried to add this feature, and now we suggest the best way
for this feature is to commit the minimal version first.

Just curious, do you or anyone else have an immediate use for this
function? If yes, how are they achieving read-after-write-consistency
on streaming standbys in their application right now without a
function like this?

Just now, application runs pg_current_wal_lsn() after update and then
waits on loop pg_current_wal_flush_lsn() until updated.

Or use slow synchronous_commit.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

#79Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alexander Korotkov (#69)

Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a
spinlock) to protect the contents of waitLSN -- and it's used to walk an
arbitrary long list of processes waiting ... and also, an arbitrary
number of processes could be calling this code. I think using a
spinlock for this is unwise, as it'll cause busy-waiting whenever
there's contention. Wouldn't it be better to use an LWLock for this?
Then the processes would sleep until the lock is freed.

While nosing about the code, other things struck me:

I think there should be more comments about WaitLSNProcInfo and
WaitLSNState in waitlsn.h.

In addLSNWaiter it'd be better to assign 'cur' before acquiring the
lock.

Is a plan array really the most efficient data structure for this,
considering that you have to reorder each time you add an element?
Maybe it is, but wouldn't it make sense to use memmove() when adding one
element rather iterating all the remaining elements to the end of the
queue?

I think the include list in waitlsn.c could be tightened a bit:

@@ -18,28 +18,18 @@
#include <math.h>

 #include "pgstat.h"
-#include "fmgr.h"
-#include "access/transam.h"
-#include "access/xact.h"
 #include "access/xlog.h"
-#include "access/xlogdefs.h"
 #include "access/xlogrecovery.h"
-#include "catalog/pg_type.h"
 #include "commands/waitlsn.h"
-#include "executor/spi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "storage/ipc.h"
 #include "storage/latch.h"
-#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/shmem.h"
-#include "storage/sinvaladt.h"
-#include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 #include "utils/fmgrprotos.h"
+#include "utils/wait_event_types.h"

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)

#80Daniel Gustafsson
daniel@yesql.se
In reply to: Alvaro Herrera (#79)

Buildfarm animal mamba (NetBSD 10.0 on macppc) started failing on this with
what seems like a bogus compiler warning:

waitlsn.c: In function 'WaitForLSN':
waitlsn.c:275:24: error: 'endtime' may be used uninitialized in this function [-Werror=maybe-uninitialized]
275 | delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
| ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

endtime is indeed initialized further up, but initializing endtime at
declaration seems innocent enough and should make this compiler happy and the
buildfarm greener.

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 6679378156..17ad0057ad 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -226,7 +226,7 @@ void
 WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
 {
        XLogRecPtr      currentLSN;
-       TimestampTz endtime;
+       TimestampTz endtime = 0;

/* Shouldn't be called when shmem isn't initialized */
Assert(waitLSN);

--
Daniel Gustafsson

#81Andy Fan
zhihuifan1213@163.com
In reply to: Alexander Korotkov (#77)

I also can confirm that a lot of users would be very happy to have
"read your writes consistency" and ready to do something to achieve
this at an application level. However, they typically don't know what
exactly they need.

So, blogging about pg_wal_replay_wait() and spreading words about it
at conferences would be highly appreciated.

Sure, once it is committed, I promise I can doing a knowledge sharing in
our organization and write a article in chinese language.

--
Best Regards
Andy Fan

#82Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alvaro Herrera (#79)

Hi, Alvaro!

Thank you for your feedback.

On Wed, Apr 3, 2024 at 9:58 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hello, I noticed that commit 06c418e163e9 uses waitLSN->mutex (a
spinlock) to protect the contents of waitLSN -- and it's used to walk an
arbitrary long list of processes waiting ... and also, an arbitrary
number of processes could be calling this code. I think using a
spinlock for this is unwise, as it'll cause busy-waiting whenever
there's contention. Wouldn't it be better to use an LWLock for this?
Then the processes would sleep until the lock is freed.

While nosing about the code, other things struck me:

I think there should be more comments about WaitLSNProcInfo and
WaitLSNState in waitlsn.h.

In addLSNWaiter it'd be better to assign 'cur' before acquiring the
lock.

Is a plan array really the most efficient data structure for this,
considering that you have to reorder each time you add an element?
Maybe it is, but wouldn't it make sense to use memmove() when adding one
element rather iterating all the remaining elements to the end of the
queue?

I think the include list in waitlsn.c could be tightened a bit:

I've just pushed commit, which shortens the include list and fixes the
order of 'cur' assigning and taking spinlock in addLSNWaiter().

Regarding the shmem data structure for LSN waiters. I didn't pick
LWLock or ConditionVariable, because I needed the ability to wake up
only those waiters whose LSN is already replayed. In my experience
waking up a process is way slower than scanning a short flat array.

However, I agree that when the number of waiters is very high and flat
array may become a problem. It seems that the pairing heap is not
hard to use for shmem structures. The only memory allocation call in
paritingheap.c is in pairingheap_allocate(). So, it's only needed to
be able to initialize the pairing heap in-place, and it will be fine
for shmem.

I'll come back with switching to the pairing heap shortly.

------
Regards,
Alexander Korotkov

#83Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alexander Korotkov (#82)
1 attachment(s)

Hello Alexander,

On 2024-Apr-03, Alexander Korotkov wrote:

Regarding the shmem data structure for LSN waiters. I didn't pick
LWLock or ConditionVariable, because I needed the ability to wake up
only those waiters whose LSN is already replayed. In my experience
waking up a process is way slower than scanning a short flat array.

I agree, but I think that's unrelated to what I was saying, which is
just the patch I attach here.

However, I agree that when the number of waiters is very high and flat
array may become a problem. It seems that the pairing heap is not
hard to use for shmem structures. The only memory allocation call in
paritingheap.c is in pairingheap_allocate(). So, it's only needed to
be able to initialize the pairing heap in-place, and it will be fine
for shmem.

Ok.

With the code as it stands today, everything in WaitLSNState apart from
the pairing heap is accessed without any locking. I think this is at
least partly OK because each backend only accesses its own entry; but it
deserves a comment. Or maybe something more, because WaitLSNSetLatches
does modify the entry for other backends. (Admittedly, this could only
happens for backends that are already sleeping, and it only happens
with the lock acquired, so it's probably okay. But clearly it deserves
a comment.)

Don't we need to WaitLSNCleanup() during error recovery or something?

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"World domination is proceeding according to plan" (Andrew Morton)

Attachments:

0001-Use-an-LWLock-instead-of-a-spinlock-in-waitlsn.c.patchtext/x-diff; charset=utf-8Download
From 4079dc6a6a6893055b32dee75f178b324bbaef77 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 3 Apr 2024 18:35:15 +0200
Subject: [PATCH] Use an LWLock instead of a spinlock in waitlsn.c

---
 src/backend/commands/waitlsn.c                  | 15 +++++++--------
 src/backend/utils/activity/wait_event_names.txt |  1 +
 src/include/commands/waitlsn.h                  |  5 +----
 src/include/storage/lwlocklist.h                |  1 +
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 51a34d422e..a57b818a2d 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -58,7 +58,6 @@ WaitLSNShmemInit(void)
 											   &found);
 	if (!found)
 	{
-		SpinLockInit(&waitLSN->waitersHeapMutex);
 		pg_atomic_init_u64(&waitLSN->minWaitedLSN, PG_UINT64_MAX);
 		pairingheap_initialize(&waitLSN->waitersHeap, lsn_cmp, NULL);
 		memset(&waitLSN->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
@@ -115,13 +114,13 @@ addLSNWaiter(XLogRecPtr lsn)
 	procInfo->procnum = MyProcNumber;
 	procInfo->waitLSN = lsn;
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	pairingheap_add(&waitLSN->waitersHeap, &procInfo->phNode);
 	procInfo->inHeap = true;
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 }
 
 /*
@@ -132,11 +131,11 @@ deleteLSNWaiter(void)
 {
 	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	if (!procInfo->inHeap)
 	{
-		SpinLockRelease(&waitLSN->waitersHeapMutex);
+		LWLockRelease(WaitLSNLock);
 		return;
 	}
 
@@ -144,7 +143,7 @@ deleteLSNWaiter(void)
 	procInfo->inHeap = false;
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 }
 
 /*
@@ -160,7 +159,7 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 
 	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	/*
 	 * Iterate the pairing heap of waiting processes till we find LSN not yet
@@ -182,7 +181,7 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 
 	/*
 	 * Set latches for processes, whose waited LSNs are already replayed. This
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0d288d6b3d..ec43a78d60 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -330,6 +330,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
index 0d80248682..b3d9eed64d 100644
--- a/src/include/commands/waitlsn.h
+++ b/src/include/commands/waitlsn.h
@@ -55,13 +55,10 @@ typedef struct WaitLSNState
 
 	/*
 	 * A pairing heap of waiting processes order by LSN values (least LSN is
-	 * on top).
+	 * on top).  Protected by WaitLSNLock.
 	 */
 	pairingheap waitersHeap;
 
-	/* A mutex protecting the pairing heap above */
-	slock_t		waitersHeapMutex;
-
 	/* An array with per-process information, indexed by the process number */
 	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
 } WaitLSNState;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 85f6568b9e..c2bab5a794 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
-- 
2.39.2

#84Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alvaro Herrera (#83)

On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2024-Apr-03, Alexander Korotkov wrote:

Regarding the shmem data structure for LSN waiters. I didn't pick
LWLock or ConditionVariable, because I needed the ability to wake up
only those waiters whose LSN is already replayed. In my experience
waking up a process is way slower than scanning a short flat array.

I agree, but I think that's unrelated to what I was saying, which is
just the patch I attach here.

Oh, sorry for the confusion. I'd re-read your message. Indeed you
meant this very clearly!

I'm good with the patch. Attached revision contains a bit of a commit message.

However, I agree that when the number of waiters is very high and flat
array may become a problem. It seems that the pairing heap is not
hard to use for shmem structures. The only memory allocation call in
paritingheap.c is in pairingheap_allocate(). So, it's only needed to
be able to initialize the pairing heap in-place, and it will be fine
for shmem.

Ok.

With the code as it stands today, everything in WaitLSNState apart from
the pairing heap is accessed without any locking. I think this is at
least partly OK because each backend only accesses its own entry; but it
deserves a comment. Or maybe something more, because WaitLSNSetLatches
does modify the entry for other backends. (Admittedly, this could only
happens for backends that are already sleeping, and it only happens
with the lock acquired, so it's probably okay. But clearly it deserves
a comment.)

Please, check 0002 patch attached. I found it easier to move two
assignments we previously moved out of lock, into the lock; then claim
WaitLSNState.procInfos is also protected by WaitLSNLock.

Don't we need to WaitLSNCleanup() during error recovery or something?

Yes, there is WaitLSNCleanup(). It's currently only called from one
place, given that waiting for LSN can't be invoked from background
workers or inside the transaction.

------
Regards,
Alexander Korotkov

#85Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#84)

Hi, Alexander!

On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2024-Apr-03, Alexander Korotkov wrote:

Regarding the shmem data structure for LSN waiters. I didn't pick
LWLock or ConditionVariable, because I needed the ability to wake up
only those waiters whose LSN is already replayed. In my experience
waking up a process is way slower than scanning a short flat array.

I agree, but I think that's unrelated to what I was saying, which is
just the patch I attach here.

Oh, sorry for the confusion. I'd re-read your message. Indeed you
meant this very clearly!

I'm good with the patch. Attached revision contains a bit of a commit
message.

However, I agree that when the number of waiters is very high and flat
array may become a problem. It seems that the pairing heap is not
hard to use for shmem structures. The only memory allocation call in
paritingheap.c is in pairingheap_allocate(). So, it's only needed to
be able to initialize the pairing heap in-place, and it will be fine
for shmem.

Ok.

With the code as it stands today, everything in WaitLSNState apart from
the pairing heap is accessed without any locking. I think this is at
least partly OK because each backend only accesses its own entry; but it
deserves a comment. Or maybe something more, because WaitLSNSetLatches
does modify the entry for other backends. (Admittedly, this could only
happens for backends that are already sleeping, and it only happens
with the lock acquired, so it's probably okay. But clearly it deserves
a comment.)

Please, check 0002 patch attached. I found it easier to move two
assignments we previously moved out of lock, into the lock; then claim
WaitLSNState.procInfos is also protected by WaitLSNLock.

Could you re-attach 0002. Seems it failed to attach to the previous
message.

Regards,
Pavel

#86Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#85)
2 attachment(s)

On Wed, Apr 3, 2024 at 10:04 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Wed, 3 Apr 2024 at 22:18, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Apr 3, 2024 at 7:55 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2024-Apr-03, Alexander Korotkov wrote:

Regarding the shmem data structure for LSN waiters. I didn't pick
LWLock or ConditionVariable, because I needed the ability to wake up
only those waiters whose LSN is already replayed. In my experience
waking up a process is way slower than scanning a short flat array.

I agree, but I think that's unrelated to what I was saying, which is
just the patch I attach here.

Oh, sorry for the confusion. I'd re-read your message. Indeed you
meant this very clearly!

I'm good with the patch. Attached revision contains a bit of a commit message.

However, I agree that when the number of waiters is very high and flat
array may become a problem. It seems that the pairing heap is not
hard to use for shmem structures. The only memory allocation call in
paritingheap.c is in pairingheap_allocate(). So, it's only needed to
be able to initialize the pairing heap in-place, and it will be fine
for shmem.

Ok.

With the code as it stands today, everything in WaitLSNState apart from
the pairing heap is accessed without any locking. I think this is at
least partly OK because each backend only accesses its own entry; but it
deserves a comment. Or maybe something more, because WaitLSNSetLatches
does modify the entry for other backends. (Admittedly, this could only
happens for backends that are already sleeping, and it only happens
with the lock acquired, so it's probably okay. But clearly it deserves
a comment.)

Please, check 0002 patch attached. I found it easier to move two
assignments we previously moved out of lock, into the lock; then claim
WaitLSNState.procInfos is also protected by WaitLSNLock.

Could you re-attach 0002. Seems it failed to attach to the previous message.

I actually forgot both!

------
Regards,
Alexander Korotkov

Attachments:

v2-0001-Use-an-LWLock-instead-of-a-spinlock-in-waitlsn.c.patchapplication/octet-stream; name=v2-0001-Use-an-LWLock-instead-of-a-spinlock-in-waitlsn.c.patchDownload
From ecd4502ef541bcbb8991c780d6a89ba4d5d1fb8e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 3 Apr 2024 18:35:15 +0200
Subject: [PATCH v2 1/2] Use an LWLock instead of a spinlock in waitlsn.c

This should prevent busy-waiting when number of waiting processes is high.

Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
Author: Alvaro Herrera
---
 src/backend/commands/waitlsn.c                  | 15 +++++++--------
 src/backend/utils/activity/wait_event_names.txt |  1 +
 src/include/commands/waitlsn.h                  |  5 +----
 src/include/storage/lwlocklist.h                |  1 +
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 51a34d422e2..a57b818a2d4 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -58,7 +58,6 @@ WaitLSNShmemInit(void)
 											   &found);
 	if (!found)
 	{
-		SpinLockInit(&waitLSN->waitersHeapMutex);
 		pg_atomic_init_u64(&waitLSN->minWaitedLSN, PG_UINT64_MAX);
 		pairingheap_initialize(&waitLSN->waitersHeap, lsn_cmp, NULL);
 		memset(&waitLSN->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
@@ -115,13 +114,13 @@ addLSNWaiter(XLogRecPtr lsn)
 	procInfo->procnum = MyProcNumber;
 	procInfo->waitLSN = lsn;
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	pairingheap_add(&waitLSN->waitersHeap, &procInfo->phNode);
 	procInfo->inHeap = true;
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 }
 
 /*
@@ -132,11 +131,11 @@ deleteLSNWaiter(void)
 {
 	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	if (!procInfo->inHeap)
 	{
-		SpinLockRelease(&waitLSN->waitersHeapMutex);
+		LWLockRelease(WaitLSNLock);
 		return;
 	}
 
@@ -144,7 +143,7 @@ deleteLSNWaiter(void)
 	procInfo->inHeap = false;
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 }
 
 /*
@@ -160,7 +159,7 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 
 	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
 
-	SpinLockAcquire(&waitLSN->waitersHeapMutex);
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
 
 	/*
 	 * Iterate the pairing heap of waiting processes till we find LSN not yet
@@ -182,7 +181,7 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 
 	updateMinWaitedLSN();
 
-	SpinLockRelease(&waitLSN->waitersHeapMutex);
+	LWLockRelease(WaitLSNLock);
 
 	/*
 	 * Set latches for processes, whose waited LSNs are already replayed. This
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0d288d6b3d8..ec43a78d603 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -330,6 +330,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
index 0d80248682c..b3d9eed64d8 100644
--- a/src/include/commands/waitlsn.h
+++ b/src/include/commands/waitlsn.h
@@ -55,13 +55,10 @@ typedef struct WaitLSNState
 
 	/*
 	 * A pairing heap of waiting processes order by LSN values (least LSN is
-	 * on top).
+	 * on top).  Protected by WaitLSNLock.
 	 */
 	pairingheap waitersHeap;
 
-	/* A mutex protecting the pairing heap above */
-	slock_t		waitersHeapMutex;
-
 	/* An array with per-process information, indexed by the process number */
 	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
 } WaitLSNState;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 85f6568b9e4..c2bab5a794e 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
-- 
2.39.3 (Apple Git-145)

v2-0002-Clarify-what-is-protected-by-WaitLSNLock.patchapplication/octet-stream; name=v2-0002-Clarify-what-is-protected-by-WaitLSNLock.patchDownload
From 4a805b3dbd9a324c7ab576eb572c3e5857ca8059 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 3 Apr 2024 20:54:46 +0300
Subject: [PATCH v2 2/2] Clarify what is protected by WaitLSNLock

Not just WaitLSNState.waitersHeap, but also WaitLSNState.procInfos and
updating of WaitLSNState.minWaitedLSN is protected by WaitLSNLock.  There
is one now documented exclusion on fast-path checking of
WaitLSNProcInfo.inHeap flag.

Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
---
 src/backend/commands/waitlsn.c | 10 ++++++++--
 src/include/commands/waitlsn.h |  7 +++++--
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index a57b818a2d4..1a83c34e09f 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -109,13 +109,13 @@ addLSNWaiter(XLogRecPtr lsn)
 {
 	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
 
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
 	Assert(!procInfo->inHeap);
 
 	procInfo->procnum = MyProcNumber;
 	procInfo->waitLSN = lsn;
 
-	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
-
 	pairingheap_add(&waitLSN->waitersHeap, &procInfo->phNode);
 	procInfo->inHeap = true;
 	updateMinWaitedLSN();
@@ -203,6 +203,12 @@ WaitLSNSetLatches(XLogRecPtr currentLSN)
 void
 WaitLSNCleanup(void)
 {
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
 	if (waitLSN->procInfos[MyProcNumber].inHeap)
 		deleteLSNWaiter();
 }
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
index b3d9eed64d8..da17b8be6f9 100644
--- a/src/include/commands/waitlsn.h
+++ b/src/include/commands/waitlsn.h
@@ -49,7 +49,7 @@ typedef struct WaitLSNState
 	/*
 	 * The minimum LSN value some process is waiting for.  Used for the
 	 * fast-path checking if we need to wake up any waiters after replaying a
-	 * WAL record.
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
 	 */
 	pg_atomic_uint64 minWaitedLSN;
 
@@ -59,7 +59,10 @@ typedef struct WaitLSNState
 	 */
 	pairingheap waitersHeap;
 
-	/* An array with per-process information, indexed by the process number */
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
 	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
 } WaitLSNState;
 
-- 
2.39.3 (Apple Git-145)

#87Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Alexander Korotkov (#86)

Hello,

BTW I noticed that
https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html
says that lsn_cmp is not covered by the tests. This probably indicates
that the tests are a little too light, but I'm not sure how much extra
effort we want to spend.

I'm still concerned that WaitLSNCleanup is only called in ProcKill.
Does this mean that if a process throws an error while waiting, it'll
not get cleaned up until it exits? Maybe this is not a big deal, but it
seems odd.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Now I have my system running, not a byte was off the shelf;
It rarely breaks and when it does I fix the code myself.
It's stable, clean and elegant, and lightning fast as well,
And it doesn't cost a nickel, so Bill Gates can go to hell."

#88Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alvaro Herrera (#87)

Hi, Alvaro!

Thank you for your care on this matter.

On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

BTW I noticed that
https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html
says that lsn_cmp is not covered by the tests. This probably indicates
that the tests are a little too light, but I'm not sure how much extra
effort we want to spend.

I'm aware of this. Ivan promised to send a patch to improve the test.
If he doesn't, I'll care about it.

I'm still concerned that WaitLSNCleanup is only called in ProcKill.
Does this mean that if a process throws an error while waiting, it'll
not get cleaned up until it exits? Maybe this is not a big deal, but it
seems odd.

I've added WaitLSNCleanup() to the AbortTransaction(). Just pushed
that together with the improvements upthread.

------
Regards,
Alexander Korotkov

#89Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alvaro Herrera (#87)
1 attachment(s)

I did some experiments over synchronous replications and
got that cascade replication can`t be synchronous. And 
pg_wal_replay_wait() allows us to read your writes
consistency on cascade replication.
Beyond that, I added more tests on multi-standby replication
and cascade replications.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

additional_test.patchtext/x-diff; name=additional_test.patchDownload
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index bbd64aa67b..867c3dd94a 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -19,17 +19,17 @@ my $backup_name = 'my_backup';
 $node_primary->backup($backup_name);
 
 # Create a streaming standby with a 1 second delay from the backup
-my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
 my $delay = 1;
-$node_standby->init_from_backup($node_primary, $backup_name,
+$node_standby1->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
-$node_standby->append_conf(
+$node_standby1->append_conf(
 	'postgresql.conf', qq[
 	recovery_min_apply_delay = '${delay}s'
 ]);
-$node_standby->start;
-
+$node_standby1->start;
 
+# I
 # Make sure that pg_wal_replay_wait() works: add new content to
 # primary and memorize primary's insert LSN, then wait for that LSN to be
 # replayed on standby.
@@ -37,7 +37,7 @@ $node_primary->safe_psql('postgres',
 	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
 my $lsn1 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $output = $node_standby->safe_psql(
+my $output = $node_standby1->safe_psql(
 	'postgres', qq[
 	CALL pg_wal_replay_wait('${lsn1}', 1000000);
 	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
@@ -48,12 +48,13 @@ my $output = $node_standby->safe_psql(
 ok($output >= 0,
 	"standby reached the same LSN as primary after pg_wal_replay_wait()");
 
+# II
 # Check that new data is visible after calling pg_wal_replay_wait()
 $node_primary->safe_psql('postgres',
 	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
 my $lsn2 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-$output = $node_standby->safe_psql(
+$output = $node_standby1->safe_psql(
 	'postgres', qq[
 	CALL pg_wal_replay_wait('${lsn2}');
 	SELECT count(*) FROM wait_test;
@@ -62,6 +63,82 @@ $output = $node_standby->safe_psql(
 # Make sure the current LSN on standby and is the same as primary's LSN
 ok($output eq 30, "standby reached the same LSN as primary");
 
+# III
+# Check two standby waiting LSN
+# Create a streaming second standby with a 1 second delay from the backup
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby2->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+my $output2 = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 40, "standby1 reached the same LSN as primary");
+ok($output2 eq 40, "standby2 reached the same LSN as primary");
+
+# IV
+# Create a cascading standby waiting LSN
+$backup_name = 'cas_backup';
+$node_standby1->backup($backup_name);
+
+my $cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+$cascading_standby->init_from_backup($node_standby1, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+my $cascading_connstr = $node_standby1->connstr;
+$cascading_standby->append_conf(
+	'postgresql.conf', qq(
+	hot_standby_feedback = on
+	recovery_min_apply_delay = '${delay}s'
+));
+
+$cascading_standby->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+$output2 = $cascading_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 50, "standby1 reached the same LSN as primary");
+ok($output2 eq 50, "cascading_standby reached the same LSN as primary");
+
+# V
 # Check that waiting for unreachable LSN triggers the timeout.  The
 # unreachable LSN must be well in advance.  So WAL records issued by
 # the concurrent autovacuum could not affect that.
@@ -69,29 +146,31 @@ my $lsn3 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
 my $stderr;
-$node_standby->safe_psql('postgres',
+$node_standby1->safe_psql('postgres',
 	"CALL pg_wal_replay_wait('${lsn2}', 10);");
-$node_standby->psql(
+$node_standby1->psql(
 	'postgres',
 	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
 	stderr => \$stderr);
 ok( $stderr =~ /timed out while waiting for target LSN/,
 	"get timeout on waiting for unreachable LSN");
 
+# VI
 # Check that the standby promotion terminates the wait on LSN.  Start
 # waiting for unreachable LSN then promote.  Check the log for the relevant
 # error message.
-my $psql_session = $node_standby->background_psql('postgres');
+my $psql_session = $node_standby1->background_psql('postgres');
 $psql_session->query_until(
 	qr/start/, qq[
 	\\echo start
 	CALL pg_wal_replay_wait('${lsn3}');
 ]);
 
-my $log_offset = -s $node_standby->logfile;
-$node_standby->promote;
-$node_standby->wait_for_log('recovery is not in progress', $log_offset);
+my $log_offset = -s $node_standby1->logfile;
+$node_standby1->promote;
+$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
 
-$node_standby->stop;
+$node_standby1->stop;
+$node_standby2->stop;
 $node_primary->stop;
 done_testing();
#90Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alexander Korotkov (#88)

On 07/04/2024 00:52, Alexander Korotkov wrote:

On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I'm still concerned that WaitLSNCleanup is only called in ProcKill.
Does this mean that if a process throws an error while waiting, it'll
not get cleaned up until it exits? Maybe this is not a big deal, but it
seems odd.

I've added WaitLSNCleanup() to the AbortTransaction(). Just pushed
that together with the improvements upthread.

Race condition:

1. backend: pg_wal_replay_wait('0/1234') is called. It calls WaitForLSN
2. backend: WaitForLSN calls addLSNWaiter('0/1234'). It adds the backend
process to the LSN heap and returns
3. replay: rm_redo record '0/1234'
4. backend: WaitForLSN enters for-loop, calls GetXLogReplayRecPtr()
5. backend: current replay LSN location is '0/1234', so we exit the loop
6. replay: calls WaitLSNSetLatches()

In a nutshell, it's possible for the loop in WaitForLSN to exit without
cleaning up the process from the heap. I was able to hit that by adding
a delay after the addLSNWaiter() call:

TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]

I think there's a similar race condition if the timeout is reached at
the same time that the startup process wakes up the process.

* At first, we check that pg_wal_replay_wait() is called in a non-atomic
* context. That is, a procedure call isn't wrapped into a transaction,
* another procedure call, or a function call.
*

It's pretty unfortunate to have all these restrictions. It would be nice
to do:

select pg_wal_replay_wait('0/1234'); select * from foo;

in a single multi-query call, to avoid the round-trip to the client. You
can avoid it with libpq or protocol level pipelining, too, but it's more
complicated.

* Secondly, according to PlannedStmtRequiresSnapshot(), even in an atomic
* context, CallStmt is processed with a snapshot. Thankfully, we can pop
* this snapshot, because PortalRunUtility() can tolerate this.

This assumption that PortalRunUtility() can tolerate us popping the
snapshot sounds very fishy. I haven't looked at what's going on there,
but doesn't sound like a great assumption.

If recovery ends while a process is waiting for an LSN to arrive, does
it ever get woken up?

The docs could use some-copy-editing, but just to point out one issue:

There are also procedures to control the progress of recovery.

That's copy-pasted from an earlier sentence at the table that lists
functions like pg_promote(), pg_wal_replay_pause(), and
pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control the
progress of recovery like those functions do, it only causes the calling
backend to wait.

Overall, this feature doesn't feel quite ready for v17, and IMHO should
be reverted. It's a nice feature, so I'd love to have it fixed and
reviewed early in the v18 cycle.

--
Heikki Linnakangas
Neon (https://neon.tech)

#91Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#90)

Hi, Heikki!

Thank you for your interest in the subject.

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/04/2024 00:52, Alexander Korotkov wrote:

On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I'm still concerned that WaitLSNCleanup is only called in ProcKill.
Does this mean that if a process throws an error while waiting, it'll
not get cleaned up until it exits? Maybe this is not a big deal, but it
seems odd.

I've added WaitLSNCleanup() to the AbortTransaction(). Just pushed
that together with the improvements upthread.

Race condition:

1. backend: pg_wal_replay_wait('0/1234') is called. It calls WaitForLSN
2. backend: WaitForLSN calls addLSNWaiter('0/1234'). It adds the backend
process to the LSN heap and returns
3. replay: rm_redo record '0/1234'
4. backend: WaitForLSN enters for-loop, calls GetXLogReplayRecPtr()
5. backend: current replay LSN location is '0/1234', so we exit the loop
6. replay: calls WaitLSNSetLatches()

In a nutshell, it's possible for the loop in WaitForLSN to exit without
cleaning up the process from the heap. I was able to hit that by adding
a delay after the addLSNWaiter() call:

TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]

I think there's a similar race condition if the timeout is reached at
the same time that the startup process wakes up the process.

Thank you for catching this. I think WaitForLSN() just needs to call
deleteLSNWaiter() unconditionally after exit from the loop.

* At first, we check that pg_wal_replay_wait() is called in a non-atomic
* context. That is, a procedure call isn't wrapped into a transaction,
* another procedure call, or a function call.
*

It's pretty unfortunate to have all these restrictions. It would be nice
to do:

select pg_wal_replay_wait('0/1234'); select * from foo;

This works for me, except that it needs "call" not "select".

# call pg_wal_replay_wait('0/1234'); select * from foo;
CALL
i
---
(0 rows)

in a single multi-query call, to avoid the round-trip to the client. You
can avoid it with libpq or protocol level pipelining, too, but it's more
complicated.

* Secondly, according to PlannedStmtRequiresSnapshot(), even in an atomic
* context, CallStmt is processed with a snapshot. Thankfully, we can pop
* this snapshot, because PortalRunUtility() can tolerate this.

This assumption that PortalRunUtility() can tolerate us popping the
snapshot sounds very fishy. I haven't looked at what's going on there,
but doesn't sound like a great assumption.

This is what PortalRunUtility() says about this.

/*
* Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
* under us, so don't complain if it's now empty. Otherwise, our snapshot
* should be the top one; pop it. Note that this could be a different
* snapshot from the one we made above; see EnsurePortalSnapshotExists.
*/

So, if the vacuum pops a snapshot when it needs to run without a
snapshot, then it's probably OK for other utilities. But I agree this
decision needs some consensus.

If recovery ends while a process is waiting for an LSN to arrive, does
it ever get woken up?

If the recovery target is promote, then the user will get an error.
If the recovery target is shutdown, then connection will get
interrupted. If the recovery target is pause, then waiting will
continue during the pause. Not sure about the latter case.

The docs could use some-copy-editing, but just to point out one issue:

There are also procedures to control the progress of recovery.

That's copy-pasted from an earlier sentence at the table that lists
functions like pg_promote(), pg_wal_replay_pause(), and
pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control the
progress of recovery like those functions do, it only causes the calling
backend to wait.

Overall, this feature doesn't feel quite ready for v17, and IMHO should
be reverted. It's a nice feature, so I'd love to have it fixed and
reviewed early in the v18 cycle.

Thank you for your review. I've reverted this. Will repost this for early v18.

------
Regards,
Alexander Korotkov

#92Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alexander Korotkov (#91)

On 11/04/2024 18:09, Alexander Korotkov wrote:

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 07/04/2024 00:52, Alexander Korotkov wrote:

* At first, we check that pg_wal_replay_wait() is called in a non-atomic
* context. That is, a procedure call isn't wrapped into a transaction,
* another procedure call, or a function call.
*

It's pretty unfortunate to have all these restrictions. It would be nice
to do:

select pg_wal_replay_wait('0/1234'); select * from foo;

This works for me, except that it needs "call" not "select".

# call pg_wal_replay_wait('0/1234'); select * from foo;
CALL
i
---
(0 rows)

If you do that from psql prompt, it works because psql parses and sends
it as two separate round-trips. Try:

psql postgres -p 5433 -c "call pg_wal_replay_wait('0/4101BBB8'); select 1"
ERROR: pg_wal_replay_wait() must be only called in non-atomic context
DETAIL: Make sure pg_wal_replay_wait() isn't called within a
transaction, another procedure, or a function.

This assumption that PortalRunUtility() can tolerate us popping the
snapshot sounds very fishy. I haven't looked at what's going on there,
but doesn't sound like a great assumption.

This is what PortalRunUtility() says about this.

/*
* Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
* under us, so don't complain if it's now empty. Otherwise, our snapshot
* should be the top one; pop it. Note that this could be a different
* snapshot from the one we made above; see EnsurePortalSnapshotExists.
*/

So, if the vacuum pops a snapshot when it needs to run without a
snapshot, then it's probably OK for other utilities. But I agree this
decision needs some consensus.

Ok, so it's at least somewhat documented that it's fine.

Overall, this feature doesn't feel quite ready for v17, and IMHO should
be reverted. It's a nice feature, so I'd love to have it fixed and
reviewed early in the v18 cycle.

Thank you for your review. I've reverted this. Will repost this for early v18.

Thanks Alexander for working on this.

--
Heikki Linnakangas
Neon (https://neon.tech)

#93Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Alexander Korotkov (#91)
1 attachment(s)

Hi, Alexander, Here, I made some improvements according to your
discussion with Heikki.

On 2024-04-11 18:09, Alexander Korotkov wrote:

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

In a nutshell, it's possible for the loop in WaitForLSN to exit
without
cleaning up the process from the heap. I was able to hit that by
adding
a delay after the addLSNWaiter() call:

TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]

I think there's a similar race condition if the timeout is reached at
the same time that the startup process wakes up the process.

Thank you for catching this. I think WaitForLSN() just needs to call
deleteLSNWaiter() unconditionally after exit from the loop.

Fix and add injection point test on this race condition.

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

The docs could use some-copy-editing, but just to point out one issue:

There are also procedures to control the progress of recovery.

That's copy-pasted from an earlier sentence at the table that lists
functions like pg_promote(), pg_wal_replay_pause(), and
pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control
the
progress of recovery like those functions do, it only causes the
calling
backend to wait.

Fix documentation and add extra tests on multi-standby replication
and cascade replication.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v18-0001-Implement-pg_wal_replay_wait-stored-procedure.patchtext/x-diff; name=v18-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
commit ddbee0e9fcd1f0a2a770d476acef677486f62367
Author: i.kartyshov <i.kartyshov@postrgespro.ru>
Date:   Wed Jun 12 01:29:00 2024 +0300

    Subject: [PATCH v18] Implement pg_wal_replay_wait() stored procedure
    
    pg_wal_replay_wait() is to be used on standby and specifies waiting for
    the specific WAL location to be replayed before starting the transaction.
    This option is useful when the user makes some data changes on primary and
    needs a guarantee to see these changes on standby.
    
    The queue of waiters is stored in the shared memory array sorted by LSN.
    During replay of WAL waiters whose LSNs are already replayed are deleted from
    the shared memory array and woken up by setting of their latches.
    
    pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
    the snapshot could prevent the replay of WAL records implying a kind of
    self-deadlock.  This is why it is only possible to implement
    pg_wal_replay_wait() as a procedure working in a non-atomic context,
    not a function.
    
    Catversion is bumped.
    
    Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
    Author: Kartyshov Ivan, Alexander Korotkov
    Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
    Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
    ---
     doc/src/sgml/func.sgml                                    | 112 ++++++++++++
     src/backend/access/transam/xact.c                         |   6 +
     src/backend/access/transam/xlog.c                         |   7 +
     src/backend/access/transam/xlogrecovery.c                 |  11 ++
     src/backend/catalog/system_functions.sql                  |   3 +
     src/backend/commands/Makefile                             |   3 +-
     src/backend/commands/meson.build                          |   1 +
     src/backend/commands/waitlsn.c                            | 344 +++++++++++++++++++++++++++++++++++++
     src/backend/lib/pairingheap.c                             |  18 +-
     src/backend/storage/ipc/ipci.c                            |   3 +
     src/backend/storage/lmgr/proc.c                           |   6 +
     src/backend/utils/activity/wait_event_names.txt           |   2 +
     src/include/catalog/pg_proc.dat                           |   5 +
     src/include/commands/waitlsn.h                            |  77 +++++++++
     src/include/lib/pairingheap.h                             |   3 +
     src/include/storage/lwlocklist.h                          |   1 +
     src/test/recovery/meson.build                             |   2 +
     src/test/recovery/t/043_wal_replay_wait.pl                | 176 +++++++++++++++++++
     src/test/recovery/t/044_wal_replay_wait_injection_test.pl |  98 +++++++++++
     src/tools/pgindent/typedefs.list                          |   2 +
     20 files changed, 877 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 17c44bc338..b36c5b2ad8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28762,6 +28762,118 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    The procedure shown in <xref linkend="procedures-recovery-control-table"/>
+    can be executed only during recovery.
+   </para>
+
+   <table id="procedures-recovery-control-table">
+    <title>Recovery Procedures</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        If <parameter>timeout</parameter> is not specified or zero, this
+        procedure returns once WAL is replayed upto
+        <literal>target_lsn</literal>.
+        If the <parameter>timeout</parameter> is specified (in
+        milliseconds) and greater than zero, the procedure waits until the
+        server actually replays the WAL upto <literal>target_lsn</literal> or
+        until the given time has passed. On timeout, an error is emitted.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+    </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can't be used within
+     the transaction, another procedure or function.  Any of them holds a
+     snapshot, which could prevent the replay of WAL records.
+
+   <programlisting>
+postgres=# BEGIN;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+ERROR:  pg_wal_replay_wait() must be only called in non-atomic context
+DETAIL:  Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9bda1aa6bc..d9546f7761 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -38,6 +38,7 @@
 #include "commands/async.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
+#include "commands/waitlsn.h"
 #include "common/pg_prng.h"
 #include "executor/spi.h"
 #include "libpq/be-fsstubs.h"
@@ -2771,6 +2772,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f..ffc0f3d239 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "commands/waitlsn.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
 #include "executor/instrument.h"
@@ -6129,6 +6130,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before achieving the target LSN.
+	 */
+	WaitLSNSetLatches(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b45b833172..a96603cdd7 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minWaitedLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index ae099e328c..623b9539b1 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..bd28a2fe32
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,344 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/injection_point.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/wait_event_types.h"
+
+static int	lsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+					void *arg);
+
+struct WaitLSNState *waitLSN = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSN->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSN->waitersHeap, lsn_cmp, NULL);
+		memset(&waitLSN->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+lsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSN->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSN->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSN->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procnum = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSN->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSN->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
+ * LSN waiters when InvalidXLogRecPtr is given.
+ */
+void
+WaitLSNSetLatches(XLogRecPtr currentLSN)
+{
+	int			i;
+	int		   *wakeUpProcNums;
+	int			numWakeUpProcs = 0;
+
+	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to set their latches later.
+	 */
+	while (!pairingheap_is_empty(&waitLSN->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSN->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		wakeUpProcNums[numWakeUpProcs++] = procInfo->procnum;
+		(void) pairingheap_remove_first(&waitLSN->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		backend = GetPGProcByNumber(wakeUpProcNums[i]);
+		SetLatch(&backend->procLatch);
+	}
+	pfree(wakeUpProcNums);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSN->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* If target LSN is already replayed, exit immediately */
+	if (targetLSN <= GetXLogReplayRecPtr(NULL))
+		return;
+
+	if (timeout > 0)
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		int			latch_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		currentLSN = GetXLogReplayRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is not in progress"),
+					 errdetail("Recovery ended before replaying the target LSN %X/%X; last replay LSN %X/%X.",
+							   LSN_FORMAT_ARGS(targetLSN),
+							   LSN_FORMAT_ARGS(currentLSN))));
+
+		if (timeout > 0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Check if startup process is already replayed target lsn
+	 */
+	INJECTION_POINT("pg-wal-replay-wait");
+
+	deleteLSNWaiter();
+
+	if (targetLSN > currentLSN)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+						LSN_FORMAT_ARGS(targetLSN),
+						LSN_FORMAT_ARGS(currentLSN))));
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+	CallContext *context = (CallContext *) fcinfo->context;
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * pg_wal_replay_wait() is a procedure, not a function.
+	 *
+	 * At first, we check that pg_wal_replay_wait() is called in a non-atomic
+	 * context.  That is, a procedure call isn't wrapped into a transaction,
+	 * another procedure call, or a function call.
+	 *
+	 * Secondly, according to PlannedStmtRequiresSnapshot(), even in an atomic
+	 * context, CallStmt is processed with a snapshot.  Thankfully, we can pop
+	 * this snapshot, because PortalRunUtility() can tolerate this.
+	 */
+	if (context->atomic)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called in non-atomic context"),
+				 errdetail("Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.")));
+
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+	Assert(!ActiveSnapshotSet());
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index fe1deba13e..7858e5e076 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418c..90c84fec27 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -357,6 +359,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ce29da9012..18e07bf77d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 87cbca2811..f079d660a4 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -87,6 +87,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -345,6 +346,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6a5476d3c4..5d83c4e919 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12156,6 +12156,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN with timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..da17b8be6f
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,77 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "lib/pairingheap.h"
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo – the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/*
+	 * A process number, same as the index of this item in waitLSN->procInfos.
+	 * Stored for convenience.
+	 */
+	int			procnum;
+
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* A pairing heap node for participation in waitLSN->waitersHeap */
+	pairingheap_node phNode;
+
+	/* A flag indicating that this item is added to waitLSN->waitersHeap */
+	bool		inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes order by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr targetLSN, int64 timeout);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 7eade81535..9e1c26033a 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 85f6568b9e..c2bab5a794 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..3efe4645ac 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,8 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wal_replay_wait.pl',
+      't/044_wal_replay_wait_injection_test.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
new file mode 100644
index 0000000000..867c3dd94a
--- /dev/null
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -0,0 +1,176 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby1->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby1->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby1->start;
+
+# I
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# II
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+# III
+# Check two standby waiting LSN
+# Create a streaming second standby with a 1 second delay from the backup
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby2->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+my $output2 = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 40, "standby1 reached the same LSN as primary");
+ok($output2 eq 40, "standby2 reached the same LSN as primary");
+
+# IV
+# Create a cascading standby waiting LSN
+$backup_name = 'cas_backup';
+$node_standby1->backup($backup_name);
+
+my $cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+$cascading_standby->init_from_backup($node_standby1, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+my $cascading_connstr = $node_standby1->connstr;
+$cascading_standby->append_conf(
+	'postgresql.conf', qq(
+	hot_standby_feedback = on
+	recovery_min_apply_delay = '${delay}s'
+));
+
+$cascading_standby->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+$output2 = $cascading_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 50, "standby1 reached the same LSN as primary");
+ok($output2 eq 50, "cascading_standby reached the same LSN as primary");
+
+# V
+# Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby1->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 10);");
+$node_standby1->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+# VI
+# Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for unreachable LSN then promote.  Check the log for the relevant
+# error message.
+my $psql_session = $node_standby1->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn3}');
+]);
+
+my $log_offset = -s $node_standby1->logfile;
+$node_standby1->promote;
+$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
+
+$node_standby1->stop;
+$node_standby2->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/044_wal_replay_wait_injection_test.pl b/src/test/recovery/t/044_wal_replay_wait_injection_test.pl
new file mode 100644
index 0000000000..2429ef0da2
--- /dev/null
+++ b/src/test/recovery/t/044_wal_replay_wait_injection_test.pl
@@ -0,0 +1,98 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use Test::More;
+
+##################################################
+# Test race condition when timeout reached and new wal replayed on standby.
+#
+# This test relies on an injection point that cause to wait before Delete
+# from heap and wait when new lsn added to heap.
+##################################################
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize primary node.
+my $node_primary = PostgreSQL::Test::Cluster->new('master');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf(
+	'postgresql.conf', q[
+restart_after_crash = on
+]);
+$node_primary->start;
+
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Setup first standby.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby1');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Setup second standby.
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->start;
+
+# Create table
+$node_primary->safe_psql('postgres', 'CREATE TABLE prim_tab (a int);');
+
+# Create extention injection point
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+# Wait until the extension has been created on the standby
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# Attach the point in before deleteLSNWaiter.
+$node_standby->safe_psql('postgres',
+	"SELECT injection_points_attach('pg-wal-replay-wait', 'wait');");
+
+# Get log offset
+my $logstart = -s $node_standby->logfile;
+
+# Get current lsn from primary
+my $lsn_pr = $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000");
+
+# CALL pg_wal_replay_wait with timeout on standby1
+my $psql_session1 = $node_standby->background_psql('postgres');
+$psql_session1->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn_pr}',1000);
+]);
+
+# Wait till pg-wal-replay-wait apperied in pg_stat_activity
+$node_standby->wait_for_event('client backend', 'pg-wal-replay-wait');
+
+# Generate some WAL
+$node_primary->safe_psql('postgres', 'INSERT INTO prim_tab VALUES (1);');
+$node_primary->wait_for_replay_catchup($node_standby);
+$node_primary->wait_for_replay_catchup($node_standby2);
+
+# CALL pg_wal_replay_wait with timeout on standby2
+my $psql_session2 = $node_standby2->background_psql('postgres');
+$psql_session2->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn_pr}',1000);
+]);
+
+# Wake up to check the message
+$node_standby->safe_psql('postgres',
+	"SELECT injection_points_wakeup('pg-wal-replay-wait');");
+
+# Check the log
+ok( $node_standby->log_contains(
+		"timed out while waiting for target LSN", $logstart),
+	"pg_wal_replay_wait timeout");
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4f57078d13..85859158cb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3106,6 +3106,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
#94Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#93)
1 attachment(s)

Hi, Ivan!

On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Hi, Alexander, Here, I made some improvements according to your
discussion with Heikki.

On 2024-04-11 18:09, Alexander Korotkov wrote:

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

In a nutshell, it's possible for the loop in WaitForLSN to exit
without
cleaning up the process from the heap. I was able to hit that by
adding
a delay after the addLSNWaiter() call:

TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]

I think there's a similar race condition if the timeout is reached at
the same time that the startup process wakes up the process.

Thank you for catching this. I think WaitForLSN() just needs to call
deleteLSNWaiter() unconditionally after exit from the loop.

Fix and add injection point test on this race condition.

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

The docs could use some-copy-editing, but just to point out one issue:

There are also procedures to control the progress of recovery.

That's copy-pasted from an earlier sentence at the table that lists
functions like pg_promote(), pg_wal_replay_pause(), and
pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control
the
progress of recovery like those functions do, it only causes the
calling
backend to wait.

Fix documentation and add extra tests on multi-standby replication
and cascade replication.

Thank you for the revised patch.

I see couple of items which are not addressed in this revision.
* As Heikki pointed, that it's currently not possible in one round
trip to call call pg_wal_replay_wait() and do other job. The attached
patch addresses this. It milds the requirement. Now, it's not
necessary to be in atomic context. It's only necessary to have no
active snapshot. This new requirement works for me so far. I
appreciate a feedback on this.
* As Alvaro pointed, multiple waiters case isn't covered by the test
suite. That leads to no coverage of some code paths. The attached
patch addresses this by adding a test case with multiple waiters.

The rest looks good to me.

Links.
1. /messages/by-id/d1303584-b763-446c-9409-f4516118219f@iki.fi
2./messages/by-id/202404051815.eri4u5q6oj26@alvherre.pgsql

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v19-0001-Implement-pg_wal_replay_wait-stored-procedure.patchapplication/x-patch; name=v19-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
From 0229f55f8e63688f6660d07600dd6918e0e55b91 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 13 Jun 2024 00:39:45 +0300
Subject: [PATCH v19] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working without an active snapshot,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
---
 doc/src/sgml/func.sgml                        | 112 ++++++
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 345 ++++++++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/catalog/pg_proc.dat               |   5 +
 src/include/commands/waitlsn.h                |  77 ++++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/test/recovery/meson.build                 |   2 +
 src/test/recovery/t/043_wal_replay_wait.pl    | 222 +++++++++++
 .../t/044_wal_replay_wait_injection_test.pl   |  98 +++++
 src/tools/pgindent/typedefs.list              |   2 +
 20 files changed, 924 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wal_replay_wait.pl
 create mode 100644 src/test/recovery/t/044_wal_replay_wait_injection_test.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 17c44bc3384..87149b0facd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28762,6 +28762,118 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    The procedure shown in <xref linkend="procedures-recovery-control-table"/>
+    can be executed only during recovery.
+   </para>
+
+   <table id="procedures-recovery-control-table">
+    <title>Recovery Procedures</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        If <parameter>timeout</parameter> is not specified or zero, this
+        procedure returns once WAL is replayed upto
+        <literal>target_lsn</literal>.
+        If the <parameter>timeout</parameter> is specified (in
+        milliseconds) and greater than zero, the procedure waits until the
+        server actually replays the WAL upto <literal>target_lsn</literal> or
+        until the given time has passed. On timeout, an error is emitted.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+    </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can't be used within
+     the transaction, another procedure or function.  Any of them holds a
+     snapshot, which could prevent the replay of WAL records.
+
+   <programlisting>
+postgres=# BEGIN;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+ERROR:  pg_wal_replay_wait() must be only called without active snapshot
+DETAIL:  Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4f4ce757623..b79b31d6708 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -38,6 +38,7 @@
 #include "commands/async.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
+#include "commands/waitlsn.h"
 #include "common/pg_prng.h"
 #include "executor/spi.h"
 #include "libpq/be-fsstubs.h"
@@ -2771,6 +2772,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f2..ffc0f3d2391 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "commands/waitlsn.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
 #include "executor/instrument.h"
@@ -6129,6 +6130,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before achieving the target LSN.
+	 */
+	WaitLSNSetLatches(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index b45b8331720..a96603cdd76 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minWaitedLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index ae099e328c2..623b9539b15 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..065720825d9
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,345 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/injection_point.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/wait_event_types.h"
+
+static int	lsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+					void *arg);
+
+struct WaitLSNState *waitLSN = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSN->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSN->waitersHeap, lsn_cmp, NULL);
+		memset(&waitLSN->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+lsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSN->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSN->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSN->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procnum = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSN->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSN->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
+ * LSN waiters when InvalidXLogRecPtr is given.
+ */
+void
+WaitLSNSetLatches(XLogRecPtr currentLSN)
+{
+	int			i;
+	int		   *wakeUpProcNums;
+	int			numWakeUpProcs = 0;
+
+	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to set their latches later.
+	 */
+	while (!pairingheap_is_empty(&waitLSN->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSN->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		wakeUpProcNums[numWakeUpProcs++] = procInfo->procnum;
+		(void) pairingheap_remove_first(&waitLSN->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		backend = GetPGProcByNumber(wakeUpProcNums[i]);
+		SetLatch(&backend->procLatch);
+	}
+	pfree(wakeUpProcNums);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSN->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* If target LSN is already replayed, exit immediately */
+	if (targetLSN <= GetXLogReplayRecPtr(NULL))
+		return;
+
+	if (timeout > 0)
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		int			latch_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		currentLSN = GetXLogReplayRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is not in progress"),
+					 errdetail("Recovery ended before replaying the target LSN %X/%X; last replay LSN %X/%X.",
+							   LSN_FORMAT_ARGS(targetLSN),
+							   LSN_FORMAT_ARGS(currentLSN))));
+
+		if (timeout > 0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, latch_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Check if startup process is already replayed target lsn
+	 */
+	INJECTION_POINT("pg-wal-replay-wait");
+
+	deleteLSNWaiter();
+
+	if (targetLSN > currentLSN)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+						LSN_FORMAT_ARGS(targetLSN),
+						LSN_FORMAT_ARGS(currentLSN))));
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * pg_wal_replay_wait() is a procedure, not a function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/* Give up if there is still an active sanpshot. */
+	if (ActiveSnapshotSet())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called without active snapshot"),
+				 errdetail("Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.")));
+
+	/*
+	 * At second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index fe1deba13ec..7858e5e076b 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 521ed5418cc..90c84fec27a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventExtensionShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -357,6 +359,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventExtensionShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ce29da90121..18e07bf77d1 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 87cbca28118..f079d660a46 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -87,6 +87,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -345,6 +346,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6a5476d3c4c..5d83c4e919d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12156,6 +12156,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait for LSN with timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..da17b8be6f9
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,77 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "lib/pairingheap.h"
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo – the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/*
+	 * A process number, same as the index of this item in waitLSN->procInfos.
+	 * Stored for convenience.
+	 */
+	int			procnum;
+
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* A pairing heap node for participation in waitLSN->waitersHeap */
+	pairingheap_node phNode;
+
+	/* A flag indicating that this item is added to waitLSN->waitersHeap */
+	bool		inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes order by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr targetLSN, int64 timeout);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 7eade81535a..9e1c26033a1 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 85f6568b9e4..c2bab5a794e 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..3efe4645ac9 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,8 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wal_replay_wait.pl',
+      't/044_wal_replay_wait_injection_test.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
new file mode 100644
index 00000000000..e6700d8ac82
--- /dev/null
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -0,0 +1,222 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby1->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby1->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby1->start;
+
+# I
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# II
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+# III
+# Check two standby waiting LSN
+# Create a streaming second standby with a 1 second delay from the backup
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby2->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+my $output2 = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 40, "standby1 reached the same LSN as primary");
+ok($output2 eq 40, "standby2 reached the same LSN as primary");
+
+# IV
+# Create a cascading standby waiting LSN
+$backup_name = 'cas_backup';
+$node_standby1->backup($backup_name);
+
+my $cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+$cascading_standby->init_from_backup(
+	$node_standby1, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+my $cascading_connstr = $node_standby1->connstr;
+$cascading_standby->append_conf(
+	'postgresql.conf', qq(
+	hot_standby_feedback = on
+	recovery_min_apply_delay = '${delay}s'
+));
+
+$cascading_standby->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+$output2 = $cascading_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the current LSN on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 50, "standby1 reached the same LSN as primary");
+ok($output2 eq 50, "cascading_standby reached the same LSN as primary");
+
+# V
+# Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby1->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 10);");
+$node_standby1->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+
+# VI
+# Also, check the scenario of multiple LSN waiters.  We make 5 background psql
+# sessions each waiting for a corresponding insertion.  When waiting is
+# finished, stored procedures logs if there are visible as many rows as
+# should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 51 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+for (my $i = 0; $i < 5; $i++)
+{
+	print($i);
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	my $psql_session = $node_standby1->background_psql('postgres');
+	$psql_session->query_until(
+		qr/start/, qq[
+		\\echo start
+		CALL pg_wal_replay_wait('${lsn}');
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby1->logfile;
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby1->wait_for_log("count ${i}", $log_offset);
+}
+
+
+# VII
+# Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for unreachable LSN then promote.  Check the log for the relevant
+# error message.
+my $psql_session = $node_standby1->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn3}');
+]);
+
+$log_offset = -s $node_standby1->logfile;
+$node_standby1->promote;
+$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
+
+$node_standby1->stop;
+$node_standby2->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/044_wal_replay_wait_injection_test.pl b/src/test/recovery/t/044_wal_replay_wait_injection_test.pl
new file mode 100644
index 00000000000..2429ef0da25
--- /dev/null
+++ b/src/test/recovery/t/044_wal_replay_wait_injection_test.pl
@@ -0,0 +1,98 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use Test::More;
+
+##################################################
+# Test race condition when timeout reached and new wal replayed on standby.
+#
+# This test relies on an injection point that cause to wait before Delete
+# from heap and wait when new lsn added to heap.
+##################################################
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize primary node.
+my $node_primary = PostgreSQL::Test::Cluster->new('master');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf(
+	'postgresql.conf', q[
+restart_after_crash = on
+]);
+$node_primary->start;
+
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Setup first standby.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby1');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Setup second standby.
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->start;
+
+# Create table
+$node_primary->safe_psql('postgres', 'CREATE TABLE prim_tab (a int);');
+
+# Create extention injection point
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+# Wait until the extension has been created on the standby
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# Attach the point in before deleteLSNWaiter.
+$node_standby->safe_psql('postgres',
+	"SELECT injection_points_attach('pg-wal-replay-wait', 'wait');");
+
+# Get log offset
+my $logstart = -s $node_standby->logfile;
+
+# Get current lsn from primary
+my $lsn_pr = $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000");
+
+# CALL pg_wal_replay_wait with timeout on standby1
+my $psql_session1 = $node_standby->background_psql('postgres');
+$psql_session1->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn_pr}',1000);
+]);
+
+# Wait till pg-wal-replay-wait apperied in pg_stat_activity
+$node_standby->wait_for_event('client backend', 'pg-wal-replay-wait');
+
+# Generate some WAL
+$node_primary->safe_psql('postgres', 'INSERT INTO prim_tab VALUES (1);');
+$node_primary->wait_for_replay_catchup($node_standby);
+$node_primary->wait_for_replay_catchup($node_standby2);
+
+# CALL pg_wal_replay_wait with timeout on standby2
+my $psql_session2 = $node_standby2->background_psql('postgres');
+$psql_session2->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn_pr}',1000);
+]);
+
+# Wake up to check the message
+$node_standby->safe_psql('postgres',
+	"SELECT injection_points_wakeup('pg-wal-replay-wait');");
+
+# Check the log
+ok( $node_standby->log_contains(
+		"timed out while waiting for target LSN", $logstart),
+	"pg_wal_replay_wait timeout");
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4f57078d133..85859158cb1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3106,6 +3106,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#95Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#94)

On Fri, Jun 14, 2024 at 3:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Jun 12, 2024 at 11:36 AM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Hi, Alexander, Here, I made some improvements according to your
discussion with Heikki.

On 2024-04-11 18:09, Alexander Korotkov wrote:

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

In a nutshell, it's possible for the loop in WaitForLSN to exit
without
cleaning up the process from the heap. I was able to hit that by
adding
a delay after the addLSNWaiter() call:

TRAP: failed Assert("!procInfo->inHeap"), File: "../src/backend/commands/waitlsn.c", Line: 114, PID: 1936152
postgres: heikki postgres [local] CALL(ExceptionalCondition+0xab)[0x55da1f68787b]
postgres: heikki postgres [local] CALL(+0x331ec8)[0x55da1f204ec8]
postgres: heikki postgres [local] CALL(WaitForLSN+0x139)[0x55da1f2052cc]
postgres: heikki postgres [local] CALL(pg_wal_replay_wait+0x18b)[0x55da1f2056e5]
postgres: heikki postgres [local] CALL(ExecuteCallStmt+0x46e)[0x55da1f18031a]
postgres: heikki postgres [local] CALL(standard_ProcessUtility+0x8cf)[0x55da1f4b26c9]

I think there's a similar race condition if the timeout is reached at
the same time that the startup process wakes up the process.

Thank you for catching this. I think WaitForLSN() just needs to call
deleteLSNWaiter() unconditionally after exit from the loop.

Fix and add injection point test on this race condition.

On Thu, Apr 11, 2024 at 1:46 AM Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

The docs could use some-copy-editing, but just to point out one issue:

There are also procedures to control the progress of recovery.

That's copy-pasted from an earlier sentence at the table that lists
functions like pg_promote(), pg_wal_replay_pause(), and
pg_is_wal_replay_paused(). The pg_wal_replay_wait() doesn't control
the
progress of recovery like those functions do, it only causes the
calling
backend to wait.

Fix documentation and add extra tests on multi-standby replication
and cascade replication.

Thank you for the revised patch.

I see couple of items which are not addressed in this revision.
* As Heikki pointed, that it's currently not possible in one round
trip to call call pg_wal_replay_wait() and do other job. The attached
patch addresses this. It milds the requirement. Now, it's not
necessary to be in atomic context. It's only necessary to have no
active snapshot. This new requirement works for me so far. I
appreciate a feedback on this.
* As Alvaro pointed, multiple waiters case isn't covered by the test
suite. That leads to no coverage of some code paths. The attached
patch addresses this by adding a test case with multiple waiters.

The rest looks good to me.

Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.

1. It's not clear why this test needs node_standby2 at all. It seems useless.
2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This
location seems to be unachievable in this test. So, it's not clear
what race condition this test could potentially detect.
3. I think it would make sense to check for the race condition
reported by Heikki. That is to insert the injection point at the
beginning of WaitLSNSetLatches().

Links.
1. /messages/by-id/CAPpHfdvGRssjqwX1+idm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw@mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase

#96Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alexander Korotkov (#95)

Hi, I looked through the patch and have some comments.

====== L68:
+ <title>Recovery Procedures</title>

It looks somewhat confusing and appears as if the section is intended
to explain how to perform recovery. Since this is the first built-in
procedure, I'm not sure how should this section be written. However,
the section immediately above is named "Recovery Control Functions",
so "Reocvery Synchronization Functions" would align better with the
naming of the upper section. (I don't believe we need to be so strcit
about the distinction between functions and procedures here.)

It looks strange that the procedure signature includes the return type.

====== L93:
+        If <parameter>timeout</parameter> is not specified or zero, this
+        procedure returns once WAL is replayed upto
+        <literal>target_lsn</literal>.
+        If the <parameter>timeout</parameter> is specified (in
+        milliseconds) and greater than zero, the procedure waits until the
+        server actually replays the WAL upto <literal>target_lsn</literal> or
+        until the given time has passed. On timeout, an error is emitted.

The first sentence should mention the main functionality. Following
precedents, it might be better to use something like "Waits until
recovery surpasses the specified LSN. If no timeout is specified or it
is set to zero, this procedure waits indefinitely for the LSN. If the
timeout is specified (in milliseconds) and is greater than zero, the
procedure waits until the LSN is reached or the specified time has
elapsed. On timeout, or if the server is promoted before the LSN is
reached, an error is emitted."

The detailed explanation that follows the above seems somewhat too
verbose to me, as other functions don't have such detailed examples.

====== L484
/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */

Here, I'm not quite sure what specifically spinlock (or mutex?) is
referring to. However, more importantly, shouldn't we explain that it
is okay not to use any locks at all, rather than mentioning that
spinlocks should not be used here? I found a similar case around
freelist.c:238, which is written as follows.

* Not acquiring ProcArrayLock here which is slightly icky. It's
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);

===== L518
+void
+WaitForLSN(XLogRecPtr targetLSN, int64 timeout)

This function is only called within the same module. I'm not sure if
we need to expose it. I we do, the name should probably be more
specific. I'm not quite sure if the division of functionality between
this function and its only caller function is appropriate. As a
possible refactoring, we could have WaitForLSN() just return the
result as [reached, timedout, promoted] and delegate prerequisition
checks and error reporting to the SQL function.

===== L524
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);

Seeing this assertion, I feel that the name "waitLSN" is a bit
obscure. How about renaming it to "waitLSNStates"?

===== L527
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);

This is somewhat excessive, causing a server crash when ending with an
error would suffice. By the way, if I execute "CALL
pg_wal_replay_wait('0/0')" on a logical wansender, the server crashes.
The condition doesn't seem appropriate.

===== L561
+ errdetail("Recovery ended before replaying the target LSN %X/%X; last replay LSN %X/%X.",

I don't think we need "the" before "target" in the above message.

===== L565
+		if (timeout > 0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;

"timeout" is immutable in the function. Therefore, we can calculate
"latch_events" before entering the loop. By the way, the name
'latch_events' seems a bit off. Latch is a kind of event the function
can wait for. Therefore, something like wait_events might be more
appropriate.

===== L567
+ delay_ms = (endtime - GetCurrentTimestamp()) / 1000;

We can use TimestampDifferenceMilliseconds() here.

==== L578
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);

I think we usually reset latches unconditionally after returning from
WaitLatch(), even when waiting for timeouts.

===== L756
+{ oid => '16387', descr => 'wait for LSN with timeout',

The description seems a bit off. While timeout is mentioned, the more
important characteristic that the LSN is the replay LSN is not
included.

===== L791
+ * WaitLSNProcInfo [e2 80 93] the shared memory structure representing information

This line contains a non-ascii character EN Dash(U+2013).

===== L798
+	 * A process number, same as the index of this item in waitLSN->procInfos.
+	 * Stored for convenience.
+	 */
+	int			procnum;

It is described as "(just) for convenience". However, it is referenced
by Startup to fetch the PGPROC entry for the waiter, which necessary
for Startup. That aside, why don't we hold (the pointer to) procLatch
instead of procnum? It makes things simpler and I believe it is our
standard practice.

===== L809
+	/* A flag indicating that this item is added to waitLSN->waitersHeap */
+	bool		inHeap;

The name "inHeap" seems too leteral and might be hard to understand in
most occurances. How about using "waiting" instead?

===== L920
+# I
+# Make sure that pg_wal_replay_wait() works: add new content to

Hmm. I feel that Arabic numerals look nicer than Greek ones here.

===== L940
+# Check that new data is visible after calling pg_wal_replay_wait()

On the other hand, the comment for the check for this test states that

+# Make sure the current LSN on standby and is the same as primary's
LSN +ok($output eq 30, "standby reached the same LSN as primary");

I think the first comment and the second should be consistent.

Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.

1. It's not clear why this test needs node_standby2 at all. It seems useless.

I agree with you. What we would need is a second *waiter client*
connecting to the same stanby rather than a second standby. I feel
like having a test where the first waiter is removed while multiple
waiters are waiting, as well as a test where promotion occurs under
the same circumstances.

2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This
location seems to be unachievable in this test. So, it's not clear
what race condition this test could potentially detect.
3. I think it would make sense to check for the race condition
reported by Heikki. That is to insert the injection point at the
beginning of WaitLSNSetLatches().

I think the race condition you mentioned refers to the inconsistency
between the inHeap flag and the pairing heap caused by a race
condition between timeout and wakeup (or perhaps other combinations?
I'm not sure which version of the patch the mentioned race condition
refers to). However, I imagine it is difficult to reliably reproduce
this condition. In that regard, in the latest patch, the coherence
between the inHeap flag and the pairing heap is protected by LWLock,
so I believe we no longer need that test.

regards.

Links.
1. /messages/by-id/CAPpHfdvGRssjqwX1+idm5Tu-eWsTcx6DthB2LhGqA1tZ29jJaw@mail.gmail.com

--
Kyotaro Horiguchi
NTT Open Source Software Center

#97Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Kyotaro Horiguchi (#96)
1 attachment(s)

Thank you for your interest in the patch.

On 2024-06-20 11:30, Kyotaro Horiguchi wrote:

Hi, I looked through the patch and have some comments.

====== L68:
+ <title>Recovery Procedures</title>

It looks somewhat confusing and appears as if the section is intended
to explain how to perform recovery. Since this is the first built-in
procedure, I'm not sure how should this section be written. However,
the section immediately above is named "Recovery Control Functions",
so "Reocvery Synchronization Functions" would align better with the
naming of the upper section. (I don't believe we need to be so strcit
about the distinction between functions and procedures here.)

It looks strange that the procedure signature includes the return type.

Good point, change
Recovery Procedures -> Recovery Synchronization Procedures

====== L93:
+        If <parameter>timeout</parameter> is not specified or zero, 
this
+        procedure returns once WAL is replayed upto
+        <literal>target_lsn</literal>.
+        If the <parameter>timeout</parameter> is specified (in
+        milliseconds) and greater than zero, the procedure waits until 
the
+        server actually replays the WAL upto 
<literal>target_lsn</literal> or
+        until the given time has passed. On timeout, an error is 
emitted.

The first sentence should mention the main functionality. Following
precedents, it might be better to use something like "Waits until
recovery surpasses the specified LSN. If no timeout is specified or it
is set to zero, this procedure waits indefinitely for the LSN. If the
timeout is specified (in milliseconds) and is greater than zero, the
procedure waits until the LSN is reached or the specified time has
elapsed. On timeout, or if the server is promoted before the LSN is
reached, an error is emitted."

The detailed explanation that follows the above seems somewhat too
verbose to me, as other functions don't have such detailed examples.

Please offer your description. I think it would be better.

====== L484
/*
+	 * Set latches for processes, whose waited LSNs are already replayed. 
This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */

Here, I'm not quite sure what specifically spinlock (or mutex?) is
referring to. However, more importantly, shouldn't we explain that it
is okay not to use any locks at all, rather than mentioning that
spinlocks should not be used here? I found a similar case around
freelist.c:238, which is written as follows.

* Not acquiring ProcArrayLock here which is slightly icky. It's
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);

???

===== L518
+void
+WaitForLSN(XLogRecPtr targetLSN, int64 timeout)

This function is only called within the same module. I'm not sure if
we need to expose it. I we do, the name should probably be more
specific. I'm not quite sure if the division of functionality between
this function and its only caller function is appropriate. As a
possible refactoring, we could have WaitForLSN() just return the
result as [reached, timedout, promoted] and delegate prerequisition
checks and error reporting to the SQL function.

waitLSN -> waitLSNStates
No, waitLSNStates is not the best name, because waitLSNState is a state,
and waitLSN is not the array of waitLSNStates. We can think about
another name, what you think?

===== L524
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);

Seeing this assertion, I feel that the name "waitLSN" is a bit
obscure. How about renaming it to "waitLSNStates"?

===== L527
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);

This is somewhat excessive, causing a server crash when ending with an
error would suffice. By the way, if I execute "CALL
pg_wal_replay_wait('0/0')" on a logical wansender, the server crashes.
The condition doesn't seem appropriate.

Can you give more information on your server crashes, so I could repeat
them.

===== L565
+		if (timeout > 0)
+		{
+			delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+			latch_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;

"timeout" is immutable in the function. Therefore, we can calculate
"latch_events" before entering the loop. By the way, the name
'latch_events' seems a bit off. Latch is a kind of event the function
can wait for. Therefore, something like wait_events might be more
appropriate.

"wait_event" - it can't be, because in latch declaration, this events
responsible for wake up and not for wait
int WaitLatch(Latch *latch, int wakeEvents, long timeout, uint32
wait_event_info)

==== L578
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);

I think we usually reset latches unconditionally after returning from
WaitLatch(), even when waiting for timeouts.

No, it depends on you logic, when you have several wake_events and you
want to choose what event ignited your latch.
Check applyparallelworker.c:813

===== L798
+	 * A process number, same as the index of this item in 
waitLSN->procInfos.
+	 * Stored for convenience.
+	 */
+	int			procnum;

It is described as "(just) for convenience". However, it is referenced
by Startup to fetch the PGPROC entry for the waiter, which necessary
for Startup. That aside, why don't we hold (the pointer to) procLatch
instead of procnum? It makes things simpler and I believe it is our
standard practice.

Can you deeper explane what you meen and give the example?

===== L809
+	/* A flag indicating that this item is added to waitLSN->waitersHeap 
*/
+	bool		inHeap;

The name "inHeap" seems too leteral and might be hard to understand in
most occurances. How about using "waiting" instead?

No, inHeap leteral mean indeed inHeap. Check the comment.
Please suggest a more suitable one.

===== L940
+# Check that new data is visible after calling pg_wal_replay_wait()

On the other hand, the comment for the check for this test states that

+# Make sure the current LSN on standby and is the same as primary's
LSN +ok($output eq 30, "standby reached the same LSN as primary");

I think the first comment and the second should be consistent.

Thanks, I'll rephrase this comment

Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.

1. It's not clear why this test needs node_standby2 at all. It seems
useless.

I agree with you. What we would need is a second *waiter client*
connecting to the same stanby rather than a second standby. I feel
like having a test where the first waiter is removed while multiple
waiters are waiting, as well as a test where promotion occurs under
the same circumstances.

Can you give more information about this cases step by step, and what
means "remove" and "promotion".

2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This
location seems to be unachievable in this test. So, it's not clear
what race condition this test could potentially detect.
3. I think it would make sense to check for the race condition
reported by Heikki. That is to insert the injection point at the
beginning of WaitLSNSetLatches().

I think the race condition you mentioned refers to the inconsistency
between the inHeap flag and the pairing heap caused by a race
condition between timeout and wakeup (or perhaps other combinations?
I'm not sure which version of the patch the mentioned race condition
refers to). However, I imagine it is difficult to reliably reproduce
this condition. In that regard, in the latest patch, the coherence
between the inHeap flag and the pairing heap is protected by LWLock,
so I believe we no longer need that test.

No, Alexandre means that Heikki point on race condition just before
LWLock. But injection point we can inject and stepin on backend, and
WaitLSNSetLatches is used from Recovery process. But I have trouble
to wakeup injection point on server.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

Attachments:

v20-Implement-pg_wal_replay_wait-store.patchtext/x-diff; name=v20-Implement-pg_wal_replay_wait-store.patchDownload
From a825fbe7262d8405fe05f29aa94cbdc34f1c6f28 Mon Sep 17 00:00:00 2001
From: "i.kartyshov" <i.kartyshov@postrgespro.ru>
Date: Mon, 8 Jul 2024 20:42:19 +0300
Subject: [PATCH] Subject: [PATCH v20] Implement pg_wal_replay_wait() stored
 procedure

pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working without an active snapshot,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
---
 doc/src/sgml/func.sgml                        | 109 ++++++
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 346 ++++++++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/catalog/pg_proc.dat               |   5 +
 src/include/commands/waitlsn.h                |  77 ++++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  22 ++
 src/test/recovery/meson.build                 |   2 +
 src/test/recovery/t/043_wal_replay_wait.pl    | 222 +++++++++++
 .../t/044_wal_replay_wait_injection_test.pl   |  86 +++++
 src/tools/pgindent/typedefs.list              |   2 +
 21 files changed, 932 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wal_replay_wait.pl
 create mode 100644 src/test/recovery/t/044_wal_replay_wait_injection_test.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 93ee3d4b60..12117df1ab 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28816,6 +28816,115 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    The procedure shown in <xref linkend="procedures-recovery-control-table"/>
+    can be executed only during recovery.
+   </para>
+
+   <table id="recovery-synchronization-procedure-table">
+    <title>Recovery Synchronization Procedure</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        If <parameter>timeout</parameter> is not specified or zero, this
+        procedure returns once WAL is replayed upto
+        <literal>target_lsn</literal>.
+        If the <parameter>timeout</parameter> is specified (in
+        milliseconds) and greater than zero, the procedure waits until the
+        server actually replays the WAL upto <literal>target_lsn</literal> or
+        until the given time has passed. On timeout, an error is emitted.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+    </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can be used within
+     the transaction.
+
+   <programlisting>
+postgres=# BEGIN;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d119ab909d..dfc8cf2dcf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -38,6 +38,7 @@
 #include "commands/async.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
+#include "commands/waitlsn.h"
 #include "common/pg_prng.h"
 #include "executor/spi.h"
 #include "libpq/be-fsstubs.h"
@@ -2809,6 +2810,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 33e27a6e72..7c82e2cd8b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "commands/waitlsn.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
 #include "executor/instrument.h"
@@ -6130,6 +6131,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before achieving the target LSN.
+	 */
+	WaitLSNSetLatches(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 2ed3ea2b45..443b6c2802 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSN &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSN->minWaitedLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index ae099e328c..623b9539b1 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..cede90c3b9 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abd..7549be5dc3 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 0000000000..a545b6f230
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,346 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/injection_point.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/wait_event_types.h"
+
+static int	lsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+					void *arg);
+
+struct WaitLSNState *waitLSN = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSN = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+											   WaitLSNShmemSize(),
+											   &found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSN->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSN->waitersHeap, lsn_cmp, NULL);
+		memset(&waitLSN->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+lsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSN->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSN->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSN->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->procnum = MyProcNumber;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSN->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSN->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSN->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
+ * LSN waiters when InvalidXLogRecPtr is given.
+ */
+void
+WaitLSNSetLatches(XLogRecPtr currentLSN)
+{
+	int			i;
+	int		   *wakeUpProcNums;
+	int			numWakeUpProcs = 0;
+
+	wakeUpProcNums = palloc(sizeof(int) * MaxBackends);
+
+	/*
+	 * Check if startup process is already replayed target lsn
+	 */
+	INJECTION_POINT("pg-wal-replay-set-latch");
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process numbers to set their latches later.
+	 */
+	while (!pairingheap_is_empty(&waitLSN->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSN->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		wakeUpProcNums[numWakeUpProcs++] = procInfo->procnum;
+		(void) pairingheap_remove_first(&waitLSN->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. This
+	 * involves spinlocks.  So, we shouldn't do this under a spinlock.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		PGPROC	   *backend;
+
+		backend = GetPGProcByNumber(wakeUpProcNums[i]);
+		SetLatch(&backend->procLatch);
+	}
+	pfree(wakeUpProcNums);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSN->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+void
+WaitForLSN(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSN);
+
+	/* Should be only called by a backend */
+	Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* If target LSN is already replayed, exit immediately */
+	if (targetLSN <= GetXLogReplayRecPtr(NULL))
+		return;
+
+	if (timeout > 0)
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		int			wake_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		currentLSN = GetXLogReplayRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is not in progress"),
+					 errdetail("Recovery ended before replaying the target LSN %X/%X; last replay LSN %X/%X.",
+							   LSN_FORMAT_ARGS(targetLSN),
+							   LSN_FORMAT_ARGS(currentLSN))));
+
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(
+						GetCurrentTimestamp(),endtime);
+			wake_events |= WL_TIMEOUT;
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	deleteLSNWaiter();
+
+	if (targetLSN > currentLSN)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+						LSN_FORMAT_ARGS(targetLSN),
+						LSN_FORMAT_ARGS(currentLSN))));
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * pg_wal_replay_wait() is a procedure, not a function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/* Give up if there is still an active sanpshot. */
+	if (ActiveSnapshotSet())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called without active snapshot"),
+				 errdetail("Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.")));
+
+	/*
+	 * At second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	(void) WaitForLSN(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index fe1deba13e..7858e5e076 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2100150f01..b180ae97af 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -357,6 +359,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1b23efb26f..ac66da8638 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index db37beeaae..d10ca723dc 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -87,6 +87,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -345,6 +346,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0bf413fe05..1140ac5a07 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12161,6 +12161,11 @@
   prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary',
   prosrc => 'brin_minmax_multi_summary_send' },
 
+{ oid => '16387', descr => 'wait with timeout for target LSN to replayed on standby',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6291', descr => 'arbitrary value from among input values',
   proname => 'any_value', prokind => 'a', proisstrict => 'f',
   prorettype => 'anyelement', proargtypes => 'anyelement',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 0000000000..a19409e4b3
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,77 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "lib/pairingheap.h"
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/*
+	 * A process number, same as the index of this item in waitLSN->procInfos.
+	 * Stored for convenience.
+	 */
+	int			procnum;
+
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/* A pairing heap node for participation in waitLSN->waitersHeap */
+	pairingheap_node phNode;
+
+	/* A flag indicating that this item is added to waitLSN->waitersHeap */
+	bool		inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes order by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT struct WaitLSNState *waitLSN;
+
+extern void WaitForLSN(XLogRecPtr targetLSN, int64 timeout);
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 7eade81535..9e1c26033a 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54f..88dc79b2bd 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 0135c5a795..f05002e575 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2851,6 +2851,28 @@ sub wait_for_event
 
 =pod
 
+=item $node->wait_for_events(wait_event_name)
+
+Poll pg_stat_activity until reaches wait_event_name.
+
+=cut
+
+sub wait_for_events
+{
+	my ($self, $wait_event_name) = @_;
+
+	$self->poll_query_until(
+		'postgres', qq[
+		SELECT count(*) > 0 FROM pg_stat_activity
+		WHERE wait_event = '$wait_event_name'
+	])
+	  or die
+	  qq(timed out when waiting for to reach wait event '$wait_event_name');
+
+	return;
+}
+=pod
+
 =item $node->wait_for_catchup(standby_name, mode, target_lsn)
 
 Wait for the replication connection with application_name standby_name until
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec..3efe4645ac 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,8 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wal_replay_wait.pl',
+      't/044_wal_replay_wait_injection_test.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
new file mode 100644
index 0000000000..cc8fb2079a
--- /dev/null
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -0,0 +1,222 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby1->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby1->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby1->start;
+
+# 1.
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# 2.
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby and is the same as primary's LSN
+ok($output eq 30, "standby reached the same LSN as primary");
+
+# 3.
+# Check two standby waiting LSN
+# Create a streaming second standby with a 1 second delay from the backup
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby2->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+my $output2 = $node_standby2->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 40, "standby1 reached the same LSN as primary");
+ok($output2 eq 40, "standby2 reached the same LSN as primary");
+
+# 4.
+# Create a cascading standby waiting LSN
+$backup_name = 'cas_backup';
+$node_standby1->backup($backup_name);
+
+my $cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+$cascading_standby->init_from_backup(
+	$node_standby1, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+my $cascading_connstr = $node_standby1->connstr;
+$cascading_standby->append_conf(
+	'postgresql.conf', qq(
+	hot_standby_feedback = on
+	recovery_min_apply_delay = '${delay}s'
+));
+
+$cascading_standby->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+$output2 = $cascading_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 50, "standby1 reached the same LSN as primary");
+ok($output2 eq 50, "cascading_standby reached the same LSN as primary");
+
+# 5.
+# Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby1->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 10);");
+$node_standby1->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+
+# 6.
+# Also, check the scenario of multiple LSN waiters.  We make 5 background psql
+# sessions each waiting for a corresponding insertion.  When waiting is
+# finished, stored procedures logs if there are visible as many rows as
+# should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 51 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+for (my $i = 0; $i < 5; $i++)
+{
+	print($i);
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	my $psql_session = $node_standby1->background_psql('postgres');
+	$psql_session->query_until(
+		qr/start/, qq[
+		\\echo start
+		CALL pg_wal_replay_wait('${lsn}');
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby1->logfile;
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby1->wait_for_log("count ${i}", $log_offset);
+}
+
+
+# 7.
+# Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for unreachable LSN then promote.  Check the log for the relevant
+# error message.
+my $psql_session = $node_standby1->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn3}');
+]);
+
+$log_offset = -s $node_standby1->logfile;
+$node_standby1->promote;
+$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
+
+$node_standby1->stop;
+$node_standby2->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/test/recovery/t/044_wal_replay_wait_injection_test.pl b/src/test/recovery/t/044_wal_replay_wait_injection_test.pl
new file mode 100644
index 0000000000..72d76df20b
--- /dev/null
+++ b/src/test/recovery/t/044_wal_replay_wait_injection_test.pl
@@ -0,0 +1,86 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use Test::More;
+
+##################################################
+# Test race condition when timeout reached and new wal replayed on standby.
+#
+# This test relies on an injection point that cause to wait before Delete
+# from heap and wait when new lsn added to heap.
+##################################################
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize primary node.
+my $node_primary = PostgreSQL::Test::Cluster->new('master');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf(
+	'postgresql.conf', q[
+restart_after_crash = on
+]);
+$node_primary->start;
+
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Setup first standby.
+my $node_standby = PostgreSQL::Test::Cluster->new('standby1');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Create table
+$node_primary->safe_psql('postgres', 'CREATE TABLE prim_tab (a int);');
+
+# Create extention injection point
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+# Wait until the extension has been created on the standby
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# Attach the point in before deleteLSNWaiter.
+$node_standby->safe_psql('postgres',
+	"SELECT injection_points_attach('pg-wal-replay-set-latch', 'wait');");
+
+# Get log offset
+my $logstart = -s $node_standby->logfile;
+
+# Get current lsn from primary
+my $lsn_pr = $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10;");
+
+$node_primary->safe_psql('postgres', 'INSERT INTO prim_tab VALUES (1);');
+# CALL pg_wal_replay_wait with timeout on standby
+my $psql_session1 = $node_standby->background_psql('postgres');
+$psql_session1->query_until(
+	qr/start/, qq[
+	CALL pg_wal_replay_wait('${lsn_pr}',10000);
+]);
+
+# Generate some WAL
+$node_primary->safe_psql('postgres', 'INSERT INTO prim_tab VALUES (1);');
+#$node_primary->wait_for_replay_catchup($node_standby);
+
+# Wait till pg-wal-replay-set-latch apperied in pg_stat_activity
+$node_standby->wait_for_events('client backend', 'pg-wal-replay-set-latch');
+$node_standby->safe_psql('postgres',
+	"SELECT count(*) > 0 FROM pg_stat_activity WHERE
+	wait_event_type = 'Extension' AND wait_event = 'pg-wal-replay-set-latch'");
+
+# Wake up to check the message
+$node_standby->safe_psql('postgres',
+	"SELECT injection_points_wakeup('pg-wal-replay-set-latch');");
+
+# Check the log
+ok( $node_standby->log_contains(
+		"timed out while waiting for target LSN", $logstart),
+	"pg_wal_replay_wait timeout");
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9320e4d808..1979b2f8f5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3107,6 +3107,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.34.1

#98Kartyshov Ivan
i.kartyshov@postgrespro.ru
In reply to: Kartyshov Ivan (#97)

I think the race condition you mentioned refers to the inconsistency
between the inHeap flag and the pairing heap caused by a race
condition between timeout and wakeup (or perhaps other combinations?
I'm not sure which version of the patch the mentioned race condition
refers to). However, I imagine it is difficult to reliably reproduce
this condition. In that regard, in the latest patch, the coherence
between the inHeap flag and the pairing heap is protected by LWLock,
so I believe we no longer need that test.

No, Alexandre means that Heikki point on race condition just before
LWLock. But injection point we can inject and stepin on backend, and
WaitLSNSetLatches is used from Recovery process. But I have trouble
to wakeup injection point on server.

One more thing. I want to point, when you move injection point to
contrib dir and do the same test (044_wal_replay_wait_injection_test.pl)
step by step with hands, wakeup works well.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

#99Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kartyshov Ivan (#97)
1 attachment(s)

Thanks to Kyotaro for the review. And thanks to Ivan for the patch
revision. I made another revision of the patch.

On Tue, Jul 9, 2024 at 12:51 PM Kartyshov Ivan
<i.kartyshov@postgrespro.ru> wrote:

Thank you for your interest in the patch.

On 2024-06-20 11:30, Kyotaro Horiguchi wrote:

Hi, I looked through the patch and have some comments.

====== L68:
+ <title>Recovery Procedures</title>

It looks somewhat confusing and appears as if the section is intended
to explain how to perform recovery. Since this is the first built-in
procedure, I'm not sure how should this section be written. However,
the section immediately above is named "Recovery Control Functions",
so "Reocvery Synchronization Functions" would align better with the
naming of the upper section. (I don't believe we need to be so strcit
about the distinction between functions and procedures here.)

It looks strange that the procedure signature includes the return type.

Good point, change
Recovery Procedures -> Recovery Synchronization Procedures

Thank you, looks good to me.

====== L93:
+        If <parameter>timeout</parameter> is not specified or zero,
this
+        procedure returns once WAL is replayed upto
+        <literal>target_lsn</literal>.
+        If the <parameter>timeout</parameter> is specified (in
+        milliseconds) and greater than zero, the procedure waits until
the
+        server actually replays the WAL upto
<literal>target_lsn</literal> or
+        until the given time has passed. On timeout, an error is
emitted.

The first sentence should mention the main functionality. Following
precedents, it might be better to use something like "Waits until
recovery surpasses the specified LSN. If no timeout is specified or it
is set to zero, this procedure waits indefinitely for the LSN. If the
timeout is specified (in milliseconds) and is greater than zero, the
procedure waits until the LSN is reached or the specified time has
elapsed. On timeout, or if the server is promoted before the LSN is
reached, an error is emitted."

The detailed explanation that follows the above seems somewhat too
verbose to me, as other functions don't have such detailed examples.

Please offer your description. I think it would be better.

Kyotaro actually provided a paragraph in his message. I've integrated
it into the patch.

====== L484
/*
+      * Set latches for processes, whose waited LSNs are already replayed.
This
+      * involves spinlocks.  So, we shouldn't do this under a spinlock.
+      */

Here, I'm not quite sure what specifically spinlock (or mutex?) is
referring to. However, more importantly, shouldn't we explain that it
is okay not to use any locks at all, rather than mentioning that
spinlocks should not be used here? I found a similar case around
freelist.c:238, which is written as follows.

* Not acquiring ProcArrayLock here which is slightly icky. It's
* actually fine because procLatch isn't ever freed, so we just can
* potentially set the wrong process' (or no process') latch.
*/
SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);

???

I've revised the comment.

===== L518
+void
+WaitForLSN(XLogRecPtr targetLSN, int64 timeout)

This function is only called within the same module. I'm not sure if
we need to expose it. I we do, the name should probably be more
specific. I'm not quite sure if the division of functionality between
this function and its only caller function is appropriate. As a
possible refactoring, we could have WaitForLSN() just return the
result as [reached, timedout, promoted] and delegate prerequisition
checks and error reporting to the SQL function.

I think WaitForLSNReplay() is better name. And it's not clear what
API we could need. So I would prefer to keep it static for now.

waitLSN -> waitLSNStates
No, waitLSNStates is not the best name, because waitLSNState is a state,
and waitLSN is not the array of waitLSNStates. We can think about
another name, what you think?

===== L524
+     /* Shouldn't be called when shmem isn't initialized */
+     Assert(waitLSN);

Seeing this assertion, I feel that the name "waitLSN" is a bit
obscure. How about renaming it to "waitLSNStates"?

I agree that waitLSN is too generic. I've renamed it to waitLSNState.

===== L527
+     /* Should be only called by a backend */
+     Assert(MyBackendType == B_BACKEND && MyProcNumber <= MaxBackends);

This is somewhat excessive, causing a server crash when ending with an
error would suffice. By the way, if I execute "CALL
pg_wal_replay_wait('0/0')" on a logical wansender, the server crashes.
The condition doesn't seem appropriate.

Can you give more information on your server crashes, so I could repeat
them.

I've rechecked this. We don't need to assert the process to be a
backend given that MaxBackends include background workers too.
Replaced this just with assert that MyProcNumber is valid.

===== L565
+             if (timeout > 0)
+             {
+                     delay_ms = (endtime - GetCurrentTimestamp()) / 1000;
+                     latch_events |= WL_TIMEOUT;
+                     if (delay_ms <= 0)
+                             break;

"timeout" is immutable in the function. Therefore, we can calculate
"latch_events" before entering the loop. By the way, the name
'latch_events' seems a bit off. Latch is a kind of event the function
can wait for. Therefore, something like wait_events might be more
appropriate.

"wait_event" - it can't be, because in latch declaration, this events
responsible for wake up and not for wait
int WaitLatch(Latch *latch, int wakeEvents, long timeout, uint32
wait_event_info)

I agree with change "latch_events" => "wake_events" by Ivan. I also
made change to calculate wake_events in advance as proposed by
Kyotaro.

==== L578
+             if (rc & WL_LATCH_SET)
+                     ResetLatch(MyLatch);

I think we usually reset latches unconditionally after returning from
WaitLatch(), even when waiting for timeouts.

No, it depends on you logic, when you have several wake_events and you
want to choose what event ignited your latch.
Check applyparallelworker.c:813

+1,
Not necessary to reset latch if it hasn't been set. A lot of places
where we do that conditionally.

===== L798
+      * A process number, same as the index of this item in
waitLSN->procInfos.
+      * Stored for convenience.
+      */
+     int                     procnum;

It is described as "(just) for convenience". However, it is referenced
by Startup to fetch the PGPROC entry for the waiter, which necessary
for Startup. That aside, why don't we hold (the pointer to) procLatch
instead of procnum? It makes things simpler and I believe it is our
standard practice.

Can you deeper explane what you meen and give the example?

I wrote the comment that we store procnum for convenience meaning that
alternatively we could calculate it as the offset in the
waitLSN->procInfos array. But I like your idea to keep latch instead.
This should give more flexibility for future advancements.

===== L809
+     /* A flag indicating that this item is added to waitLSN->waitersHeap
*/
+     bool            inHeap;

The name "inHeap" seems too leteral and might be hard to understand in
most occurances. How about using "waiting" instead?

No, inHeap leteral mean indeed inHeap. Check the comment.
Please suggest a more suitable one.

+1,
We use this flag for consistency during operations with heap. I don't
see a reason to rename this.

===== L940
+# Check that new data is visible after calling pg_wal_replay_wait()

On the other hand, the comment for the check for this test states that

+# Make sure the current LSN on standby and is the same as primary's
LSN +ok($output eq 30, "standby reached the same LSN as primary");

I think the first comment and the second should be consistent.

Thanks, I'll rephrase this comment

I had to revised this comment furthermore.

Oh, I forgot some notes about 044_wal_replay_wait_injection_test.pl.

1. It's not clear why this test needs node_standby2 at all. It seems
useless.

I agree with you. What we would need is a second *waiter client*
connecting to the same stanby rather than a second standby. I feel
like having a test where the first waiter is removed while multiple
waiters are waiting, as well as a test where promotion occurs under
the same circumstances.

Can you give more information about this cases step by step, and what
means "remove" and "promotion".

2. The target LSN is set to pg_current_wal_insert_lsn() + 10000. This
location seems to be unachievable in this test. So, it's not clear
what race condition this test could potentially detect.
3. I think it would make sense to check for the race condition
reported by Heikki. That is to insert the injection point at the
beginning of WaitLSNSetLatches().

I think the race condition you mentioned refers to the inconsistency
between the inHeap flag and the pairing heap caused by a race
condition between timeout and wakeup (or perhaps other combinations?
I'm not sure which version of the patch the mentioned race condition
refers to). However, I imagine it is difficult to reliably reproduce
this condition. In that regard, in the latest patch, the coherence
between the inHeap flag and the pairing heap is protected by LWLock,
so I believe we no longer need that test.

No, Alexandre means that Heikki point on race condition just before
LWLock. But injection point we can inject and stepin on backend, and
WaitLSNSetLatches is used from Recovery process. But I have trouble
to wakeup injection point on server.

My initial hope was to add elegant enough test that will check for
concurrency issues between startup process and lsn waiter process.
That test should check an issue reported by Heikki and hopefully more.
I've tried to revise 044_wal_replay_wait_injection_test.pl, but it
becomes cumbersome and finally unclear what it's intended to test.
This is why I finally decide to remote this.

Regarding test for multiple waiters, 043_wal_replay_wait.pl has some.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v21-0001-Implement-pg_wal_replay_wait-stored-procedure.patchapplication/octet-stream; name=v21-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
From eab9c9e7a8c4bd5a36d3281ff9424e6c9ea95cf1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 15 Jul 2024 03:53:51 +0300
Subject: [PATCH v21] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working without an active snapshot,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/func.sgml                        | 111 ++++++
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 341 ++++++++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/catalog/pg_proc.dat               |   6 +
 src/include/commands/waitlsn.h                |  80 ++++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wal_replay_wait.pl    | 222 ++++++++++++
 src/tools/pgindent/typedefs.list              |   2 +
 19 files changed, 824 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wal_replay_wait.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 785886af714..c44880f3a63 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28893,6 +28893,117 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    The procedure shown in <xref linkend="procedures-recovery-control-table"/>
+    can be executed only during recovery.
+   </para>
+
+   <table id="recovery-synchronization-procedure-table">
+    <title>Recovery Synchronization Procedure</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Waits until recovery replays <literal>target_lsn</literal>.
+        If no <parameter>timeout</parameter> is specified or it is set to
+        zero, this procedure waits indefinitely for the
+        <literal>target_lsn</literal>.  If the <parameter>timeout</parameter>
+        is specified (in milliseconds) and is greater than zero, the
+        procedure waits until <literal>target_lsn</literal> is reached or
+        the specified <parameter>timeout</parameter> has elapsed.
+        On timeout, or if the server is promoted before
+        <literal>target_lsn</literal> is reached, an error is emitted.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+    </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can be used within
+     the transaction.
+
+   <programlisting>
+postgres=# BEGIN;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d119ab909dc..dfc8cf2dcf2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -38,6 +38,7 @@
 #include "commands/async.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
+#include "commands/waitlsn.h"
 #include "common/pg_prng.h"
 #include "executor/spi.h"
 #include "libpq/be-fsstubs.h"
@@ -2809,6 +2810,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 33e27a6e72c..7c82e2cd8b3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "commands/waitlsn.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
 #include "executor/instrument.h"
@@ -6130,6 +6131,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before achieving the target LSN.
+	 */
+	WaitLSNSetLatches(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 2ed3ea2b45b..ad817fbca67 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index ae099e328c2..623b9539b15 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..89dd5a9ccd3
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,341 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/wait_event_types.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->latch = MyLatch;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
+ * LSN waiters when InvalidXLogRecPtr is given.
+ */
+void
+WaitLSNSetLatches(XLogRecPtr currentLSN)
+{
+	int			i;
+	Latch	  **wakeUpProcLatches;
+	int			numWakeUpProcs = 0;
+
+	wakeUpProcLatches = palloc(sizeof(Latch *) * MaxBackends);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process latches to set them later.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		wakeUpProcLatches[numWakeUpProcs++] = procInfo->latch;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. As
+	 * the time consuming operations, we do it this outside of WaitLSNLock.
+	 * This is  actually fine because procLatch isn't ever freed, so we just
+	 * can potentially set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		SetLatch(wakeUpProcLatches[i]);
+	}
+	pfree(wakeUpProcLatches);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+static void
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* If target LSN is already replayed, exit immediately */
+	if (targetLSN <= GetXLogReplayRecPtr(NULL))
+		return;
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		currentLSN = GetXLogReplayRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is not in progress"),
+					 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+							   LSN_FORMAT_ARGS(targetLSN),
+							   LSN_FORMAT_ARGS(currentLSN))));
+
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(
+													   GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	deleteLSNWaiter();
+
+	if (targetLSN > currentLSN)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+						LSN_FORMAT_ARGS(targetLSN),
+						LSN_FORMAT_ARGS(currentLSN))));
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * pg_wal_replay_wait() is a procedure, not a function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/* Give up if there is still an active sanpshot. */
+	if (ActiveSnapshotSet())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called without active snapshot"),
+				 errdetail("Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.")));
+
+	/*
+	 * At second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	(void) WaitForLSNReplay(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index fe1deba13ec..7858e5e076b 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2100150f01c..b180ae97af8 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -357,6 +359,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1b23efb26f3..ac66da8638f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index db37beeaae6..d10ca723dc8 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -87,6 +87,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -345,6 +346,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 73d9cf85826..9174fae05ff 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6554,6 +6554,12 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '16387',
+  descr => 'wait for the target LSN to be replayed on standby with an optional timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6224', descr => 'get resource managers loaded in system',
   proname => 'pg_get_wal_resource_managers', prorows => '50', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..f719feadb05
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,80 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "lib/pairingheap.h"
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/latch.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/*
+	 * A pointer to the latch, which should be set once the waitLSN is
+	 * replayed.
+	 */
+	Latch	   *latch;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes order by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 7eade81535a..9e1c26033a1 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..88dc79b2bd6 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..712924c2fad 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wal_replay_wait.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
new file mode 100644
index 00000000000..dd8e5ab96d2
--- /dev/null
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -0,0 +1,222 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby1->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby1->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby1->start;
+
+# 1.
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# 2.
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok($output eq 30, "standby reached the same LSN as primary");
+
+# 3.
+# Check two standby waiting LSN
+# Create a streaming second standby with a 1 second delay from the backup
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby2->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+my $output2 = $node_standby2->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 40, "standby1 reached the same LSN as primary");
+ok($output2 eq 40, "standby2 reached the same LSN as primary");
+
+# 4.
+# Create a cascading standby waiting LSN
+$backup_name = 'cas_backup';
+$node_standby1->backup($backup_name);
+
+my $cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+$cascading_standby->init_from_backup(
+	$node_standby1, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+my $cascading_connstr = $node_standby1->connstr;
+$cascading_standby->append_conf(
+	'postgresql.conf', qq(
+	hot_standby_feedback = on
+	recovery_min_apply_delay = '${delay}s'
+));
+
+$cascading_standby->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+$output2 = $cascading_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 50, "standby1 reached the same LSN as primary");
+ok($output2 eq 50, "cascading_standby reached the same LSN as primary");
+
+# 5.
+# Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby1->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 10);");
+$node_standby1->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+
+# 6.
+# Also, check the scenario of multiple LSN waiters.  We make 5 background psql
+# sessions each waiting for a corresponding insertion.  When waiting is
+# finished, stored procedures logs if there are visible as many rows as
+# should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 51 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+for (my $i = 0; $i < 5; $i++)
+{
+	print($i);
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	my $psql_session = $node_standby1->background_psql('postgres');
+	$psql_session->query_until(
+		qr/start/, qq[
+		\\echo start
+		CALL pg_wal_replay_wait('${lsn}');
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby1->logfile;
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby1->wait_for_log("count ${i}", $log_offset);
+}
+
+
+# 7.
+# Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for unreachable LSN then promote.  Check the log for the relevant
+# error message.
+my $psql_session = $node_standby1->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn3}');
+]);
+
+$log_offset = -s $node_standby1->logfile;
+$node_standby1->promote;
+$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
+
+$node_standby1->stop;
+$node_standby2->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 635e6d6e215..73fafc69f26 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3109,6 +3109,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#100Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#99)
1 attachment(s)

On Mon, Jul 15, 2024 at 4:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Thanks to Kyotaro for the review. And thanks to Ivan for the patch
revision. I made another revision of the patch.

I've noticed failures on cfbot. The attached revision addressed docs
build failure. Also it adds some "quits" for background psql sessions
for tests. Probably this will address test hangs on windows.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v22-0001-Implement-pg_wal_replay_wait-stored-procedure.patchapplication/octet-stream; name=v22-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
From d97ff383c9b535ccacd9a959c7b41536fdc1e7b4 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 15 Jul 2024 03:53:51 +0300
Subject: [PATCH v22] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.

The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working without an active snapshot,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/func.sgml                        | 111 ++++++
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 341 ++++++++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/catalog/pg_proc.dat               |   6 +
 src/include/commands/waitlsn.h                |  80 ++++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wal_replay_wait.pl    | 224 ++++++++++++
 src/tools/pgindent/typedefs.list              |   2 +
 19 files changed, 826 insertions(+), 3 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wal_replay_wait.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 785886af714..6a688af5fee 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28893,6 +28893,117 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    The procedure shown in <xref linkend="recovery-synchronization-procedure-table"/>
+    can be executed only during recovery.
+   </para>
+
+   <table id="recovery-synchronization-procedure-table">
+    <title>Recovery Synchronization Procedure</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Waits until recovery replays <literal>target_lsn</literal>.
+        If no <parameter>timeout</parameter> is specified or it is set to
+        zero, this procedure waits indefinitely for the
+        <literal>target_lsn</literal>.  If the <parameter>timeout</parameter>
+        is specified (in milliseconds) and is greater than zero, the
+        procedure waits until <literal>target_lsn</literal> is reached or
+        the specified <parameter>timeout</parameter> has elapsed.
+        On timeout, or if the server is promoted before
+        <literal>target_lsn</literal> is reached, an error is emitted.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+    </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can be used within
+     the transaction.
+
+   <programlisting>
+postgres=# BEGIN;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d119ab909dc..dfc8cf2dcf2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -38,6 +38,7 @@
 #include "commands/async.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
+#include "commands/waitlsn.h"
 #include "common/pg_prng.h"
 #include "executor/spi.h"
 #include "libpq/be-fsstubs.h"
@@ -2809,6 +2810,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 33e27a6e72c..7c82e2cd8b3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "commands/waitlsn.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
 #include "executor/instrument.h"
@@ -6130,6 +6131,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before achieving the target LSN.
+	 */
+	WaitLSNSetLatches(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 2ed3ea2b45b..ad817fbca67 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index ae099e328c2..623b9539b15 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..89dd5a9ccd3
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,341 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/wait_event_types.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->latch = MyLatch;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
+ * LSN waiters when InvalidXLogRecPtr is given.
+ */
+void
+WaitLSNSetLatches(XLogRecPtr currentLSN)
+{
+	int			i;
+	Latch	  **wakeUpProcLatches;
+	int			numWakeUpProcs = 0;
+
+	wakeUpProcLatches = palloc(sizeof(Latch *) * MaxBackends);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process latches to set them later.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		wakeUpProcLatches[numWakeUpProcs++] = procInfo->latch;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. As
+	 * the time consuming operations, we do it this outside of WaitLSNLock.
+	 * This is  actually fine because procLatch isn't ever freed, so we just
+	 * can potentially set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		SetLatch(wakeUpProcLatches[i]);
+	}
+	pfree(wakeUpProcLatches);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+static void
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* If target LSN is already replayed, exit immediately */
+	if (targetLSN <= GetXLogReplayRecPtr(NULL))
+		return;
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		currentLSN = GetXLogReplayRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is not in progress"),
+					 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+							   LSN_FORMAT_ARGS(targetLSN),
+							   LSN_FORMAT_ARGS(currentLSN))));
+
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(
+													   GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	deleteLSNWaiter();
+
+	if (targetLSN > currentLSN)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+						LSN_FORMAT_ARGS(targetLSN),
+						LSN_FORMAT_ARGS(currentLSN))));
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * pg_wal_replay_wait() is a procedure, not a function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/* Give up if there is still an active sanpshot. */
+	if (ActiveSnapshotSet())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called without active snapshot"),
+				 errdetail("Make sure pg_wal_replay_wait() isn't called within a transaction, another procedure, or a function.")));
+
+	/*
+	 * At second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	(void) WaitForLSNReplay(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index fe1deba13ec..7858e5e076b 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2100150f01c..b180ae97af8 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -357,6 +359,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1b23efb26f3..ac66da8638f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index db37beeaae6..d10ca723dc8 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -87,6 +87,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -345,6 +346,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 73d9cf85826..9174fae05ff 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6554,6 +6554,12 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '16387',
+  descr => 'wait for the target LSN to be replayed on standby with an optional timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6224', descr => 'get resource managers loaded in system',
   proname => 'pg_get_wal_resource_managers', prorows => '50', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..f719feadb05
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,80 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "lib/pairingheap.h"
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/latch.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/*
+	 * A pointer to the latch, which should be set once the waitLSN is
+	 * replayed.
+	 */
+	Latch	   *latch;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes order by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 7eade81535a..9e1c26033a1 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..88dc79b2bd6 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..712924c2fad 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wal_replay_wait.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
new file mode 100644
index 00000000000..a1ae074e56b
--- /dev/null
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -0,0 +1,224 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby1->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby1->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby1->start;
+
+# 1.
+# Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# 2.
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok($output eq 30, "standby reached the same LSN as primary");
+
+# 3.
+# Check two standby waiting LSN
+# Create a streaming second standby with a 1 second delay from the backup
+my $node_standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$node_standby2->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby2->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby2->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(31, 40))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+my $output2 = $node_standby2->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 40, "standby1 reached the same LSN as primary");
+ok($output2 eq 40, "standby2 reached the same LSN as primary");
+
+# 4.
+# Create a cascading standby waiting LSN
+$backup_name = 'cas_backup';
+$node_standby1->backup($backup_name);
+
+my $cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+$cascading_standby->init_from_backup(
+	$node_standby1, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+
+my $cascading_connstr = $node_standby1->connstr;
+$cascading_standby->append_conf(
+	'postgresql.conf', qq(
+	hot_standby_feedback = on
+	recovery_min_apply_delay = '${delay}s'
+));
+
+$cascading_standby->start;
+
+# Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(41, 50))");
+$lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+$output2 = $cascading_standby->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby and standby2 are the same as
+# primary's LSN
+ok($output eq 50, "standby1 reached the same LSN as primary");
+ok($output2 eq 50, "cascading_standby reached the same LSN as primary");
+
+# 5.
+# Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby1->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 10);");
+$node_standby1->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+
+# 6.
+# Also, check the scenario of multiple LSN waiters.  We make 5 background psql
+# sessions each waiting for a corresponding insertion.  When waiting is
+# finished, stored procedures logs if there are visible as many rows as
+# should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 51 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	print($i);
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby1->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		CALL pg_wal_replay_wait('${lsn}');
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby1->logfile;
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby1->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+
+# 7.
+# Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for unreachable LSN then promote.  Check the log for the relevant
+# error message.
+my $psql_session = $node_standby1->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn3}');
+]);
+
+$log_offset = -s $node_standby1->logfile;
+$node_standby1->promote;
+$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
+
+$node_standby1->stop;
+$node_standby2->stop;
+$node_primary->stop;
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 635e6d6e215..73fafc69f26 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3109,6 +3109,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#101Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#100)
1 attachment(s)

On Mon, Jul 15, 2024 at 2:02 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Jul 15, 2024 at 4:24 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Thanks to Kyotaro for the review. And thanks to Ivan for the patch
revision. I made another revision of the patch.

I've noticed failures on cfbot. The attached revision addressed docs
build failure. Also it adds some "quits" for background psql sessions
for tests. Probably this will address test hangs on windows.

I made the following changes to the patch.

1) I've changed the check for snapshot in pg_wal_replay_wait(). Now
it checks that GetOldestSnapshot() returns NULL. It happens when both
ActiveSnapshot is NULL and RegisteredSnapshots pairing heap is empty.
This is the same condition when SnapshotResetXmin() sets out xmin to
invalid. Thus, we are not preventing WAL from replay. This should be
satisfied when pg_wal_replay_wait() isn't called within a transaction
with an isolation level higher than READ COMMITTED, another procedure,
or a function. Documented it this way.

2) Explicitly document in the PortalRunUtility() comment that
pg_wal_replay_wait() is another case when active snapshot gets
released.

3) I've removed tests with cascading replication. It's rather unclear
what problem these tests could potentially spot.

4) Did some improvements to docs, comments and commit message to make
them consistent with the patch contents.

The commit to pg17 was inconsiderate. But I feel this patch got much
better shape. Especially, now it's clear when the
pg_wal_replay_wait() procedure can be used. So, I dare commit this to
pg18 if nobody objects.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v23-0001-Implement-pg_wal_replay_wait-stored-procedure.patchapplication/x-patch; name=v23-0001-Implement-pg_wal_replay_wait-stored-procedure.patchDownload
From f6713100b6231f2d124a98021d9fbc353caad0db Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 31 Jul 2024 19:13:55 +0300
Subject: [PATCH v23] Implement pg_wal_replay_wait() stored procedure

pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee to see
these changes are on standby.

The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top.  During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.

pg_wal_replay_wait() needs to wait without any snapshot held.  Otherwise,
the snapshot could prevent the replay of WAL records, implying a kind of
self-deadlock.  This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working without an active snapshot,
not a function.

Catversion is bumped.

Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
---
 doc/src/sgml/func.sgml                        | 117 ++++++
 src/backend/access/transam/xact.c             |   6 +
 src/backend/access/transam/xlog.c             |   7 +
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/catalog/system_functions.sql      |   3 +
 src/backend/commands/Makefile                 |   3 +-
 src/backend/commands/meson.build              |   1 +
 src/backend/commands/waitlsn.c                | 363 ++++++++++++++++++
 src/backend/lib/pairingheap.c                 |  18 +-
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/lmgr/proc.c               |   6 +
 src/backend/tcop/pquery.c                     |   9 +-
 .../utils/activity/wait_event_names.txt       |   2 +
 src/include/catalog/pg_proc.dat               |   6 +
 src/include/commands/waitlsn.h                |  80 ++++
 src/include/lib/pairingheap.h                 |   3 +
 src/include/storage/lwlocklist.h              |   1 +
 src/test/recovery/meson.build                 |   1 +
 src/test/recovery/t/043_wal_replay_wait.pl    | 150 ++++++++
 src/tools/pgindent/typedefs.list              |   2 +
 20 files changed, 785 insertions(+), 7 deletions(-)
 create mode 100644 src/backend/commands/waitlsn.c
 create mode 100644 src/include/commands/waitlsn.h
 create mode 100644 src/test/recovery/t/043_wal_replay_wait.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b39f97dc8de..3cf896b22fa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28911,6 +28911,123 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     the pause, the rate of WAL generation and available disk space.
    </para>
 
+   <para>
+    The procedure shown in <xref linkend="recovery-synchronization-procedure-table"/>
+    can be executed only during recovery.
+   </para>
+
+   <table id="recovery-synchronization-procedure-table">
+    <title>Recovery Synchronization Procedure</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Procedure
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_wal_replay_wait</primary>
+        </indexterm>
+        <function>pg_wal_replay_wait</function> (
+          <parameter>target_lsn</parameter> <type>pg_lsn</type>,
+          <parameter>timeout</parameter> <type>bigint</type> <literal>DEFAULT</literal> <literal>0</literal>)
+        <returnvalue>void</returnvalue>
+       </para>
+       <para>
+        Waits until recovery replays <literal>target_lsn</literal>.
+        If no <parameter>timeout</parameter> is specified or it is set to
+        zero, this procedure waits indefinitely for the
+        <literal>target_lsn</literal>.  If the <parameter>timeout</parameter>
+        is specified (in milliseconds) and is greater than zero, the
+        procedure waits until <literal>target_lsn</literal> is reached or
+        the specified <parameter>timeout</parameter> has elapsed.
+        On timeout, or if the server is promoted before
+        <literal>target_lsn</literal> is reached, an error is emitted.
+       </para></entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_wal_replay_wait</function> waits till
+    <parameter>target_lsn</parameter> to be replayed on standby.
+    That is, after this function execution, the value returned by
+    <function>pg_last_wal_replay_lsn</function> should be greater or equal
+    to the <parameter>target_lsn</parameter> value.  This is useful to achieve
+    read-your-writes-consistency, while using async replica for reads and
+    primary for writes.  In that case <acronym>lsn</acronym> of the last
+    modification should be stored on the client application side or the
+    connection pooler side.
+   </para>
+
+   <para>
+    You can use <function>pg_wal_replay_wait</function> to wait for
+    the <type>pg_lsn</type> value.  For example, an application could update
+    the <literal>movie</literal> table and get the <acronym>lsn</acronym> after
+    changes just made.  This example uses <function>pg_current_wal_insert_lsn</function>
+    on primary server to get the <acronym>lsn</acronym> given that
+    <varname>synchronous_commit</varname> could be set to
+    <literal>off</literal>.
+
+   <programlisting>
+postgres=# UPDATE movie SET genre = 'Dramatic' WHERE genre = 'Drama';
+UPDATE 100
+postgres=# SELECT pg_current_wal_insert_lsn();
+pg_current_wal_insert_lsn
+--------------------
+0/306EE20
+(1 row)
+   </programlisting>
+
+   Then an application could run <function>pg_wal_replay_wait</function>
+   with the <acronym>lsn</acronym> obtained from primary.  After that the
+   changes made of primary should be guaranteed to be visible on replica.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20');
+CALL
+postgres=# SELECT * FROM movie WHERE genre = 'Drama';
+ genre
+-------
+(0 rows)
+   </programlisting>
+
+   It may also happen that target <acronym>lsn</acronym> is not achieved
+   within the timeout.  In that case the error is thrown.
+
+   <programlisting>
+postgres=# CALL pg_wal_replay_wait('0/306EE20', 100);
+ERROR:  timed out while waiting for target LSN 0/306EE20 to be replayed; current replay LSN 0/306EA60
+    </programlisting>
+
+   </para>
+
+   <para>
+     <function>pg_wal_replay_wait</function> can't be used within
+     a transaction with an isolation level higher than
+     <literal>READ COMMITTED</literal>, another procedure, or a function.
+     All the cases above imply holding a snapshot, which could prevent
+     WAL records from replaying (see <xref linkend="hot-standby-conflict"/>)
+     and cause an indirect deadlock.
+
+   <programlisting>
+postgres=# BEGIN;
+BEGIN
+postgres=*# CALL pg_wal_replay_wait('0/306EE20');
+ERROR:  pg_wal_replay_wait() must be only called without an active or registered snapshot
+DETAIL:  Make sure pg_wal_replay_wait() isn't called within a transaction with an isolation level higher than READ COMMITTED, another procedure, or a function.
+   </programlisting>
+
+   </para>
   </sect2>
 
   <sect2 id="functions-snapshot-synchronization">
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d119ab909dc..dfc8cf2dcf2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -38,6 +38,7 @@
 #include "commands/async.h"
 #include "commands/tablecmds.h"
 #include "commands/trigger.h"
+#include "commands/waitlsn.h"
 #include "common/pg_prng.h"
 #include "executor/spi.h"
 #include "libpq/be-fsstubs.h"
@@ -2809,6 +2810,11 @@ AbortTransaction(void)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Clear wait information and command progress indicator */
 	pgstat_report_wait_end();
 	pgstat_progress_end_command();
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f86f4b5c4b7..d7fd1c9823f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
+#include "commands/waitlsn.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
 #include "executor/instrument.h"
@@ -6143,6 +6144,12 @@ StartupXLOG(void)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Wake up all waiters for replay LSN.  They need to report an error that
+	 * recovery was ended before achieving the target LSN.
+	 */
+	WaitLSNSetLatches(InvalidXLogRecPtr);
+
 	/*
 	 * Shutdown the recovery environment.  This must occur after
 	 * RecoverPreparedTransactions() (see notes in lock_twophase_recover())
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 2ed3ea2b45b..ad817fbca67 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,16 @@ PerformWalRecovery(void)
 				break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for then walk
+			 * over the shared memory array and set latches to notify the
+			 * waiters.
+			 */
+			if (waitLSNState &&
+				(XLogRecoveryCtl->lastReplayedEndRecPtr >=
+				 pg_atomic_read_u64(&waitLSNState->minWaitedLSN)))
+				WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index ae099e328c2..623b9539b15 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -414,6 +414,9 @@ CREATE OR REPLACE FUNCTION
   json_populate_recordset(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
   RETURNS SETOF anyelement LANGUAGE internal STABLE ROWS 100  AS 'json_populate_recordset' PARALLEL SAFE;
 
+CREATE OR REPLACE PROCEDURE pg_wal_replay_wait(target_lsn pg_lsn, timeout int8 DEFAULT 0)
+  LANGUAGE internal AS 'pg_wal_replay_wait';
+
 CREATE OR REPLACE FUNCTION pg_logical_slot_get_changes(
     IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int, VARIADIC options text[] DEFAULT '{}',
     OUT lsn pg_lsn, OUT xid xid, OUT data text)
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91c..cede90c3b98 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	waitlsn.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 6dd00a4abde..7549be5dc3b 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'waitlsn.c',
 )
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
new file mode 100644
index 00000000000..3170f0792a5
--- /dev/null
+++ b/src/backend/commands/waitlsn.c
@@ -0,0 +1,363 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.c
+ *	  Implements waiting for the given replay LSN, which is used in
+ *	  CALL pg_wal_replay_wait(target_lsn pg_lsn, timeout float8).
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/waitlsn.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <float.h>
+#include <math.h>
+
+#include "pgstat.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "commands/waitlsn.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+#include "utils/snapmgr.h"
+#include "utils/wait_event_types.h"
+
+static int	waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b,
+						void *arg);
+
+struct WaitLSNState *waitLSNState = NULL;
+
+/* Report the amount of shared memory space needed for WaitLSNState. */
+Size
+WaitLSNShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitLSNState, procInfos);
+	size = add_size(size, mul_size(MaxBackends, sizeof(WaitLSNProcInfo)));
+	return size;
+}
+
+/* Initialize the WaitLSNState in the shared memory. */
+void
+WaitLSNShmemInit(void)
+{
+	bool		found;
+
+	waitLSNState = (WaitLSNState *) ShmemInitStruct("WaitLSNState",
+													WaitLSNShmemSize(),
+													&found);
+	if (!found)
+	{
+		pg_atomic_init_u64(&waitLSNState->minWaitedLSN, PG_UINT64_MAX);
+		pairingheap_initialize(&waitLSNState->waitersHeap, waitlsn_cmp, NULL);
+		memset(&waitLSNState->procInfos, 0, MaxBackends * sizeof(WaitLSNProcInfo));
+	}
+}
+
+/*
+ * Comparison function for waitLSN->waitersHeap heap.  Waiting processes are
+ * ordered by lsn, so that the waiter with smallest lsn is at the top.
+ */
+static int
+waitlsn_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
+{
+	const WaitLSNProcInfo *aproc = pairingheap_const_container(WaitLSNProcInfo, phNode, a);
+	const WaitLSNProcInfo *bproc = pairingheap_const_container(WaitLSNProcInfo, phNode, b);
+
+	if (aproc->waitLSN < bproc->waitLSN)
+		return 1;
+	else if (aproc->waitLSN > bproc->waitLSN)
+		return -1;
+	else
+		return 0;
+}
+
+/*
+ * Update waitLSN->minWaitedLSN according to the current state of
+ * waitLSN->waitersHeap.
+ */
+static void
+updateMinWaitedLSN(void)
+{
+	XLogRecPtr	minWaitedLSN = PG_UINT64_MAX;
+
+	if (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+
+		minWaitedLSN = pairingheap_container(WaitLSNProcInfo, phNode, node)->waitLSN;
+	}
+
+	pg_atomic_write_u64(&waitLSNState->minWaitedLSN, minWaitedLSN);
+}
+
+/*
+ * Put the current process into the heap of LSN waiters.
+ */
+static void
+addLSNWaiter(XLogRecPtr lsn)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	Assert(!procInfo->inHeap);
+
+	procInfo->latch = MyLatch;
+	procInfo->waitLSN = lsn;
+
+	pairingheap_add(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = true;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Remove the current process from the heap of LSN waiters if it's there.
+ */
+static void
+deleteLSNWaiter(void)
+{
+	WaitLSNProcInfo *procInfo = &waitLSNState->procInfos[MyProcNumber];
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	if (!procInfo->inHeap)
+	{
+		LWLockRelease(WaitLSNLock);
+		return;
+	}
+
+	pairingheap_remove(&waitLSNState->waitersHeap, &procInfo->phNode);
+	procInfo->inHeap = false;
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+}
+
+/*
+ * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
+ * LSN waiters when InvalidXLogRecPtr is given.
+ */
+void
+WaitLSNSetLatches(XLogRecPtr currentLSN)
+{
+	int			i;
+	Latch	  **wakeUpProcLatches;
+	int			numWakeUpProcs = 0;
+
+	wakeUpProcLatches = palloc(sizeof(Latch *) * MaxBackends);
+
+	LWLockAcquire(WaitLSNLock, LW_EXCLUSIVE);
+
+	/*
+	 * Iterate the pairing heap of waiting processes till we find LSN not yet
+	 * replayed.  Record the process latches to set them later.
+	 */
+	while (!pairingheap_is_empty(&waitLSNState->waitersHeap))
+	{
+		pairingheap_node *node = pairingheap_first(&waitLSNState->waitersHeap);
+		WaitLSNProcInfo *procInfo = pairingheap_container(WaitLSNProcInfo, phNode, node);
+
+		if (!XLogRecPtrIsInvalid(currentLSN) &&
+			procInfo->waitLSN > currentLSN)
+			break;
+
+		wakeUpProcLatches[numWakeUpProcs++] = procInfo->latch;
+		(void) pairingheap_remove_first(&waitLSNState->waitersHeap);
+		procInfo->inHeap = false;
+	}
+
+	updateMinWaitedLSN();
+
+	LWLockRelease(WaitLSNLock);
+
+	/*
+	 * Set latches for processes, whose waited LSNs are already replayed. As
+	 * the time consuming operations, we do it this outside of WaitLSNLock.
+	 * This is  actually fine because procLatch isn't ever freed, so we just
+	 * can potentially set the wrong process' (or no process') latch.
+	 */
+	for (i = 0; i < numWakeUpProcs; i++)
+	{
+		SetLatch(wakeUpProcLatches[i]);
+	}
+	pfree(wakeUpProcLatches);
+}
+
+/*
+ * Delete our item from shmem array if any.
+ */
+void
+WaitLSNCleanup(void)
+{
+	/*
+	 * We do a fast-path check of the 'inHeap' flag without the lock.  This
+	 * flag is set to true only by the process itself.  So, it's only possible
+	 * to get a false positive.  But that will be eliminated by a recheck
+	 * inside deleteLSNWaiter().
+	 */
+	if (waitLSNState->procInfos[MyProcNumber].inHeap)
+		deleteLSNWaiter();
+}
+
+/*
+ * Wait using MyLatch till the given LSN is replayed, the postmaster dies or
+ * timeout happens.
+ */
+static void
+WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
+{
+	XLogRecPtr	currentLSN;
+	TimestampTz endtime = 0;
+	int			wake_events = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH;
+
+	/* Shouldn't be called when shmem isn't initialized */
+	Assert(waitLSNState);
+
+	/* Should have a valid proc number */
+	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
+
+	if (!RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is not in progress"),
+				 errhint("Waiting for LSN can only be executed during recovery.")));
+
+	/* If target LSN is already replayed, exit immediately */
+	if (targetLSN <= GetXLogReplayRecPtr(NULL))
+		return;
+
+	if (timeout > 0)
+	{
+		endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), timeout);
+		wake_events |= WL_TIMEOUT;
+	}
+
+	/*
+	 * Add our process to the pairing heap of waiters.  It might happen that
+	 * target LSN gets replayed before we do.  Another check at the beginning
+	 * of the loop below prevents the race condition.
+	 */
+	addLSNWaiter(targetLSN);
+
+	for (;;)
+	{
+		int			rc;
+		long		delay_ms = 0;
+
+		/* Check if the waited LSN has been replayed */
+		currentLSN = GetXLogReplayRecPtr(NULL);
+		if (targetLSN <= currentLSN)
+			break;
+
+		/* Recheck that recovery is still in-progress */
+		if (!RecoveryInProgress())
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("recovery is not in progress"),
+					 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
+							   LSN_FORMAT_ARGS(targetLSN),
+							   LSN_FORMAT_ARGS(currentLSN))));
+
+		/*
+		 * If the timeout value is specified, calculate the number of
+		 * milliseconds before the timeout.  Exit if the timeout is already
+		 * achieved.
+		 */
+		if (timeout > 0)
+		{
+			delay_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), endtime);
+			if (delay_ms <= 0)
+				break;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		rc = WaitLatch(MyLatch, wake_events, delay_ms,
+					   WAIT_EVENT_WAIT_FOR_WAL_REPLAY);
+
+		if (rc & WL_LATCH_SET)
+			ResetLatch(MyLatch);
+	}
+
+	/*
+	 * Delete our process from the shared memory pairing heap.  We might
+	 * already be deleted by the startup process.  The 'inHeap' flag prevents
+	 * us from the double deletion.
+	 */
+	deleteLSNWaiter();
+
+	/*
+	 * If we didn't achieve the target LSN, we must be exited by timeout.
+	 */
+	if (targetLSN > currentLSN)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_QUERY_CANCELED),
+				 errmsg("timed out while waiting for target LSN %X/%X to be replayed; current replay LSN %X/%X",
+						LSN_FORMAT_ARGS(targetLSN),
+						LSN_FORMAT_ARGS(currentLSN))));
+	}
+}
+
+Datum
+pg_wal_replay_wait(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	target_lsn = PG_GETARG_LSN(0);
+	int64		timeout = PG_GETARG_INT64(1);
+
+	if (timeout < 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+				 errmsg("\"timeout\" must not be negative")));
+
+	/*
+	 * We are going to wait for the LSN replay.  We should first care that we
+	 * don't hold a snapshot and correspondingly our MyProc->xmin is invalid.
+	 * Otherwise, our snapshot could prevent the replay of WAL records
+	 * implying a kind of self-deadlock.  This is the reason why
+	 * pg_wal_replay_wait() is a procedure, not a function.
+	 *
+	 * At first, we should check there is no active snapshot.  According to
+	 * PlannedStmtRequiresSnapshot(), even in an atomic context, CallStmt is
+	 * processed with a snapshot.  Thankfully, we can pop this snapshot,
+	 * because PortalRunUtility() can tolerate this.
+	 */
+	if (ActiveSnapshotSet())
+		PopActiveSnapshot();
+
+	/*
+	 * At second, invalidate a catalog snapshot if any.  And we should be done
+	 * with the preparation.
+	 */
+	InvalidateCatalogSnapshot();
+
+	/* Give up if there is still an active or registered sanpshot. */
+	if (GetOldestSnapshot())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_wal_replay_wait() must be only called without an active or registered snapshot"),
+				 errdetail("Make sure pg_wal_replay_wait() isn't called within a transaction with an isolation level higher than READ COMMITTED, another procedure, or a function.")));
+
+	/*
+	 * As the result we should hold no snapshot, and correspondingly our xmin
+	 * should be unset.
+	 */
+	Assert(MyProc->xmin == InvalidTransactionId);
+
+	(void) WaitForLSNReplay(target_lsn, timeout);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
index fe1deba13ec..7858e5e076b 100644
--- a/src/backend/lib/pairingheap.c
+++ b/src/backend/lib/pairingheap.c
@@ -44,12 +44,26 @@ pairingheap_allocate(pairingheap_comparator compare, void *arg)
 	pairingheap *heap;
 
 	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	pairingheap_initialize(heap, compare, arg);
+
+	return heap;
+}
+
+/*
+ * pairingheap_initialize
+ *
+ * Same as pairingheap_allocate(), but initializes the pairing heap in-place
+ * rather than allocating a new chunk of memory.  Useful to store the pairing
+ * heap in a shared memory.
+ */
+void
+pairingheap_initialize(pairingheap *heap, pairingheap_comparator compare,
+					   void *arg)
+{
 	heap->ph_compare = compare;
 	heap->ph_arg = arg;
 
 	heap->ph_root = NULL;
-
-	return heap;
 }
 
 /*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index ca930af08f2..e3e7651d7b7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -152,6 +153,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, WaitEventCustomShmemSize());
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
+	size = add_size(size, WaitLSNShmemSize());
 #ifdef EXEC_BACKEND
 	size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -357,6 +359,7 @@ CreateOrAttachShmemStructs(void)
 	StatsShmemInit();
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
+	WaitLSNShmemInit();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 1b23efb26f3..ac66da8638f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -36,6 +36,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xlogutils.h"
+#include "commands/waitlsn.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -862,6 +863,11 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();
 
+	/*
+	 * Cleanup waiting for LSN if any.
+	 */
+	WaitLSNCleanup();
+
 	/* Cancel any pending condition variable sleep, too */
 	ConditionVariableCancelSleep();
 
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index 0c45fcf318f..a1f8d03db1e 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -1168,10 +1168,11 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
 	MemoryContextSwitchTo(portal->portalContext);
 
 	/*
-	 * Some utility commands (e.g., VACUUM) pop the ActiveSnapshot stack from
-	 * under us, so don't complain if it's now empty.  Otherwise, our snapshot
-	 * should be the top one; pop it.  Note that this could be a different
-	 * snapshot from the one we made above; see EnsurePortalSnapshotExists.
+	 * Some utility commands (e.g., VACUUM, CALL pg_wal_replay_wait()) pop the
+	 * ActiveSnapshot stack from under us, so don't complain if it's now
+	 * empty.  Otherwise, our snapshot should be the top one; pop it.  Note
+	 * that this could be a different snapshot from the one we made above; see
+	 * EnsurePortalSnapshotExists.
 	 */
 	if (portal->portalSnapshot != NULL && ActiveSnapshotSet())
 	{
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index db37beeaae6..d10ca723dc8 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -87,6 +87,7 @@ LIBPQWALRECEIVER_CONNECT	"Waiting in WAL receiver to establish connection to rem
 LIBPQWALRECEIVER_RECEIVE	"Waiting in WAL receiver to receive data from remote server."
 SSL_OPEN_SERVER	"Waiting for SSL while attempting connection."
 WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the physical standby."
+WAIT_FOR_WAL_REPLAY	"Waiting for a replay of the particular WAL position on the physical standby."
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
@@ -345,6 +346,7 @@ WALSummarizer	"Waiting to read or update WAL summarization state."
 DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
+WaitLSN	"Waiting to read or update shared Wait-for-LSN state."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 06b2f4ba66c..2dbf0fd6aaa 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6581,6 +6581,12 @@
   prorettype => 'text', proargtypes => '',
   prosrc => 'pg_get_wal_replay_pause_state' },
 
+{ oid => '16387',
+  descr => 'wait for the target LSN to be replayed on standby with an optional timeout',
+  proname => 'pg_wal_replay_wait', prokind => 'p', prorettype => 'void',
+  proargtypes => 'pg_lsn int8', proargnames => '{target_lsn,timeout}',
+  prosrc => 'pg_wal_replay_wait' },
+
 { oid => '6224', descr => 'get resource managers loaded in system',
   proname => 'pg_get_wal_resource_managers', prorows => '50', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',
diff --git a/src/include/commands/waitlsn.h b/src/include/commands/waitlsn.h
new file mode 100644
index 00000000000..f719feadb05
--- /dev/null
+++ b/src/include/commands/waitlsn.h
@@ -0,0 +1,80 @@
+/*-------------------------------------------------------------------------
+ *
+ * waitlsn.h
+ *	  Declarations for LSN replay waiting routines.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * src/include/commands/waitlsn.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WAIT_LSN_H
+#define WAIT_LSN_H
+
+#include "lib/pairingheap.h"
+#include "postgres.h"
+#include "port/atomics.h"
+#include "storage/latch.h"
+#include "storage/spin.h"
+#include "tcop/dest.h"
+
+/*
+ * WaitLSNProcInfo - the shared memory structure representing information
+ * about the single process, which may wait for LSN replay.  An item of
+ * waitLSN->procInfos array.
+ */
+typedef struct WaitLSNProcInfo
+{
+	/* LSN, which this process is waiting for */
+	XLogRecPtr	waitLSN;
+
+	/*
+	 * A pointer to the latch, which should be set once the waitLSN is
+	 * replayed.
+	 */
+	Latch	   *latch;
+
+	/* A pairing heap node for participation in waitLSNState->waitersHeap */
+	pairingheap_node phNode;
+
+	/*
+	 * A flag indicating that this item is present in
+	 * waitLSNState->waitersHeap
+	 */
+	bool		inHeap;
+} WaitLSNProcInfo;
+
+/*
+ * WaitLSNState - the shared memory state for the replay LSN waiting facility.
+ */
+typedef struct WaitLSNState
+{
+	/*
+	 * The minimum LSN value some process is waiting for.  Used for the
+	 * fast-path checking if we need to wake up any waiters after replaying a
+	 * WAL record.  Could be read lock-less.  Update protected by WaitLSNLock.
+	 */
+	pg_atomic_uint64 minWaitedLSN;
+
+	/*
+	 * A pairing heap of waiting processes order by LSN values (least LSN is
+	 * on top).  Protected by WaitLSNLock.
+	 */
+	pairingheap waitersHeap;
+
+	/*
+	 * An array with per-process information, indexed by the process number.
+	 * Protected by WaitLSNLock.
+	 */
+	WaitLSNProcInfo procInfos[FLEXIBLE_ARRAY_MEMBER];
+} WaitLSNState;
+
+extern PGDLLIMPORT WaitLSNState *waitLSNState;
+
+extern Size WaitLSNShmemSize(void);
+extern void WaitLSNShmemInit(void);
+extern void WaitLSNSetLatches(XLogRecPtr currentLSN);
+extern void WaitLSNCleanup(void);
+
+#endif							/* WAIT_LSN_H */
diff --git a/src/include/lib/pairingheap.h b/src/include/lib/pairingheap.h
index 7eade81535a..9e1c26033a1 100644
--- a/src/include/lib/pairingheap.h
+++ b/src/include/lib/pairingheap.h
@@ -77,6 +77,9 @@ typedef struct pairingheap
 
 extern pairingheap *pairingheap_allocate(pairingheap_comparator compare,
 										 void *arg);
+extern void pairingheap_initialize(pairingheap *heap,
+								   pairingheap_comparator compare,
+								   void *arg);
 extern void pairingheap_free(pairingheap *heap);
 extern void pairingheap_add(pairingheap *heap, pairingheap_node *node);
 extern pairingheap_node *pairingheap_first(pairingheap *heap);
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..88dc79b2bd6 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, WaitLSN)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..712924c2fad 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_wal_replay_wait.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
new file mode 100644
index 00000000000..e4842730b05
--- /dev/null
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -0,0 +1,150 @@
+# Checks waiting for the lsn replay on standby using
+# pg_wal_replay_wait() procedure.
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->start;
+
+# And some content and take a backup
+$node_primary->safe_psql('postgres',
+	"CREATE TABLE wait_test AS SELECT generate_series(1,10) AS a");
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create a streaming standby with a 1 second delay from the backup
+my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
+my $delay = 1;
+$node_standby1->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+$node_standby1->append_conf(
+	'postgresql.conf', qq[
+	recovery_min_apply_delay = '${delay}s'
+]);
+$node_standby1->start;
+
+# 1. Make sure that pg_wal_replay_wait() works: add new content to
+# primary and memorize primary's insert LSN, then wait for that LSN to be
+# replayed on standby.
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
+my $lsn1 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+my $output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn1}', 1000000);
+	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
+]);
+
+# Make sure the current LSN on standby is at least as big as the LSN we
+# observed on primary's before.
+ok($output >= 0,
+	"standby reached the same LSN as primary after pg_wal_replay_wait()");
+
+# 2. Check that new data is visible after calling pg_wal_replay_wait()
+$node_primary->safe_psql('postgres',
+	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
+my $lsn2 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
+$output = $node_standby1->safe_psql(
+	'postgres', qq[
+	CALL pg_wal_replay_wait('${lsn2}');
+	SELECT count(*) FROM wait_test;
+]);
+
+# Make sure the count(*) on standby reflects the recent changes on primary
+ok($output eq 30, "standby reached the same LSN as primary");
+
+# 3. Check that waiting for unreachable LSN triggers the timeout.  The
+# unreachable LSN must be well in advance.  So WAL records issued by
+# the concurrent autovacuum could not affect that.
+my $lsn3 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $stderr;
+$node_standby1->safe_psql('postgres',
+	"CALL pg_wal_replay_wait('${lsn2}', 10);");
+$node_standby1->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
+	stderr => \$stderr);
+ok( $stderr =~ /timed out while waiting for target LSN/,
+	"get timeout on waiting for unreachable LSN");
+
+
+# 4. Also, check the scenario of multiple LSN waiters.  We make 5 background
+# psql sessions each waiting for a corresponding insertion.  When waiting is
+# finished, stored procedures logs if there are visible as many rows as
+# should be.
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    SELECT count(*) FROM wait_test INTO count;
+    IF count >= 31 + i THEN
+      RAISE LOG 'count %', i;
+    END IF;
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+my @psql_sessions;
+for (my $i = 0; $i < 5; $i++)
+{
+	print($i);
+	$node_primary->safe_psql('postgres',
+		"INSERT INTO wait_test VALUES (${i});");
+	my $lsn =
+	  $node_primary->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn()");
+	$psql_sessions[$i] = $node_standby1->background_psql('postgres');
+	$psql_sessions[$i]->query_until(
+		qr/start/, qq[
+		\\echo start
+		CALL pg_wal_replay_wait('${lsn}');
+		SELECT log_count(${i});
+	]);
+}
+my $log_offset = -s $node_standby1->logfile;
+$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+for (my $i = 0; $i < 5; $i++)
+{
+	$node_standby1->wait_for_log("count ${i}", $log_offset);
+	$psql_sessions[$i]->quit;
+}
+
+ok(1, 'multiple LSN waiters reported consistent data');
+
+# 5. Check that the standby promotion terminates the wait on LSN.  Start
+# waiting for an unreachable LSN then promote.  Check the log for the relevant
+# error message.
+my $psql_session = $node_standby1->background_psql('postgres');
+$psql_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	CALL pg_wal_replay_wait('${lsn3}');
+]);
+
+$log_offset = -s $node_standby1->logfile;
+$node_standby1->promote;
+$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
+
+ok(1, 'got error after standby promote');
+
+$node_standby1->stop;
+$node_primary->stop;
+
+# If we send \q with $psql_session->quit the command can be sent to the session
+# already closed. So \q is in initial script, here we only finish IPC::Run.
+$psql_session->{run}->finish;
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3deb6113b80..aaea2468536 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3110,6 +3110,8 @@ WaitEventIO
 WaitEventIPC
 WaitEventSet
 WaitEventTimeout
+WaitLSNProcInfo
+WaitLSNState
 WaitPMResult
 WalCloseMethod
 WalCompression
-- 
2.39.3 (Apple Git-145)

#102Kevin Hale Boyes
kcboyes@gmail.com
In reply to: Alexander Korotkov (#101)

I noticed the commit and had a question and a comment.
There is a small problem in func.sgml in the sentence "After that the
changes made of primary". Should be "on primary".

In the for loop in WaitForLSNReplay, shouldn't the check for in-recovery be
moved up above the call to GetXLogReplayRecPtr?
If we get promoted while waiting for the timeout we could
call GetXLogReplayRecPtr while not in recovery.

Thanks,
Kevin.

#103Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kevin Hale Boyes (#102)

On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:

I noticed the commit and had a question and a comment.
There is a small problem in func.sgml in the sentence "After that the changes made of primary". Should be "on primary".

Thank you for spotting this. Will fix.

In the for loop in WaitForLSNReplay, shouldn't the check for in-recovery be moved up above the call to GetXLogReplayRecPtr?
If we get promoted while waiting for the timeout we could call GetXLogReplayRecPtr while not in recovery.

This is intentional. After standby gets promoted,
GetXLogReplayRecPtr() returns the last WAL position being replayed
while being standby. So, if standby reached target lsn before being
promoted, we don't have to throw an error.

But this gave me an idea that before the loop we probably need to put
RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
recheck that.

------
Regards,
Alexander Korotkov
Supabase

#104Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#103)
3 attachment(s)

On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:

In the for loop in WaitForLSNReplay, shouldn't the check for in-recovery be moved up above the call to GetXLogReplayRecPtr?
If we get promoted while waiting for the timeout we could call GetXLogReplayRecPtr while not in recovery.

This is intentional. After standby gets promoted,
GetXLogReplayRecPtr() returns the last WAL position being replayed
while being standby. So, if standby reached target lsn before being
promoted, we don't have to throw an error.

But this gave me an idea that before the loop we probably need to put
RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
recheck that.

The attached patchset comprises assorted improvements for pg_wal_replay_wait().

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after
GetXLogReplayRecPtr(), because more wal could be replayed between
these calls. Instead we need to recheck GetXLogReplayRecPtr() after
getting negative result of RecoveryInProgress() because WAL replay
position couldn't get updated after.
0002 patch comprises fix for the header comment of WaitLSNSetLatches() function
0003 patch comprises tests for pg_wal_replay_wait() errors.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v1-0002-Improve-header-comment-for-WaitLSNSetLatches.patchapplication/octet-stream; name=v1-0002-Improve-header-comment-for-WaitLSNSetLatches.patchDownload
From efcf5a059ac162e3e86b034d11954c899b9db302 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 04:26:44 +0300
Subject: [PATCH v1 2/3] Improve header comment for WaitLSNSetLatches()

Reflect the fact that we remove waiters from the heap, not just set their
latches.
---
 src/backend/commands/waitlsn.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 6651801dfb4..26a4181a9e5 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -147,8 +147,9 @@ deleteLSNWaiter(void)
 }
 
 /*
- * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
- * LSN waiters when InvalidXLogRecPtr is given.
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
  */
 void
 WaitLSNSetLatches(XLogRecPtr currentLSN)
-- 
2.39.3 (Apple Git-145)

v1-0001-Adjust-pg_wal_replay_wait-procedure-behavior-on-p.patchapplication/octet-stream; name=v1-0001-Adjust-pg_wal_replay_wait-procedure-behavior-on-p.patchDownload
From 1e84803cd111734538dcc630c6f96e23c762d655 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 03:52:39 +0300
Subject: [PATCH v1 1/3] Adjust pg_wal_replay_wait() procedure behavior on
 promoted standby

pg_wal_replay_wait() is intended to be called on standby.  However, standby
can be promoted to primary at any moment, even concurrently with the
pg_wal_replay_wait() call.  If recovery is not currently in progress
that doesn't mean the wait was unsuccessful.  Thus, we always need to recheck
if the target LSN is replayed.

Reported-by: Kevin Hale Boyes
Discussion: https://postgr.es/m/CAPpHfdu5QN%2BZGACS%2B7foxmr8_nekgA2PA%2B-G3BuOUrdBLBFb6Q%40mail.gmail.com
Author: Alexander Korotkov
---
 doc/src/sgml/func.sgml                     |  9 +++++
 src/backend/commands/waitlsn.c             | 41 +++++++++++++++++-----
 src/test/recovery/t/043_wal_replay_wait.pl | 15 ++++++--
 3 files changed, 54 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0f7154b76ab..968a9985527 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28969,6 +28969,15 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     connection pooler side.
    </para>
 
+   <para>
+    <function>pg_wal_replay_wait</function> should be called on standby.
+    If a user calls <function>pg_wal_replay_wait</function> on primary, it
+    will error out.  However, if <function>pg_wal_replay_wait</function> is
+    called on primary promoted from standby and <literal>target_lsn</literal>
+    was already replayed, then <function>pg_wal_replay_wait</function> just
+    exits immediately.
+   </para>
+
    <para>
     You can use <function>pg_wal_replay_wait</function> to wait for
     the <type>pg_lsn</type> value.  For example, an application could update
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 3170f0792a5..6651801dfb4 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -230,14 +230,27 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
 
 	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return;
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("recovery is not in progress"),
 				 errhint("Waiting for LSN can only be executed during recovery.")));
-
-	/* If target LSN is already replayed, exit immediately */
-	if (targetLSN <= GetXLogReplayRecPtr(NULL))
-		return;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return;
+	}
 
 	if (timeout > 0)
 	{
@@ -257,19 +270,29 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = 0;
 
-		/* Check if the waited LSN has been replayed */
-		currentLSN = GetXLogReplayRecPtr(NULL);
-		if (targetLSN <= currentLSN)
-			break;
-
 		/* Recheck that recovery is still in-progress */
 		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.
+			 */
+			if (targetLSN <= GetXLogReplayRecPtr(NULL))
+				return;
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("recovery is not in progress"),
 					 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
 							   LSN_FORMAT_ARGS(targetLSN),
 							   LSN_FORMAT_ARGS(currentLSN))));
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
 
 		/*
 		 * If the timeout value is specified, calculate the number of
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index e4842730b05..c3131acb75a 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -126,12 +126,18 @@ ok(1, 'multiple LSN waiters reported consistent data');
 
 # 5. Check that the standby promotion terminates the wait on LSN.  Start
 # waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.
+# error message.  Alsom check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 my $psql_session = $node_standby1->background_psql('postgres');
 $psql_session->query_until(
 	qr/start/, qq[
 	\\echo start
-	CALL pg_wal_replay_wait('${lsn3}');
+	CALL pg_wal_replay_wait('${lsn4}');
 ]);
 
 $log_offset = -s $node_standby1->logfile;
@@ -140,6 +146,11 @@ $node_standby1->wait_for_log('recovery is not in progress', $log_offset);
 
 ok(1, 'got error after standby promote');
 
+$node_standby1->safe_psql('postgres', "CALL pg_wal_replay_wait(${lsn5});");
+
+ok(1,
+	'wait for already replayed LSN exists immediately even after promotion');
+
 $node_standby1->stop;
 $node_primary->stop;
 
-- 
2.39.3 (Apple Git-145)

v1-0003-Add-tests-for-pg_wal_replay_wait-errors.patchapplication/octet-stream; name=v1-0003-Add-tests-for-pg_wal_replay_wait-errors.patchDownload
From b2c4d68ba5884adcf0c0d935a626af8301650b62 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 05:06:58 +0300
Subject: [PATCH v1 3/3] Add tests for pg_wal_replay_wait() errors

Improve test coverage for pg_wal_replay_wait() procedure by adding test
cases when it errors out.
---
 src/test/recovery/t/043_wal_replay_wait.pl | 40 ++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index c3131acb75a..06b641d1f5f 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -77,8 +77,44 @@ $node_standby1->psql(
 ok( $stderr =~ /timed out while waiting for target LSN/,
 	"get timeout on waiting for unreachable LSN");
 
+# 4. Check that pg_wal_replay_wait() triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
 
-# 4. Also, check the scenario of multiple LSN waiters.  We make 5 background
+$node_primary->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby1->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; CALL pg_wal_replay_wait('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~ /pg_wal_replay_wait() must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than REPEATABLE READ");
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  DECLARE
+    count int;
+  BEGIN
+    CALL pg_wal_replay_wait(target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_standby1->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~ /pg_wal_replay_wait() must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -124,7 +160,7 @@ for (my $i = 0; $i < 5; $i++)
 
 ok(1, 'multiple LSN waiters reported consistent data');
 
-# 5. Check that the standby promotion terminates the wait on LSN.  Start
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
 # waiting for an unreachable LSN then promote.  Check the log for the relevant
 # error message.  Alsom check that waiting for already replayed LSN doesn't
 # cause an error even after promotion.
-- 
2.39.3 (Apple Git-145)

#105Michael Paquier
michael@paquier.xyz
In reply to: Alexander Korotkov (#104)

On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after
GetXLogReplayRecPtr(), because more wal could be replayed between
these calls. Instead we need to recheck GetXLogReplayRecPtr() after
getting negative result of RecoveryInProgress() because WAL replay
position couldn't get updated after.
0002 patch comprises fix for the header comment of WaitLSNSetLatches() function
0003 patch comprises tests for pg_wal_replay_wait() errors.

Before adding more tests, could it be possible to stabilize what's in
the tree? drongo has reported one failure with the recovery test
043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2024-08-05%2004%3A24%3A54
--
Michael

#106Alexander Korotkov
aekorotkov@gmail.com
In reply to: Michael Paquier (#105)

On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after
GetXLogReplayRecPtr(), because more wal could be replayed between
these calls. Instead we need to recheck GetXLogReplayRecPtr() after
getting negative result of RecoveryInProgress() because WAL replay
position couldn't get updated after.
0002 patch comprises fix for the header comment of WaitLSNSetLatches() function
0003 patch comprises tests for pg_wal_replay_wait() errors.

Before adding more tests, could it be possible to stabilize what's in
the tree? drongo has reported one failure with the recovery test
043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2024-08-05%2004%3A24%3A54

Thank you for pointing!
Surely, I'll fix this before.

------
Regards,
Alexander Korotkov
Supabase

#107Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#106)

On Tue, Aug 6, 2024 at 11:18 AM Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz>

wrote:

On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after
GetXLogReplayRecPtr(), because more wal could be replayed between
these calls. Instead we need to recheck GetXLogReplayRecPtr() after
getting negative result of RecoveryInProgress() because WAL replay
position couldn't get updated after.
0002 patch comprises fix for the header comment of

WaitLSNSetLatches() function

0003 patch comprises tests for pg_wal_replay_wait() errors.

Before adding more tests, could it be possible to stabilize what's in
the tree? drongo has reported one failure with the recovery test
043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2024-08-05%2004%3A24%3A54

Thank you for pointing!
Surely, I'll fix this before.

Something breaks in these lines during second iteration of the loop.
"SELECT pg_current_wal_insert_lsn()" has been queried from primary, but
standby didn't receive "CALL pg_wal_replay_wait('...');"

for (my $i = 0; $i < 5; $i++)
{
print($i);
$node_primary->safe_psql('postgres',
"INSERT INTO wait_test VALUES (${i});");
my $lsn =
$node_primary->safe_psql('postgres',
"SELECT pg_current_wal_insert_lsn()");
$psql_sessions[$i] = $node_standby1->background_psql('postgres');
$psql_sessions[$i]->query_until(
qr/start/, qq[
\\echo start
CALL pg_wal_replay_wait('${lsn}');
SELECT log_count(${i});
]);
}

I wonder what could it be. Probably something hangs inside launching
background psql... I'll investigate this more.

------
Regards,
Alexander Korotkov
Supabase

#108Alexander Korotkov
aekorotkov@gmail.com
In reply to: Michael Paquier (#105)

On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after
GetXLogReplayRecPtr(), because more wal could be replayed between
these calls. Instead we need to recheck GetXLogReplayRecPtr() after
getting negative result of RecoveryInProgress() because WAL replay
position couldn't get updated after.
0002 patch comprises fix for the header comment of WaitLSNSetLatches() function
0003 patch comprises tests for pg_wal_replay_wait() errors.

Before adding more tests, could it be possible to stabilize what's in
the tree? drongo has reported one failure with the recovery test
043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2024-08-05%2004%3A24%3A54

I'm currently running a 043_wal_replay_wait test in a loop of drongo.
No failures during more than 10 hours. As I pointed in [1] it seems
that test stuck somewhere on launching BackgroundPsql. Given that
drongo have some strange failures from time to time (for instance [2]
or [3]), I doubt there is something specifically wrong in
043_wal_replay_wait test that caused the subject failure.

Therefore, while I'm going to continue looking at the reason of
failure on drongo in background, I'm going to go ahead with my
improvements for pg_wal_replay_wait().

Links.
1. /messages/by-id/CAPpHfduYkve0sw-qy4aCCmJv_MXfuuAQ7wyRQsX8NjaLVKDE1Q@mail.gmail.com
2. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2024-08-02%2010%3A34%3A45
3. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2024-06-06%2012%3A36%3A11

------
Regards,
Alexander Korotkov
Supabase

#109Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#104)
3 attachment(s)

On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:

In the for loop in WaitForLSNReplay, shouldn't the check for in-recovery be moved up above the call to GetXLogReplayRecPtr?
If we get promoted while waiting for the timeout we could call GetXLogReplayRecPtr while not in recovery.

This is intentional. After standby gets promoted,
GetXLogReplayRecPtr() returns the last WAL position being replayed
while being standby. So, if standby reached target lsn before being
promoted, we don't have to throw an error.

But this gave me an idea that before the loop we probably need to put
RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
recheck that.

The attached patchset comprises assorted improvements for pg_wal_replay_wait().

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after
GetXLogReplayRecPtr(), because more wal could be replayed between
these calls. Instead we need to recheck GetXLogReplayRecPtr() after
getting negative result of RecoveryInProgress() because WAL replay
position couldn't get updated after.
0002 patch comprises fix for the header comment of WaitLSNSetLatches() function
0003 patch comprises tests for pg_wal_replay_wait() errors.

Here is a revised version of the patchset. I've fixed some typos,
identation, etc. I'm going to push this once it passes cfbot.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v2-0002-Improve-header-comment-for-WaitLSNSetLatches.patchapplication/octet-stream; name=v2-0002-Improve-header-comment-for-WaitLSNSetLatches.patchDownload
From de4e07c1446d666d505f0b2049b86411be43ef3f Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 04:26:44 +0300
Subject: [PATCH v2 2/3] Improve header comment for WaitLSNSetLatches()

Reflect the fact that we remove waiters from the heap, not just set their
latches.
---
 src/backend/commands/waitlsn.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 6651801dfb4..26a4181a9e5 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -147,8 +147,9 @@ deleteLSNWaiter(void)
 }
 
 /*
- * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
- * LSN waiters when InvalidXLogRecPtr is given.
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
  */
 void
 WaitLSNSetLatches(XLogRecPtr currentLSN)
-- 
2.39.3 (Apple Git-146)

v2-0001-Adjust-pg_wal_replay_wait-procedure-behavior-on-p.patchapplication/octet-stream; name=v2-0001-Adjust-pg_wal_replay_wait-procedure-behavior-on-p.patchDownload
From a9b2d27914421cbbc522b3783f2b506e16559577 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 03:52:39 +0300
Subject: [PATCH v2 1/3] Adjust pg_wal_replay_wait() procedure behavior on
 promoted standby

pg_wal_replay_wait() is intended to be called on standby.  However, standby
can be promoted to primary at any moment, even concurrently with the
pg_wal_replay_wait() call.  If recovery is not currently in progress
that doesn't mean the wait was unsuccessful.  Thus, we always need to recheck
if the target LSN is replayed.

Reported-by: Kevin Hale Boyes
Discussion: https://postgr.es/m/CAPpHfdu5QN%2BZGACS%2B7foxmr8_nekgA2PA%2B-G3BuOUrdBLBFb6Q%40mail.gmail.com
Author: Alexander Korotkov
---
 doc/src/sgml/func.sgml                     |  9 +++++
 src/backend/commands/waitlsn.c             | 41 +++++++++++++++++-----
 src/test/recovery/t/043_wal_replay_wait.pl | 15 ++++++--
 3 files changed, 54 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0f7154b76ab..968a9985527 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28969,6 +28969,15 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     connection pooler side.
    </para>
 
+   <para>
+    <function>pg_wal_replay_wait</function> should be called on standby.
+    If a user calls <function>pg_wal_replay_wait</function> on primary, it
+    will error out.  However, if <function>pg_wal_replay_wait</function> is
+    called on primary promoted from standby and <literal>target_lsn</literal>
+    was already replayed, then <function>pg_wal_replay_wait</function> just
+    exits immediately.
+   </para>
+
    <para>
     You can use <function>pg_wal_replay_wait</function> to wait for
     the <type>pg_lsn</type> value.  For example, an application could update
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 3170f0792a5..6651801dfb4 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -230,14 +230,27 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
 
 	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return;
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("recovery is not in progress"),
 				 errhint("Waiting for LSN can only be executed during recovery.")));
-
-	/* If target LSN is already replayed, exit immediately */
-	if (targetLSN <= GetXLogReplayRecPtr(NULL))
-		return;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return;
+	}
 
 	if (timeout > 0)
 	{
@@ -257,19 +270,29 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = 0;
 
-		/* Check if the waited LSN has been replayed */
-		currentLSN = GetXLogReplayRecPtr(NULL);
-		if (targetLSN <= currentLSN)
-			break;
-
 		/* Recheck that recovery is still in-progress */
 		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.
+			 */
+			if (targetLSN <= GetXLogReplayRecPtr(NULL))
+				return;
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("recovery is not in progress"),
 					 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
 							   LSN_FORMAT_ARGS(targetLSN),
 							   LSN_FORMAT_ARGS(currentLSN))));
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
 
 		/*
 		 * If the timeout value is specified, calculate the number of
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index e4842730b05..aaa21c40867 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -126,12 +126,18 @@ ok(1, 'multiple LSN waiters reported consistent data');
 
 # 5. Check that the standby promotion terminates the wait on LSN.  Start
 # waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 my $psql_session = $node_standby1->background_psql('postgres');
 $psql_session->query_until(
 	qr/start/, qq[
 	\\echo start
-	CALL pg_wal_replay_wait('${lsn3}');
+	CALL pg_wal_replay_wait('${lsn4}');
 ]);
 
 $log_offset = -s $node_standby1->logfile;
@@ -140,6 +146,11 @@ $node_standby1->wait_for_log('recovery is not in progress', $log_offset);
 
 ok(1, 'got error after standby promote');
 
+$node_standby1->safe_psql('postgres', "CALL pg_wal_replay_wait('${lsn5}');");
+
+ok(1,
+	'wait for already replayed LSN exists immediately even after promotion');
+
 $node_standby1->stop;
 $node_primary->stop;
 
-- 
2.39.3 (Apple Git-146)

v2-0003-Add-tests-for-pg_wal_replay_wait-errors.patchapplication/octet-stream; name=v2-0003-Add-tests-for-pg_wal_replay_wait-errors.patchDownload
From dbf87c3128fe906f65204b8776fa3e492f1ef9d5 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 05:06:58 +0300
Subject: [PATCH v2 3/3] Add tests for pg_wal_replay_wait() errors

Improve test coverage for pg_wal_replay_wait() procedure by adding test
cases when it errors out.
---
 src/test/recovery/t/043_wal_replay_wait.pl | 42 ++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index aaa21c40867..1234ace1a78 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -77,8 +77,46 @@ $node_standby1->psql(
 ok( $stderr =~ /timed out while waiting for target LSN/,
 	"get timeout on waiting for unreachable LSN");
 
+# 4. Check that pg_wal_replay_wait() triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
 
-# 4. Also, check the scenario of multiple LSN waiters.  We make 5 background
+$node_primary->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby1->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; CALL pg_wal_replay_wait('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /pg_wal_replay_wait\(\) must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than REPEATABLE READ"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    CALL pg_wal_replay_wait(target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby1);
+$node_standby1->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /pg_wal_replay_wait\(\) must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -124,7 +162,7 @@ for (my $i = 0; $i < 5; $i++)
 
 ok(1, 'multiple LSN waiters reported consistent data');
 
-# 5. Check that the standby promotion terminates the wait on LSN.  Start
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
 # waiting for an unreachable LSN then promote.  Check the log for the relevant
 # error message.  Also, check that waiting for already replayed LSN doesn't
 # cause an error even after promotion.
-- 
2.39.3 (Apple Git-146)

#110Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#109)
3 attachment(s)

On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sat, Aug 3, 2024 at 6:07 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sat, Aug 3, 2024 at 3:45 AM Kevin Hale Boyes <kcboyes@gmail.com> wrote:

In the for loop in WaitForLSNReplay, shouldn't the check for in-recovery be moved up above the call to GetXLogReplayRecPtr?
If we get promoted while waiting for the timeout we could call GetXLogReplayRecPtr while not in recovery.

This is intentional. After standby gets promoted,
GetXLogReplayRecPtr() returns the last WAL position being replayed
while being standby. So, if standby reached target lsn before being
promoted, we don't have to throw an error.

But this gave me an idea that before the loop we probably need to put
RecoveryInProgress() check after GetXLogReplayRecPtr() too. I'll
recheck that.

The attached patchset comprises assorted improvements for pg_wal_replay_wait().

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after
GetXLogReplayRecPtr(), because more wal could be replayed between
these calls. Instead we need to recheck GetXLogReplayRecPtr() after
getting negative result of RecoveryInProgress() because WAL replay
position couldn't get updated after.
0002 patch comprises fix for the header comment of WaitLSNSetLatches() function
0003 patch comprises tests for pg_wal_replay_wait() errors.

Here is a revised version of the patchset. I've fixed some typos,
identation, etc. I'm going to push this once it passes cfbot.

The next revison of the patchset fixes uninitialized variable usage
spotted by cfbot.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v3-0002-Improve-header-comment-for-WaitLSNSetLatches.patchapplication/octet-stream; name=v3-0002-Improve-header-comment-for-WaitLSNSetLatches.patchDownload
From 25fa8054c1c5fe1f7b7eb4a18fa40050c37febdc Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 04:26:44 +0300
Subject: [PATCH v3 2/3] Improve header comment for WaitLSNSetLatches()

Reflect the fact that we remove waiters from the heap, not just set their
latches.
---
 src/backend/commands/waitlsn.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index deefbd458c0..d9cf9e7d75e 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -147,8 +147,9 @@ deleteLSNWaiter(void)
 }
 
 /*
- * Set latches of LSN waiters whose LSN has been replayed.  Set latches of all
- * LSN waiters when InvalidXLogRecPtr is given.
+ * Remove waiters whose LSN has been replayed from the heap and set their
+ * latches.  If InvalidXLogRecPtr is given, remove all waiters from the heap
+ * and set latches for all waiters.
  */
 void
 WaitLSNSetLatches(XLogRecPtr currentLSN)
-- 
2.39.3 (Apple Git-146)

v3-0001-Adjust-pg_wal_replay_wait-procedure-behavior-on-p.patchapplication/octet-stream; name=v3-0001-Adjust-pg_wal_replay_wait-procedure-behavior-on-p.patchDownload
From e94fdafc31936c27289ab8d91f7a655ecab088a4 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 03:52:39 +0300
Subject: [PATCH v3 1/3] Adjust pg_wal_replay_wait() procedure behavior on
 promoted standby

pg_wal_replay_wait() is intended to be called on standby.  However, standby
can be promoted to primary at any moment, even concurrently with the
pg_wal_replay_wait() call.  If recovery is not currently in progress
that doesn't mean the wait was unsuccessful.  Thus, we always need to recheck
if the target LSN is replayed.

Reported-by: Kevin Hale Boyes
Discussion: https://postgr.es/m/CAPpHfdu5QN%2BZGACS%2B7foxmr8_nekgA2PA%2B-G3BuOUrdBLBFb6Q%40mail.gmail.com
Author: Alexander Korotkov
---
 doc/src/sgml/func.sgml                     |  9 +++++
 src/backend/commands/waitlsn.c             | 42 +++++++++++++++++-----
 src/test/recovery/t/043_wal_replay_wait.pl | 15 ++++++--
 3 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0f7154b76ab..968a9985527 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28969,6 +28969,15 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
     connection pooler side.
    </para>
 
+   <para>
+    <function>pg_wal_replay_wait</function> should be called on standby.
+    If a user calls <function>pg_wal_replay_wait</function> on primary, it
+    will error out.  However, if <function>pg_wal_replay_wait</function> is
+    called on primary promoted from standby and <literal>target_lsn</literal>
+    was already replayed, then <function>pg_wal_replay_wait</function> just
+    exits immediately.
+   </para>
+
    <para>
     You can use <function>pg_wal_replay_wait</function> to wait for
     the <type>pg_lsn</type> value.  For example, an application could update
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index 3170f0792a5..deefbd458c0 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -230,14 +230,27 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	Assert(MyProcNumber >= 0 && MyProcNumber < MaxBackends);
 
 	if (!RecoveryInProgress())
+	{
+		/*
+		 * Recovery is not in progress.  Given that we detected this in the
+		 * very first check, this procedure was mistakenly called on primary.
+		 * However, it's possible that standby was promoted concurrently to
+		 * the procedure call, while target LSN is replayed.  So, we still
+		 * check the last replay LSN before reporting an error.
+		 */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return;
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("recovery is not in progress"),
 				 errhint("Waiting for LSN can only be executed during recovery.")));
-
-	/* If target LSN is already replayed, exit immediately */
-	if (targetLSN <= GetXLogReplayRecPtr(NULL))
-		return;
+	}
+	else
+	{
+		/* If target LSN is already replayed, exit immediately */
+		if (targetLSN <= GetXLogReplayRecPtr(NULL))
+			return;
+	}
 
 	if (timeout > 0)
 	{
@@ -257,19 +270,30 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 		int			rc;
 		long		delay_ms = 0;
 
-		/* Check if the waited LSN has been replayed */
-		currentLSN = GetXLogReplayRecPtr(NULL);
-		if (targetLSN <= currentLSN)
-			break;
-
 		/* Recheck that recovery is still in-progress */
 		if (!RecoveryInProgress())
+		{
+			/*
+			 * Recovery was ended, but recheck if target LSN was already
+			 * replayed.
+			 */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				return;
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("recovery is not in progress"),
 					 errdetail("Recovery ended before replaying target LSN %X/%X; last replay LSN %X/%X.",
 							   LSN_FORMAT_ARGS(targetLSN),
 							   LSN_FORMAT_ARGS(currentLSN))));
+		}
+		else
+		{
+			/* Check if the waited LSN has been replayed */
+			currentLSN = GetXLogReplayRecPtr(NULL);
+			if (targetLSN <= currentLSN)
+				break;
+		}
 
 		/*
 		 * If the timeout value is specified, calculate the number of
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index e4842730b05..aaa21c40867 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -126,12 +126,18 @@ ok(1, 'multiple LSN waiters reported consistent data');
 
 # 5. Check that the standby promotion terminates the wait on LSN.  Start
 # waiting for an unreachable LSN then promote.  Check the log for the relevant
-# error message.
+# error message.  Also, check that waiting for already replayed LSN doesn't
+# cause an error even after promotion.
+my $lsn4 =
+  $node_primary->safe_psql('postgres',
+	"SELECT pg_current_wal_insert_lsn() + 10000000000");
+my $lsn5 =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
 my $psql_session = $node_standby1->background_psql('postgres');
 $psql_session->query_until(
 	qr/start/, qq[
 	\\echo start
-	CALL pg_wal_replay_wait('${lsn3}');
+	CALL pg_wal_replay_wait('${lsn4}');
 ]);
 
 $log_offset = -s $node_standby1->logfile;
@@ -140,6 +146,11 @@ $node_standby1->wait_for_log('recovery is not in progress', $log_offset);
 
 ok(1, 'got error after standby promote');
 
+$node_standby1->safe_psql('postgres', "CALL pg_wal_replay_wait('${lsn5}');");
+
+ok(1,
+	'wait for already replayed LSN exists immediately even after promotion');
+
 $node_standby1->stop;
 $node_primary->stop;
 
-- 
2.39.3 (Apple Git-146)

v3-0003-Add-tests-for-pg_wal_replay_wait-errors.patchapplication/octet-stream; name=v3-0003-Add-tests-for-pg_wal_replay_wait-errors.patchDownload
From 4200a98bbebfdf912d521ab7cf920e0d47ede88d Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 6 Aug 2024 05:06:58 +0300
Subject: [PATCH v3 3/3] Add tests for pg_wal_replay_wait() errors

Improve test coverage for pg_wal_replay_wait() procedure by adding test
cases when it errors out.
---
 src/test/recovery/t/043_wal_replay_wait.pl | 42 ++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index aaa21c40867..1234ace1a78 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -77,8 +77,46 @@ $node_standby1->psql(
 ok( $stderr =~ /timed out while waiting for target LSN/,
 	"get timeout on waiting for unreachable LSN");
 
+# 4. Check that pg_wal_replay_wait() triggers an error if called on primary,
+# within another function, or inside a transaction with an isolation level
+# higher than READ COMMITTED.
 
-# 4. Also, check the scenario of multiple LSN waiters.  We make 5 background
+$node_primary->psql(
+	'postgres',
+	"CALL pg_wal_replay_wait('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~ /recovery is not in progress/,
+	"get an error when running on the primary");
+
+$node_standby1->psql(
+	'postgres',
+	"BEGIN ISOLATION LEVEL REPEATABLE READ; CALL pg_wal_replay_wait('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /pg_wal_replay_wait\(\) must be only called without an active or registered snapshot/,
+	"get an error when running in a transaction with an isolation level higher than REPEATABLE READ"
+);
+
+$node_primary->safe_psql(
+	'postgres', qq[
+CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
+  BEGIN
+    CALL pg_wal_replay_wait(target_lsn);
+  END
+\$\$
+LANGUAGE plpgsql;
+]);
+
+$node_primary->wait_for_catchup($node_standby1);
+$node_standby1->psql(
+	'postgres',
+	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+	stderr => \$stderr);
+ok( $stderr =~
+	  /pg_wal_replay_wait\(\) must be only called without an active or registered snapshot/,
+	"get an error when running within another function");
+
+# 5. Also, check the scenario of multiple LSN waiters.  We make 5 background
 # psql sessions each waiting for a corresponding insertion.  When waiting is
 # finished, stored procedures logs if there are visible as many rows as
 # should be.
@@ -124,7 +162,7 @@ for (my $i = 0; $i < 5; $i++)
 
 ok(1, 'multiple LSN waiters reported consistent data');
 
-# 5. Check that the standby promotion terminates the wait on LSN.  Start
+# 6. Check that the standby promotion terminates the wait on LSN.  Start
 # waiting for an unreachable LSN then promote.  Check the log for the relevant
 # error message.  Also, check that waiting for already replayed LSN doesn't
 # cause an error even after promotion.
-- 
2.39.3 (Apple Git-146)

#111Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#108)

On Sat, Aug 10, 2024 at 6:58 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote:

The 0001 patch is intended to improve this situation. Actually, it's
not right to just put RecoveryInProgress() after
GetXLogReplayRecPtr(), because more wal could be replayed between
these calls. Instead we need to recheck GetXLogReplayRecPtr() after
getting negative result of RecoveryInProgress() because WAL replay
position couldn't get updated after.
0002 patch comprises fix for the header comment of WaitLSNSetLatches() function
0003 patch comprises tests for pg_wal_replay_wait() errors.

Before adding more tests, could it be possible to stabilize what's in
the tree? drongo has reported one failure with the recovery test
043_wal_replay_wait.pl introduced recently by 3c5db1d6b016:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&amp;dt=2024-08-05%2004%3A24%3A54

I'm currently running a 043_wal_replay_wait test in a loop of drongo.
No failures during more than 10 hours. As I pointed in [1] it seems
that test stuck somewhere on launching BackgroundPsql. Given that
drongo have some strange failures from time to time (for instance [2]
or [3]), I doubt there is something specifically wrong in
043_wal_replay_wait test that caused the subject failure.

With help of Andrew Dunstan, I've run 043_wal_replay_wait.pl in a loop
for two days, then the whole test suite also for two days. Haven't
seen any failures. I don't see the point to run more experiments,
because Andrew needs to bring drongo back online as a buildfarm
member. It might happen that something exceptional happened on drongo
(like inability to launch a new process or something). For now, I
think the reasonable strategy would be to wait and see if something
similar will repeat on buildfarm.

------
Regards,
Alexander Korotkov
Supabase

#112Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Korotkov (#110)

Hi Alexander,

10.08.2024 20:18, Alexander Korotkov wrote:

On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
...
Here is a revised version of the patchset. I've fixed some typos,
identation, etc. I'm going to push this once it passes cfbot.

The next revison of the patchset fixes uninitialized variable usage
spotted by cfbot.

When running check-world on a rather slow armv7 device, I came across the
043_wal_replay_wait.pl test failure:
t/043_wal_replay_wait.pl .............. 7/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 8.

regress_log_043_wal_replay_wait contains:
...
01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent data
### Promoting node "standby"
# Running: pg_ctl -D .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l
.../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote
waiting for server to promote.... done
server promoted
[21:58:56.637](0.268s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR:  recovery is not in progress
HINT:  Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL
pg_wal_replay_wait('0/300D0E8');' at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.

043_wal_replay_wait_standby.log contains:
2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG:  connection received: host=[local]
2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG:  connection authenticated: user="android" method=trust
(.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG:  connection authorized: user=android database=postgres
application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('2/570CB4E8');
2024-09-12 21:58:56.535 UTC [15123:7] LOG:  received promote request
2024-09-12 21:58:56.535 UTC [15124:2] FATAL:  terminating walreceiver process due to administrator command
2024-09-12 21:58:56.537 UTC [15123:8] LOG:  invalid record length at 0/300D0B0: expected at least 24, got 0
2024-09-12 21:58:56.537 UTC [15123:9] LOG:  redo done at 0/300D088 system usage: CPU: user: 0.01 s, system: 0.00 s,
elapsed: 14.23 s
2024-09-12 21:58:56.537 UTC [15123:10] LOG:  last completed transaction was at log time 2024-09-12 21:58:55.322831+00
2024-09-12 21:58:56.540 UTC [15123:11] LOG:  selected new timeline ID: 2
2024-09-12 21:58:56.589 UTC [15123:12] LOG:  archive recovery complete
2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL:  Recovery ended before replaying target LSN
2/570CB4E8; last replay LSN 0/300D0B0.
2024-09-12 21:58:56.591 UTC [15121:1] LOG:  checkpoint starting: force
2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.075 user=android
database=postgres host=[local]
2024-09-12 21:58:56.595 UTC [15120:4] LOG:  database system is ready to accept connections
2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG:  connection received: host=[local]
2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG:  connection authenticated: user="android" method=trust
(.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG:  connection authorized: user=android database=postgres
application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300D0E8');
2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting for LSN can only be executed during recovery.
2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.015 user=android
database=postgres host=[local]

Note that last replay LSN is 300D0B0, but the latter pg_wal_replay_wait
call wants to wait for 300D0E8.

pg_waldump -p src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
shows:
rmgr: Heap        len (rec/tot):     59/    59, tx:        748, lsn: 0/0300D048, prev 0/0300D020, desc: INSERT off: 35,
flags: 0x00, blkref #0: rel 1663/5/16384 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        748, lsn: 0/0300D088, prev 0/0300D048, desc: COMMIT
2024-09-12 21:58:55.322831 UTC
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/0300D0B0, prev 0/0300D088, desc: RUNNING_XACTS
nextXid 749 latestCompletedXid 748 oldestRunningXid 749

I could reproduce this failure on my workstation with bgworker modified
as below:
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -69 +69 @@ int            BgWriterDelay = 200;
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 15
@@ -307 +307 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
-                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

When looking at the test, I noticed probably a typo in the test message:
wait for already replayed LSN exists immediately ...
shouldn't it be "exits" there (though maybe the whole phrase could be
improved)?

I also suspect that "achieve" is not suitable word in the context of LSNs
and timeouts. Maybe you would find it appropriate to replace it with
"reach"?

Best regards,
Alexander

#113Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Lakhin (#112)
2 attachment(s)

Hi, Alexander!

On Fri, Sep 13, 2024 at 3:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

10.08.2024 20:18, Alexander Korotkov wrote:

On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
...
Here is a revised version of the patchset. I've fixed some typos,
identation, etc. I'm going to push this once it passes cfbot.

The next revison of the patchset fixes uninitialized variable usage
spotted by cfbot.

When running check-world on a rather slow armv7 device, I came across the
043_wal_replay_wait.pl test failure:
t/043_wal_replay_wait.pl .............. 7/? # Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 8.

regress_log_043_wal_replay_wait contains:
...
01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent data
### Promoting node "standby"
# Running: pg_ctl -D .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l
.../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote
waiting for server to promote.... done
server promoted
[21:58:56.637](0.268s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR: recovery is not in progress
HINT: Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL
pg_wal_replay_wait('0/300D0E8');' at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.

043_wal_replay_wait_standby.log contains:
2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG: connection received: host=[local]
2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG: connection authenticated: user="android" method=trust
(.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG: connection authorized: user=android database=postgres
application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('2/570CB4E8');
2024-09-12 21:58:56.535 UTC [15123:7] LOG: received promote request
2024-09-12 21:58:56.535 UTC [15124:2] FATAL: terminating walreceiver process due to administrator command
2024-09-12 21:58:56.537 UTC [15123:8] LOG: invalid record length at 0/300D0B0: expected at least 24, got 0
2024-09-12 21:58:56.537 UTC [15123:9] LOG: redo done at 0/300D088 system usage: CPU: user: 0.01 s, system: 0.00 s,
elapsed: 14.23 s
2024-09-12 21:58:56.537 UTC [15123:10] LOG: last completed transaction was at log time 2024-09-12 21:58:55.322831+00
2024-09-12 21:58:56.540 UTC [15123:11] LOG: selected new timeline ID: 2
2024-09-12 21:58:56.589 UTC [15123:12] LOG: archive recovery complete
2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL: Recovery ended before replaying target LSN
2/570CB4E8; last replay LSN 0/300D0B0.
2024-09-12 21:58:56.591 UTC [15121:1] LOG: checkpoint starting: force
2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.075 user=android
database=postgres host=[local]
2024-09-12 21:58:56.595 UTC [15120:4] LOG: database system is ready to accept connections
2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG: connection received: host=[local]
2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG: connection authenticated: user="android" method=trust
(.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG: connection authorized: user=android database=postgres
application_name=043_wal_replay_wait.pl
2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300D0E8');
2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting for LSN can only be executed during recovery.
2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: disconnection: session time: 0:00:00.015 user=android
database=postgres host=[local]

Note that last replay LSN is 300D0B0, but the latter pg_wal_replay_wait
call wants to wait for 300D0E8.

pg_waldump -p src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
shows:
rmgr: Heap len (rec/tot): 59/ 59, tx: 748, lsn: 0/0300D048, prev 0/0300D020, desc: INSERT off: 35,
flags: 0x00, blkref #0: rel 1663/5/16384 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 748, lsn: 0/0300D088, prev 0/0300D048, desc: COMMIT
2024-09-12 21:58:55.322831 UTC
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0300D0B0, prev 0/0300D088, desc: RUNNING_XACTS
nextXid 749 latestCompletedXid 748 oldestRunningXid 749

I could reproduce this failure on my workstation with bgworker modified
as below:
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -69 +69 @@ int            BgWriterDelay = 200;
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 15
@@ -307 +307 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
-                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);

When looking at the test, I noticed probably a typo in the test message:
wait for already replayed LSN exists immediately ...
shouldn't it be "exits" there (though maybe the whole phrase could be
improved)?

I also suspect that "achieve" is not suitable word in the context of LSNs
and timeouts. Maybe you would find it appropriate to replace it with
"reach"?

Thank you for your report!

Please find two patches attached. The first one does minor cleanup
including misuse of words you've pointed. The second one adds missing
wait_for_catchup(). That should fix the test failure you've spotted.
Please, check if it fixes an issue for you.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v1-0002-Add-missing-wait_for_catchup-call-in-043_wal_repl.patchapplication/octet-stream; name=v1-0002-Add-missing-wait_for_catchup-call-in-043_wal_repl.patchDownload
From 6f2e7a5e2d4043192609dd4f24f436977abe0da8 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 16 Sep 2024 21:35:06 +0300
Subject: [PATCH v1 2/2] Add missing wait_for_catchup() call in
 043_wal_replay_wait.pl

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/1d7b08f2-64a2-77fb-c666-c9a74c68eeda%40gmail.com
---
 src/test/recovery/t/043_wal_replay_wait.pl | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index b4801635d53..3d09a145fe7 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -177,6 +177,7 @@ $psql_session->query_until(
 	CALL pg_wal_replay_wait('${lsn4}');
 ]);
 
+$node_primary->wait_for_catchup($node_standby);
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
 $node_standby->wait_for_log('recovery is not in progress', $log_offset);
-- 
2.39.3 (Apple Git-146)

v1-0001-Minor-cleanup-related-to-pg_wal_replay_wait-proce.patchapplication/octet-stream; name=v1-0001-Minor-cleanup-related-to-pg_wal_replay_wait-proce.patchDownload
From 0ba59157a6946f3dc4b2119f90051ec6f48ce5d0 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 16 Sep 2024 21:30:15 +0300
Subject: [PATCH v1 1/2] Minor cleanup related to pg_wal_replay_wait()
 procedure

 * Rename $node_standby1 to $node_standby in 043_wal_replay_wait.pl as there
   is only one standby.
 * Remove useless debug printing in 043_wal_replay_wait.pl.
 * Fix typo in one check description in 043_wal_replay_wait.pl.
 * Fix couple of wordings in waitlsn.c comments.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/1d7b08f2-64a2-77fb-c666-c9a74c68eeda%40gmail.com
---
 src/backend/commands/waitlsn.c             |  4 +-
 src/test/recovery/t/043_wal_replay_wait.pl | 47 +++++++++++-----------
 2 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index d7065726749..ffa83a20008 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -299,7 +299,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 		/*
 		 * If the timeout value is specified, calculate the number of
 		 * milliseconds before the timeout.  Exit if the timeout is already
-		 * achieved.
+		 * reached.
 		 */
 		if (timeout > 0)
 		{
@@ -325,7 +325,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	deleteLSNWaiter();
 
 	/*
-	 * If we didn't achieve the target LSN, we must be exited by timeout.
+	 * If we didn't reach the target LSN, we must be exited by timeout.
 	 */
 	if (targetLSN > currentLSN)
 	{
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index 1234ace1a78..b4801635d53 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -19,15 +19,15 @@ my $backup_name = 'my_backup';
 $node_primary->backup($backup_name);
 
 # Create a streaming standby with a 1 second delay from the backup
-my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $delay = 1;
-$node_standby1->init_from_backup($node_primary, $backup_name,
+$node_standby->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
-$node_standby1->append_conf(
+$node_standby->append_conf(
 	'postgresql.conf', qq[
 	recovery_min_apply_delay = '${delay}s'
 ]);
-$node_standby1->start;
+$node_standby->start;
 
 # 1. Make sure that pg_wal_replay_wait() works: add new content to
 # primary and memorize primary's insert LSN, then wait for that LSN to be
@@ -36,7 +36,7 @@ $node_primary->safe_psql('postgres',
 	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
 my $lsn1 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $output = $node_standby1->safe_psql(
+my $output = $node_standby->safe_psql(
 	'postgres', qq[
 	CALL pg_wal_replay_wait('${lsn1}', 1000000);
 	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
@@ -52,7 +52,7 @@ $node_primary->safe_psql('postgres',
 	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
 my $lsn2 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-$output = $node_standby1->safe_psql(
+$output = $node_standby->safe_psql(
 	'postgres', qq[
 	CALL pg_wal_replay_wait('${lsn2}');
 	SELECT count(*) FROM wait_test;
@@ -68,9 +68,9 @@ my $lsn3 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
 my $stderr;
-$node_standby1->safe_psql('postgres',
+$node_standby->safe_psql('postgres',
 	"CALL pg_wal_replay_wait('${lsn2}', 10);");
-$node_standby1->psql(
+$node_standby->psql(
 	'postgres',
 	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
 	stderr => \$stderr);
@@ -88,7 +88,7 @@ $node_primary->psql(
 ok( $stderr =~ /recovery is not in progress/,
 	"get an error when running on the primary");
 
-$node_standby1->psql(
+$node_standby->psql(
 	'postgres',
 	"BEGIN ISOLATION LEVEL REPEATABLE READ; CALL pg_wal_replay_wait('${lsn3}');",
 	stderr => \$stderr);
@@ -107,8 +107,8 @@ CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
 LANGUAGE plpgsql;
 ]);
 
-$node_primary->wait_for_catchup($node_standby1);
-$node_standby1->psql(
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
 	'postgres',
 	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
 	stderr => \$stderr);
@@ -134,17 +134,16 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
 \$\$
 LANGUAGE plpgsql;
 ]);
-$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
-	print($i);
 	$node_primary->safe_psql('postgres',
 		"INSERT INTO wait_test VALUES (${i});");
 	my $lsn =
 	  $node_primary->safe_psql('postgres',
 		"SELECT pg_current_wal_insert_lsn()");
-	$psql_sessions[$i] = $node_standby1->background_psql('postgres');
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
 	$psql_sessions[$i]->query_until(
 		qr/start/, qq[
 		\\echo start
@@ -152,11 +151,11 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
-my $log_offset = -s $node_standby1->logfile;
-$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
 {
-	$node_standby1->wait_for_log("count ${i}", $log_offset);
+	$node_standby->wait_for_log("count ${i}", $log_offset);
 	$psql_sessions[$i]->quit;
 }
 
@@ -171,25 +170,25 @@ my $lsn4 =
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby1->background_psql('postgres');
+my $psql_session = $node_standby->background_psql('postgres');
 $psql_session->query_until(
 	qr/start/, qq[
 	\\echo start
 	CALL pg_wal_replay_wait('${lsn4}');
 ]);
 
-$log_offset = -s $node_standby1->logfile;
-$node_standby1->promote;
-$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
 ok(1, 'got error after standby promote');
 
-$node_standby1->safe_psql('postgres', "CALL pg_wal_replay_wait('${lsn5}');");
+$node_standby->safe_psql('postgres', "CALL pg_wal_replay_wait('${lsn5}');");
 
 ok(1,
-	'wait for already replayed LSN exists immediately even after promotion');
+	'wait for already replayed LSN exits immediately even after promotion');
 
-$node_standby1->stop;
+$node_standby->stop;
 $node_primary->stop;
 
 # If we send \q with $psql_session->quit the command can be sent to the session
-- 
2.39.3 (Apple Git-146)

#114Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Korotkov (#113)

Hi Alexander,

16.09.2024 21:55, Alexander Korotkov wrote:

Please find two patches attached. The first one does minor cleanup
including misuse of words you've pointed. The second one adds missing
wait_for_catchup(). That should fix the test failure you've spotted.
Please, check if it fixes an issue for you.

Thank you for looking at that!

Unfortunately, the issue is still here — the test failed for me 6 out of
10 runs, as below:
[05:14:02.807](0.135s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR:  recovery is not in progress
HINT:  Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=12734 host=/tmp/04hQ75NuXf dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL
pg_wal_replay_wait('0/300F248');' at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.

043_wal_replay_wait_standby.log:
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl DETAIL:  Recovery ended before replaying target LSN
2/570CD648; last replay LSN 0/300F210.
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl STATEMENT:  CALL pg_wal_replay_wait('2/570CD648');
2024-09-17 05:14:02.714 UTC [1817155] LOG:  checkpoint starting: force
2024-09-17 05:14:02.714 UTC [1817154] LOG:  database system is ready to accept connections
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300F248');
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl ERROR: recovery is not in progress

pg_waldump -p .../t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
rmgr: Transaction len (rec/tot):     34/    34, tx:        748, lsn: 0/0300F1E8, prev 0/0300F1A8, desc: COMMIT
2024-09-17 05:14:01.654874 UTC
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/0300F210, prev 0/0300F1E8, desc: RUNNING_XACTS
nextXid 749 latestCompletedXid 748 oldestRunningXid 749

I wonder, can you reproduce this with that bgwriter's modification?

I've also found two more "achievements" coined by 3c5db1d6b:
doc/src/sgml/func.sgml:   It may also happen that target <acronym>lsn</acronym> is not achieved
src/backend/access/transam/xlog.c-       * recovery was ended before achieving the target LSN.

Best regards,
Alexander

#115Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Lakhin (#114)
2 attachment(s)

On Tue, Sep 17, 2024 at 9:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

16.09.2024 21:55, Alexander Korotkov wrote:

Please find two patches attached. The first one does minor cleanup
including misuse of words you've pointed. The second one adds missing
wait_for_catchup(). That should fix the test failure you've spotted.
Please, check if it fixes an issue for you.

Thank you for looking at that!

Unfortunately, the issue is still here — the test failed for me 6 out of
10 runs, as below:
[05:14:02.807](0.135s) ok 8 - got error after standby promote
error running SQL: 'psql:<stdin>:1: ERROR: recovery is not in progress
HINT: Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=12734 host=/tmp/04hQ75NuXf dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL
pg_wal_replay_wait('0/300F248');' at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.

043_wal_replay_wait_standby.log:
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl ERROR: recovery is not in progress
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl DETAIL: Recovery ended before replaying target LSN
2/570CD648; last replay LSN 0/300F210.
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl STATEMENT: CALL pg_wal_replay_wait('2/570CD648');
2024-09-17 05:14:02.714 UTC [1817155] LOG: checkpoint starting: force
2024-09-17 05:14:02.714 UTC [1817154] LOG: database system is ready to accept connections
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl LOG: statement: CALL pg_wal_replay_wait('0/300F248');
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl ERROR: recovery is not in progress

pg_waldump -p .../t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 000000010000000000000003
rmgr: Transaction len (rec/tot): 34/ 34, tx: 748, lsn: 0/0300F1E8, prev 0/0300F1A8, desc: COMMIT
2024-09-17 05:14:01.654874 UTC
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 0/0300F210, prev 0/0300F1E8, desc: RUNNING_XACTS
nextXid 749 latestCompletedXid 748 oldestRunningXid 749

I wonder, can you reproduce this with that bgwriter's modification?

Yes, now I did reproduce. I got that the problem could be that insert
LSN is not yet written at primary, thus wait_for_catchup doesn't wait
for it. I've workarounded that using pg_switch_wal(). The revised
patchset is attached.

I've also found two more "achievements" coined by 3c5db1d6b:
doc/src/sgml/func.sgml: It may also happen that target <acronym>lsn</acronym> is not achieved
src/backend/access/transam/xlog.c- * recovery was ended before achieving the target LSN.

Fixed this at well in 0001.

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v2-0002-Ensure-standby-promotion-point-in-043_wal_replay_.patchapplication/octet-stream; name=v2-0002-Ensure-standby-promotion-point-in-043_wal_replay_.patchDownload
From 95001dba4ed9c60d699338f3dc85ce836fe7d702 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 16 Sep 2024 21:35:06 +0300
Subject: [PATCH v2 2/2] Ensure standby promotion point in
 043_wal_replay_wait.pl

This commit ensures standby will be promoted at least at the primary insert
LSN we have just observed.  We use pg_switch_wal() to force the insert LSN
to be written then wait for standby to catchup.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/1d7b08f2-64a2-77fb-c666-c9a74c68eeda%40gmail.com
---
 doc/src/sgml/func.sgml                     | 2 +-
 src/backend/access/transam/xlog.c          | 2 +-
 src/test/recovery/t/043_wal_replay_wait.pl | 6 ++++++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 84eb3a45eea..6f75bd0c7d2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -29076,7 +29076,7 @@ postgres=# SELECT * FROM movie WHERE genre = 'Drama';
 (0 rows)
    </programlisting>
 
-   It may also happen that target <acronym>lsn</acronym> is not achieved
+   It may also happen that target <acronym>lsn</acronym> is not reached
    within the timeout.  In that case the error is thrown.
 
    <programlisting>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e96aa4d1cb1..853ab06812b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6168,7 +6168,7 @@ StartupXLOG(void)
 
 	/*
 	 * Wake up all waiters for replay LSN.  They need to report an error that
-	 * recovery was ended before achieving the target LSN.
+	 * recovery was ended before reaching the target LSN.
 	 */
 	WaitLSNSetLatches(InvalidXLogRecPtr);
 
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index b4801635d53..0b5f6e388e4 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -177,6 +177,12 @@ $psql_session->query_until(
 	CALL pg_wal_replay_wait('${lsn4}');
 ]);
 
+# Make sure standby will be promoted at least at the primary insert LSN we
+# have just observed.  Use pg_switch_wal() to force the insert LSN to be
+# written then wait for standby to catchup.
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal();');
+$node_primary->wait_for_catchup($node_standby);
+
 $log_offset = -s $node_standby->logfile;
 $node_standby->promote;
 $node_standby->wait_for_log('recovery is not in progress', $log_offset);
-- 
2.39.5 (Apple Git-154)

v2-0001-Minor-cleanup-related-to-pg_wal_replay_wait-proce.patchapplication/octet-stream; name=v2-0001-Minor-cleanup-related-to-pg_wal_replay_wait-proce.patchDownload
From 0ba59157a6946f3dc4b2119f90051ec6f48ce5d0 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 16 Sep 2024 21:30:15 +0300
Subject: [PATCH v2 1/2] Minor cleanup related to pg_wal_replay_wait()
 procedure

 * Rename $node_standby1 to $node_standby in 043_wal_replay_wait.pl as there
   is only one standby.
 * Remove useless debug printing in 043_wal_replay_wait.pl.
 * Fix typo in one check description in 043_wal_replay_wait.pl.
 * Fix couple of wordings in waitlsn.c comments.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/1d7b08f2-64a2-77fb-c666-c9a74c68eeda%40gmail.com
---
 src/backend/commands/waitlsn.c             |  4 +-
 src/test/recovery/t/043_wal_replay_wait.pl | 47 +++++++++++-----------
 2 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index d7065726749..ffa83a20008 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -299,7 +299,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 		/*
 		 * If the timeout value is specified, calculate the number of
 		 * milliseconds before the timeout.  Exit if the timeout is already
-		 * achieved.
+		 * reached.
 		 */
 		if (timeout > 0)
 		{
@@ -325,7 +325,7 @@ WaitForLSNReplay(XLogRecPtr targetLSN, int64 timeout)
 	deleteLSNWaiter();
 
 	/*
-	 * If we didn't achieve the target LSN, we must be exited by timeout.
+	 * If we didn't reach the target LSN, we must be exited by timeout.
 	 */
 	if (targetLSN > currentLSN)
 	{
diff --git a/src/test/recovery/t/043_wal_replay_wait.pl b/src/test/recovery/t/043_wal_replay_wait.pl
index 1234ace1a78..b4801635d53 100644
--- a/src/test/recovery/t/043_wal_replay_wait.pl
+++ b/src/test/recovery/t/043_wal_replay_wait.pl
@@ -19,15 +19,15 @@ my $backup_name = 'my_backup';
 $node_primary->backup($backup_name);
 
 # Create a streaming standby with a 1 second delay from the backup
-my $node_standby1 = PostgreSQL::Test::Cluster->new('standby');
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $delay = 1;
-$node_standby1->init_from_backup($node_primary, $backup_name,
+$node_standby->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
-$node_standby1->append_conf(
+$node_standby->append_conf(
 	'postgresql.conf', qq[
 	recovery_min_apply_delay = '${delay}s'
 ]);
-$node_standby1->start;
+$node_standby->start;
 
 # 1. Make sure that pg_wal_replay_wait() works: add new content to
 # primary and memorize primary's insert LSN, then wait for that LSN to be
@@ -36,7 +36,7 @@ $node_primary->safe_psql('postgres',
 	"INSERT INTO wait_test VALUES (generate_series(11, 20))");
 my $lsn1 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $output = $node_standby1->safe_psql(
+my $output = $node_standby->safe_psql(
 	'postgres', qq[
 	CALL pg_wal_replay_wait('${lsn1}', 1000000);
 	SELECT pg_lsn_cmp(pg_last_wal_replay_lsn(), '${lsn1}'::pg_lsn);
@@ -52,7 +52,7 @@ $node_primary->safe_psql('postgres',
 	"INSERT INTO wait_test VALUES (generate_series(21, 30))");
 my $lsn2 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-$output = $node_standby1->safe_psql(
+$output = $node_standby->safe_psql(
 	'postgres', qq[
 	CALL pg_wal_replay_wait('${lsn2}');
 	SELECT count(*) FROM wait_test;
@@ -68,9 +68,9 @@ my $lsn3 =
   $node_primary->safe_psql('postgres',
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
 my $stderr;
-$node_standby1->safe_psql('postgres',
+$node_standby->safe_psql('postgres',
 	"CALL pg_wal_replay_wait('${lsn2}', 10);");
-$node_standby1->psql(
+$node_standby->psql(
 	'postgres',
 	"CALL pg_wal_replay_wait('${lsn3}', 1000);",
 	stderr => \$stderr);
@@ -88,7 +88,7 @@ $node_primary->psql(
 ok( $stderr =~ /recovery is not in progress/,
 	"get an error when running on the primary");
 
-$node_standby1->psql(
+$node_standby->psql(
 	'postgres',
 	"BEGIN ISOLATION LEVEL REPEATABLE READ; CALL pg_wal_replay_wait('${lsn3}');",
 	stderr => \$stderr);
@@ -107,8 +107,8 @@ CREATE FUNCTION pg_wal_replay_wait_wrap(target_lsn pg_lsn) RETURNS void AS \$\$
 LANGUAGE plpgsql;
 ]);
 
-$node_primary->wait_for_catchup($node_standby1);
-$node_standby1->psql(
+$node_primary->wait_for_catchup($node_standby);
+$node_standby->psql(
 	'postgres',
 	"SELECT pg_wal_replay_wait_wrap('${lsn3}');",
 	stderr => \$stderr);
@@ -134,17 +134,16 @@ CREATE FUNCTION log_count(i int) RETURNS void AS \$\$
 \$\$
 LANGUAGE plpgsql;
 ]);
-$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause();");
 my @psql_sessions;
 for (my $i = 0; $i < 5; $i++)
 {
-	print($i);
 	$node_primary->safe_psql('postgres',
 		"INSERT INTO wait_test VALUES (${i});");
 	my $lsn =
 	  $node_primary->safe_psql('postgres',
 		"SELECT pg_current_wal_insert_lsn()");
-	$psql_sessions[$i] = $node_standby1->background_psql('postgres');
+	$psql_sessions[$i] = $node_standby->background_psql('postgres');
 	$psql_sessions[$i]->query_until(
 		qr/start/, qq[
 		\\echo start
@@ -152,11 +151,11 @@ for (my $i = 0; $i < 5; $i++)
 		SELECT log_count(${i});
 	]);
 }
-my $log_offset = -s $node_standby1->logfile;
-$node_standby1->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
+my $log_offset = -s $node_standby->logfile;
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_resume();");
 for (my $i = 0; $i < 5; $i++)
 {
-	$node_standby1->wait_for_log("count ${i}", $log_offset);
+	$node_standby->wait_for_log("count ${i}", $log_offset);
 	$psql_sessions[$i]->quit;
 }
 
@@ -171,25 +170,25 @@ my $lsn4 =
 	"SELECT pg_current_wal_insert_lsn() + 10000000000");
 my $lsn5 =
   $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()");
-my $psql_session = $node_standby1->background_psql('postgres');
+my $psql_session = $node_standby->background_psql('postgres');
 $psql_session->query_until(
 	qr/start/, qq[
 	\\echo start
 	CALL pg_wal_replay_wait('${lsn4}');
 ]);
 
-$log_offset = -s $node_standby1->logfile;
-$node_standby1->promote;
-$node_standby1->wait_for_log('recovery is not in progress', $log_offset);
+$log_offset = -s $node_standby->logfile;
+$node_standby->promote;
+$node_standby->wait_for_log('recovery is not in progress', $log_offset);
 
 ok(1, 'got error after standby promote');
 
-$node_standby1->safe_psql('postgres', "CALL pg_wal_replay_wait('${lsn5}');");
+$node_standby->safe_psql('postgres', "CALL pg_wal_replay_wait('${lsn5}');");
 
 ok(1,
-	'wait for already replayed LSN exists immediately even after promotion');
+	'wait for already replayed LSN exits immediately even after promotion');
 
-$node_standby1->stop;
+$node_standby->stop;
 $node_primary->stop;
 
 # If we send \q with $psql_session->quit the command can be sent to the session
-- 
2.39.5 (Apple Git-154)

#116Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Korotkov (#115)

17.09.2024 10:47, Alexander Korotkov wrote:

Yes, now I did reproduce. I got that the problem could be that insert
LSN is not yet written at primary, thus wait_for_catchup doesn't wait
for it. I've workarounded that using pg_switch_wal(). The revised
patchset is attached.

Thank you for the revised patch!

The improved test works reliably for me (100 out of 100 runs passed),

Best regards,
Alexander