Causal reads take II
Hi hackers,
Here is a new version of my "causal reads" patch (see the earlier
thread from the 9.6 development cycle[1]), which provides a way to
avoid stale reads when load balancing with streaming replication.
To try it out:
Set up a primary and some standbys, and put "causal_reads_timeout =
4s" in the primary's postgresql.conf. Then SET causal_reads = on and
experiment with various workloads, watching your master's log and
looking at pg_stat_replication. For example you could try out
test-causal-reads.c with --causal-reads --check (from the earlier
thread) or write something similar, and verify the behaviour while
killing, pausing, overwhelming servers etc.
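
In case it's useful, here's roughly what that looks like in psql once the
patches are applied (causal_reads_state and replay_lag are columns added by
this patch series; the 4s value is just the one used in this example):

  -- in any session that should get the guarantee, on the primary or a standby:
  SET causal_reads = on;

  -- on the primary, see which standbys are currently usable for causal reads:
  SELECT application_name, replay_lag, sync_state, causal_reads_state
  FROM pg_stat_replication;
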
Here's a brief restatement of the problem I'm trying to address and
how this patch works:
In 9.6 we got a new synchronous_commit level "remote_apply", which
causes committing transactions to block until the commit record has
been applied on the current synchronous standby server. In 10devel
can now be servers plural. That's useful because it means that a
client can run tx1 on the primary and then run tx2 on an appropriate
standby, or cause some other client to do so, and be sure that tx2 can
see tx1. Tx2 can be said to be "causally dependent" on tx1 because
clients expect tx2 to see tx1, because they know that tx1 happened
before tx2.
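
As a concrete example of that pattern (the table is hypothetical), the
sequence clients expect to work with remote_apply is:

  -- setup (hypothetical table):
  CREATE TABLE orders (id int PRIMARY KEY, status text);

  -- tx1, on the primary, with synchronous_commit = remote_apply:
  BEGIN;
  INSERT INTO orders VALUES (42, 'paid');
  COMMIT;  -- returns only after the synchronous standby(s) have applied it

  -- tx2, started afterwards on a standby:
  SELECT status FROM orders WHERE id = 42;  -- expected to see 'paid'
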
In practice there are complications relating to failure and
transitions. How should you find an appropriate standby? Suppose you
have a primary and N standbys, you set synchronous_standby_names to
wait for all N standbys, and you set synchronous_commit to
remote_apply. Then the above guarantee of visibility of tx1 by tx2
works, no matter which server you run tx2 on. Unfortunately, if one
of your standby servers fails or there is a network partition, all
commits will block until you fix that. So you probably want to set
synchronous_standby_names to wait for a subset of your set of
standbys. Now you can lose some number of standby servers without
holding up commits on the primary, but the visibility guarantee for
causal dependencies is lost! How can a client know for certain
whether tx2 run on any given standby can see a transaction tx1 that it
has heard about? If you're using the new "ANY n" mode then the subset
of standbys that have definitely applied tx1 is not known to any
client; if you're using the traditional FIRST mode it's complicated
during transitions (you might be talking to a standby that has
recently lost its link to the primary and the primary could have
decided to wait for the next highest priority standby instead and then
returned from COMMIT successfully).
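
To make that middle ground concrete, here's a sketch of the kind of
configuration I mean (the standby names are hypothetical):

  ALTER SYSTEM SET synchronous_commit = 'remote_apply';
  ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (s1, s2, s3)';
  SELECT pg_reload_conf();

  -- Commits on the primary now wait for at least one of s1/s2/s3 to apply,
  -- but a client about to run tx2 on, say, s2 has no way to know whether s2
  -- was one of the standbys waited for, so the visibility guarantee is gone.
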
This patch provides the following guarantee: if causal_reads is on
for both tx1 and tx2, then after tx1 returns, tx2 will either see tx1
or fail with an error indicating that the server is currently
unavailable for causal reads. This guarantee is upheld even if there
is a network partition and the standby running tx2 is unable to
communicate with the primary server, but requires the system clocks of
all standbys to differ from the primary's by less than a certain
amount of allowable skew that is accounted for in the algorithm
(causal_reads_timeout / 4, see README.causal_reads for gory details).
It works by sending a stream of "leases" to standbys that are applying
fast enough. These leases grant the standby the right to assume that
all transactions that were run with causal_reads = on and have
returned control have already been applied locally, without doing any
communication or waiting, for a limited time. Leases are promises
made by the primary that it will wait for all such transactions to be
applied on each 'available' standby or for available standbys' leases
to be revoked because they're lagging too much, and for any revoked
leases to expire.
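
From the client's point of view, a causal_reads transaction on a standby that
currently holds a lease runs without any extra waiting or communication; on a
standby without one it fails instead of returning potentially stale data, so a
client or pooler can retry on another standby. Roughly (the error text is
taken from the patch, the query is hypothetical):

  SET causal_reads = on;
  SELECT * FROM orders;
  ERROR:  standby is not available for causal reads
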
As discussed in the earlier thread, there are other ways that tx2 run
on a standby could get a useful guarantee about the visibility of an
earlier transaction tx1 that the client knows about. (1) User-managed
"causality tokens": Clients could somehow obtain the LSN of commit
tx1 (or later), and then tx2 could explicitly wait for that LSN to be
applied, as proposed by Ivan Kartyshov[2] and others; if you aren't
also using sync rep for data loss avoidance, then tx1 will return from
committing without waiting for standbys, and by the time tx2 starts on
a standby it may find that the LSN has already been applied and not
have to wait at all. That is definitely good. Unfortunately it also
transfers the problem of tracking causal dependencies between
transactions to client code, which is a burden on the application
developer and difficult to retrofit. (2) Middleware-managed
causality tokens: Something like pgpool or pgbouncer or some other
proxy could sit in front of all of your PostgreSQL servers and watch
all transactions and do the LSN tracking for you, inserting waits
where appropriate so that no standby query ever sees a snapshot that
doesn't include any commit that any client has heard about; that
requires tx2 to wait for transactions that may be later than tx1 to be
applied, potentially slowing down every read query, and requires
pushing all transactions through a single central process, thereby
introducing its own failover problem with an associated transition
failure mode that could break our guarantee if two of these
proxies are ever active at once.
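
For comparison, option (1) done by hand today looks something like the
following sketch, using the existing functions (names as of 9.6; there is no
built-in wait-for-LSN call here, so the standby side has to poll):

  -- on the primary, just after tx1 commits, capture a causality token:
  SELECT pg_current_xlog_location();  -- say it returns '0/3000060'

  -- on the standby, before starting tx2, wait until that LSN is applied:
  SELECT pg_last_xlog_replay_location() >= '0/3000060';  -- poll until true
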
Don't get me wrong, I think those are good ideas: let's do those too.
I guess that people working on logical multi-master replication might
eventually want a general concept of causality tokens which could
include some kind of vector clock. But I don't see this proposal as
conflicting with any of that. It's a set of trade-offs that provides
a simple solution for users who want to be able to talk directly to
any PostgreSQL standby server out of the box without pushing
everything through a central observer, and who want to be able to
enable this for existing applications without having to rewrite them
to insert complicated code to track and communicate LSNs.
Some assorted thoughts and things I'd love to hear your ideas on:
I admit that it has a potentially confusing relationship with
synchronous replication. It is presented as a separate feature, and
you can use both features together or use them independently:
synchronous_standby_names and synchronous_commit are for controlling
your data loss risk, and causal_reads_standby_names and causal_reads
are for controlling distributed read consistency. Perhaps the
causal_reads GUC should support different levels rather than using
on/off; the mode described above could be enabled with something =
'causal_read_lease', leaving room for other modes. Maybe the whole
feature needs a better name: I borrowed "causal reads" from Galera's
wsrep_causal_reads/wsrep_sync_wait. That system makes readers (think
standbys) wait for the global end of WAL to be applied locally at the
start of every transaction, which could also be a potential future
mode for us, but I thought it was much more interesting to have
wait-free reads on standbys, especially if you already happen to be
waiting on the primary because you want to avoid data loss with
syncrep. To achieve that I added system-clock-based leases. I
suspect some people will dislike that part: the guarantee includes the
caveat about the maximum difference between system clocks, and the
patch doesn't do anything as clever as Google's Spanner/Truetime
system or come with a free built-in atomic clock, so it relies on
setting the max clock skew conservatively and making sure you have NTP
set up correctly (for example, things reportedly got a bit messy for a
short time after the recent leap second if you happened to have only
one server from pool.ntp.org in your ntpd.conf and were unlucky). I
considered ways to make causal reads an extension, but it'd need
fairly invasive hooks including the ability to change replication wire
protocol messages.
Long term, I think it would be pretty cool if we could develop a set
of features that give you distributed sequential consistency on top of
streaming replication. Something like (this | causality-tokens) +
SERIALIZABLE-DEFERRABLE-on-standbys[3] +
distributed-dirty-read-prevention[4].
The patch:
The replay lag tracking patch this depends on is in the current
commitfest[1] and is presented as an independently useful feature.
Please find two patches to implement causal reads for the open CF
attached. First apply replay-lag-v16.patch, then
refactor-syncrep-exit-v16.patch, then causal-reads-v16.patch.
Thanks for reading!
[1]: /messages/by-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2]: /messages/by-id/0240c26c-9f84-30ea-fca9-93ab2df5f305@postgrespro.ru
[3]: /messages/by-id/CAEepm=2b9TV+vJ4UeSBixDrW7VUiTjxPwWq8K3QwFSWx0pTXHQ@mail.gmail.com
[4]: /messages/by-id/CAEepm=1GNCriNvWhPkWCqrsbXWGtWEEpvA-KnovMbht5ryzbmg@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
refactor-syncrep-exit-v16.patch (application/octet-stream)
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 9143c47..4ab47a2 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -97,6 +97,8 @@ static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
static int SyncRepWakeQueue(bool all, int mode);
+static bool SyncRepCheckForEarlyExit(void);
+
static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
XLogRecPtr *flushPtr,
XLogRecPtr *applyPtr,
@@ -225,57 +227,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
break;
- /*
- * If a wait for synchronous replication is pending, we can neither
- * acknowledge the commit nor raise ERROR or FATAL. The latter would
- * lead the client to believe that the transaction aborted, which is
- * not true: it's already committed locally. The former is no good
- * either: the client has requested synchronous replication, and is
- * entitled to assume that an acknowledged commit is also replicated,
- * which might not be true. So in this case we issue a WARNING (which
- * some clients may be able to interpret) and shut off further output.
- * We do NOT reset ProcDiePending, so that the process will die after
- * the commit is cleaned up.
- */
- if (ProcDiePending)
- {
- ereport(WARNING,
- (errcode(ERRCODE_ADMIN_SHUTDOWN),
- errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
- break;
- }
-
- /*
- * It's unclear what to do if a query cancel interrupt arrives. We
- * can't actually abort at this point, but ignoring the interrupt
- * altogether is not helpful, so we just terminate the wait with a
- * suitable warning.
- */
- if (QueryCancelPending)
- {
- QueryCancelPending = false;
- ereport(WARNING,
- (errmsg("canceling wait for synchronous replication due to user request"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- SyncRepCancelWait();
- break;
- }
-
- /*
- * If the postmaster dies, we'll probably never get an
- * acknowledgement, because all the wal sender processes will exit. So
- * just bail out.
- */
- if (!PostmasterIsAlive())
- {
- ProcDiePending = true;
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
+ /* Check if we need to break early due to cancel/shutdown/death. */
+ if (SyncRepCheckForEarlyExit())
break;
- }
/*
* Wait on latch. Any condition that should wake us up will set the
@@ -1088,6 +1042,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
}
#endif
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+ /*
+ * If a wait for synchronous replication is pending, we can neither
+ * acknowledge the commit nor raise ERROR or FATAL. The latter would
+ * lead the client to believe that the transaction aborted, which is
+ * not true: it's already committed locally. The former is no good
+ * either: the client has requested synchronous replication, and is
+ * entitled to assume that an acknowledged commit is also replicated,
+ * which might not be true. So in this case we issue a WARNING (which
+ * some clients may be able to interpret) and shut off further output.
+ * We do NOT reset ProcDiePending, so that the process will die after
+ * the commit is cleaned up.
+ */
+ if (ProcDiePending)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * It's unclear what to do if a query cancel interrupt arrives. We
+ * can't actually abort at this point, but ignoring the interrupt
+ * altogether is not helpful, so we just terminate the wait with a
+ * suitable warning.
+ */
+ if (QueryCancelPending)
+ {
+ QueryCancelPending = false;
+ ereport(WARNING,
+ (errmsg("canceling wait for synchronous replication due to user request"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * If the postmaster dies, we'll probably never get an
+ * acknowledgement, because all the wal sender processes will exit. So
+ * just bail out.
+ */
+ if (!PostmasterIsAlive())
+ {
+ ProcDiePending = true;
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ return false;
+}
+
/*
* ===========================================================
* Synchronous Replication functions executed by any process
causal-reads-v16.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b894e31..70f2902 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2888,6 +2888,35 @@ include_dir 'conf.d'
across the cluster without problems if that is required.
</para>
+ <sect2 id="runtime-config-replication-all">
+ <title>All Servers</title>
+ <para>
+ These parameters can be set on the primary or any standby.
+ </para>
+ <variablelist>
+ <varlistentry id="guc-causal-reads" xreflabel="causal_reads">
+ <term><varname>causal_reads</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>causal_reads</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables causal consistency between transactions run on different
+ servers. A transaction that is run on a standby
+ with <varname>causal_reads</> set to <literal>on</> is guaranteed
+ either to see the effects of all completed transactions run on the
+ primary with the setting on, or to receive an error "standby is not
+ available for causal reads". Note that both transactions involved in
+ a causal dependency (a write on the primary followed by a read on any
+ server which must see the write) must be run with the setting on.
+ See <xref linkend="causal-reads"> for more details.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-replication-sender">
<title>Sending Server(s)</title>
@@ -3189,6 +3218,48 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><varname>causal_reads_timeout</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>causal_reads_timeout</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the maximum replay lag the primary will tolerate from a
+ standby before dropping it from the set of standbys available for
+ causal reads.
+ </para>
+ <para>
+ This setting is also used to control the <firstterm>leases</> used to
+ maintain the causal reads guarantee. It must be set to a value which
+ is at least 4 times the maximum possible difference in system clocks
+ between the primary and standby servers, as described
+ in <xref linkend="causal-reads">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-causal-reads-standby-names" xreflabel="causal-reads-standby-names">
+ <term><varname>causal_reads_standby_names</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>causal_reads_standby_names</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a comma-separated list of standby names that can support
+ <firstterm>causal reads</>, as described in
+ <xref linkend="causal-reads">. Follows the same convention
as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</></>.
+ The default is <literal>*</>, matching all standbys.
+ </para>
+ <para>
+ This setting has no effect if <varname>causal_reads_timeout</> is not set.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a1a9532..b42a55e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1117,7 +1117,7 @@ primary_slot_name = 'node_a_slot'
cause each commit to wait until the current synchronous standbys report
that they have replayed the transaction, making it visible to user
queries. In simple cases, this allows for load balancing with causal
- consistency.
+ consistency. See also <xref linkend="causal-reads">.
</para>
<para>
@@ -1304,6 +1304,119 @@ synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="causal-reads">
+ <title>Causal reads</title>
+ <indexterm>
+ <primary>causal reads</primary>
+ <secondary>in standby</secondary>
+ </indexterm>
+
+ <para>
+ The causal reads feature allows read-only queries to run on hot standby
+ servers without exposing stale data to the client, providing a form of
+ causal consistency. Transactions can run on any standby with the
+ following guarantee about the visibility of preceding transactions: If you
+ set <varname>causal_reads</> to <literal>on</> in any pair of consecutive
+ transactions tx1, tx2 where tx2 begins after tx1 successfully returns,
+ then tx2 will either see tx1 or fail with a new error "standby is not
+ available for causal reads", no matter which server it runs on. Although
+ the guarantee is expressed in terms of two individual transactions, the
+ GUC can also be set at session, role or system level to make the guarantee
+ generally, allowing for load balancing of applications that were not
+ designed with load balancing in mind.
+ </para>
+
+ <para>
+ In order to enable the feature, <varname>causal_reads_timeout</> must be
+ set to a non-zero value on the primary server. The
+ GUC <varname>causal_reads_standby_names</> can be used to limit the set of
+ standbys that can join the dynamic set of causal reads standbys by
+ providing a comma-separated list of application names. By default, all
+ standbys are candidates, if the feature is enabled.
+ </para>
+
+ <para>
+ The current set of servers that the primary considers to be available for
+ causal reads can be seen in
+ the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</></>
+ view. Administrators, applications and load balancing middleware can use
+ this view to discover standbys that can currently handle causal reads
+ transactions without raising the error. Since that information is only an
+ instantaneous snapshot, clients should still be prepared for the error
+ to be raised at any time, and consider redirecting transactions to another
+ standby.
+ </para>
+
+ <para>
+ The advantages of the causal reads feature over simply
+ setting <varname>synchronous_commit</> to <literal>remote_apply</> are:
+ <orderedlist>
+ <listitem>
+ <para>
+ It provides certainty about exactly which standbys can see a
+ transaction.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It places a configurable limit on how much replay lag (and therefore
+ delay at commit time) the primary tolerates from standbys before it
+ drops them from the dynamic set of standbys it waits for.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It upholds the causal reads guarantee during the transitions that
+ occur when new standbys are added or removed from the set of standbys,
+ including scenarios where contact has been lost between the primary
+ and standbys but the standby is still alive and running client
+ queries.
+ </para>
+ </listitem>
+ </orderedlist>
+ </para>
+
+ <para>
+ The protocol used to uphold the guarantee even in the case of network
+ failure depends on the system clocks of the primary and standby servers
+ being synchronized, with an allowance for a difference up to one quarter
+ of <varname>causal_reads_timeout</>. For example,
+ if <varname>causal_reads_timeout</> is set to <literal>4s</>, then the
+ clocks must not be further than 1 second apart for the guarantee to be
+ upheld reliably during transitions. The ubiquity of the Network Time
+ Protocol (NTP) on modern operating systems and availability of high
+ quality time servers makes it possible to choose a tolerance significantly
+ higher than the maximum expected clock difference. An effort is
+ nevertheless made to detect and report misconfigured and faulty systems
+ with clock differences greater than the configured tolerance.
+ </para>
+
+ <note>
+ <para>
+ Current hardware clocks, NTP implementations and public time servers are
+ unlikely to allow the system clocks to differ more than tens or hundreds
+ of milliseconds, and systems synchronized with dedicated local time
+ servers may be considerably more accurate, but you should only consider
+ setting <varname>causal_reads_timeout</> below 4 seconds (allowing up to
+ 1 second of clock difference) after researching your time synchronization
+ infrastructure thoroughly.
+ </para>
+ </note>
+
+ <note>
+ <para>
+ While similar to synchronous replication in the sense that both involve
+ the primary server waiting for responses from standby servers, the
+ causal reads feature is not concerned with avoiding data loss. A
+ primary configured for causal reads will drop all standbys that stop
+ responding or replay too slowly from the dynamic set that it waits for,
+ so you should consider configuring both synchronous replication and
+ causal reads if you need data loss avoidance guarantees and causal
+ consistency guarantees for load balancing.
+ </para>
+ </note>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous archiving in standby</title>
@@ -1652,7 +1765,16 @@ if (!triggered)
so there will be a measurable delay between primary and standby. Running the
same query nearly simultaneously on both primary and standby might therefore
return differing results. We say that data on the standby is
- <firstterm>eventually consistent</firstterm> with the primary. Once the
+ <firstterm>eventually consistent</firstterm> with the primary by default.
+ The data visible to a transaction running on a standby can be
+ made <firstterm>causally consistent</> with respect to a transaction that
+ has completed on the primary by setting <varname>causal_reads</>
+ to <literal>on</> in both transactions. For more details,
+ see <xref linkend="causal-reads">.
+ </para>
+
+ <para>
+ Once the
commit record for a transaction is replayed on the standby, the changes
made by that transaction will be visible to any new snapshots taken on
the standby. Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a422ac0..0006998 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1461,6 +1461,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</itemizedlist>
</entry>
</row>
+ <row>
+ <entry><structfield>causal_reads_state</></entry>
+ <entry><type>text</></entry>
+ <entry>Causal reads state of this standby server. This field will be
non-null only if <varname>causal_reads_timeout</> is set. If a standby is
+ in <literal>available</> state, then it can currently serve causal reads
+ queries. If it is not replaying fast enough or not responding to
+ keepalive messages, it will be in <literal>unavailable</> state, and if
+ it is currently transitioning to availability it will be
+ in <literal>joining</> state for a short time.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604..0789010 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2102,11 +2102,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for causal reads and synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
+ CausalReadsWaitForLSN(recptr);
SyncRepWaitForLSN(recptr, true);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e47fd44..4dde457 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1339,7 +1339,10 @@ RecordTransactionCommit(void)
* in the procarray and continue to hold locks.
*/
if (wrote_xlog && markXidCommitted)
+ {
+ CausalReadsWaitForLSN(XactLastRecEnd);
SyncRepWaitForLSN(XactLastRecEnd, true);
+ }
/* remember end of last commit record */
XactLastCommitEnd = XactLastRecEnd;
@@ -5142,7 +5145,7 @@ XactLogCommitRecord(TimestampTz commit_time,
* Check if the caller would like to ask standbys for immediate feedback
* once this commit is applied.
*/
- if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+ if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || causal_reads)
xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e7312f..3e58bd3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11981,8 +11981,10 @@ StoreXLogTimestampAtLsn(XLogTimestampBuffer *buffer,
* server. The timestamp will be sent back to the upstream server via
* walreceiver when the WAL position is eventually written, flushed and
* applied.
+ *
+ * Returns true if 'lsn' has already been applied.
*/
-void
+bool
SetXLogTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn)
{
bool applied_end = false;
@@ -12020,6 +12022,8 @@ SetXLogTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn)
}
SpinLockRelease(&XLogCtl->info_lck);
+
+ return applied_end;
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2fd63e3..9321895 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -689,7 +689,8 @@ CREATE VIEW pg_stat_replication AS
W.flush_lag,
W.replay_lag,
W.sync_priority,
- W.sync_state
+ W.sync_state,
+ W.causal_reads_state
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 61e6a2c..4457fd6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3368,6 +3368,12 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_BGWORKER_STARTUP:
event_name = "BgWorkerStartup";
break;
+ case WAIT_EVENT_CAUSAL_READS_APPLY:
+ event_name = "CausalReadsApply";
+ break;
+ case WAIT_EVENT_CAUSAL_READS_REVOKE:
+ event_name = "CausalReadsRevoke";
+ break;
case WAIT_EVENT_EXECUTE_GATHER:
event_name = "ExecuteGather";
break;
diff --git a/src/backend/replication/README.causal_reads b/src/backend/replication/README.causal_reads
new file mode 100644
index 0000000..1fddd62
--- /dev/null
+++ b/src/backend/replication/README.causal_reads
@@ -0,0 +1,193 @@
+The causal reads guarantee says: If you run any two consecutive
+transactions tx1, tx2 where tx1 completes before tx2 begins, with
+causal_reads set to "on" in both transactions, tx2 will see tx1 or
+raise an error to complain that it can't guarantee causal consistency,
+no matter which servers (primary or any standby) you run each
+transaction on.
+
+When both transactions run on the primary, the guarantee is trivially
+upheld.
+
+To deal with read-only physical streaming standbys, the primary keeps
+track of a set of standbys that it considers to be currently
+"available" for causal reads, and sends a stream of "leases" to those
+standbys granting them the right to handle causal reads transactions
+for a short time without any further communication with the primary.
+
+In general, the primary provides the guarantee by waiting for all of
+the "available" standbys to report that they have applied a
+transaction. However, the set of available standbys is dynamic, and
+things get more complicated during state transitions. There are two
+types of transitions to consider:
+
+1. unavailable->joining->available
+
+Standbys start out as "unavailable". If a standby is unavailable and
+is applying fast enough and matches causal_reads_standby_names, the
+primary transitions it to "available", but first it sets it to
+"joining" until it is sure that any transaction committed while it was
+unavailable has definitely been applied on the standby. This closes a
+race that would otherwise exist if we moved directly to available
+state: tx1 might not wait for a given standby because it's
+unavailable, then a lease might be granted, and then tx2 might run a
+causal reads transaction without error but see stale data. The
+joining state acts as an airlock: while in joining state, the primary
+waits for that standby to replay causal reads transactions in
+anticipation of the move to available, but it doesn't progress to
+available state and grant a lease to the standby until everything
+preceding joining state has also been applied.
+
+2. available->unavailable
+
+If a standby is not applying fast enough or not responding to
+keepalive messages, then the primary kicks that standby out of the
+dynamic set of available standbys, that is, marks it as "unavailable".
+In order to make sure that the standby has started rejecting causal
+reads transactions, it needs to revoke the lease it most recently
+granted. It does that by waiting for the lease to expire before
+allowing any causal reads commits to return. (In future there could
+be a fast-path revocation message which waits for a serial-numbered
+acknowledgement to reduce waiting in the case where the standby is
+lagging but still reachable and responding).
+
+The rest of this document illustrates how clock skew affects the
+available->unavailable transition.
+
+The following 4 variables are derived from a single GUC, and these
+values will be used in the following illustrations:
+
+causal_reads_timeout = 4s
+lease_time = 4s (= causal_reads_timeout)
+keepalive_time = 2s (= lease_time / 2)
+max_clock_skew = 1s (= lease_time / 4)
+
+Every keepalive_time, the primary transmits a lease that expires at
+local_clock_time + lease_time - max_clock_skew, shown in the following
+diagram as 't' for transmission time and '|' for expiry time. If
+contact is lost with a standby, the primary will wait until sent_time
++ lease_time for the most recently granted lease to expire, shown on
+the following diagram 'x', to be sure that the standby's clock has
+reached the expiry time even if its clock differs by up to
+max_clock_skew. In other words, the primary tells the standby that
+the expiry time is at one time, but it trusts that the standby will
+surely agree if it gives it some extra time. The extra time is
+max_clock_skew. If the clocks differ by more than max_clock_skew, all
+bets are off (but see below for attempt to detect obvious cases).
+
+0 1 2 3 4 5 6 7 8 9
+t-----------------|-----x
+ t-----------------|-----x
+ t-----------------|-----x
+ t-----------------|...
+ t------...
+
+A standby whose clock is 2 seconds ahead of the primary's clock
+perceives gaps in the stream of leases, and will reject causal_reads
+transactions in those intervals. The causal reads guarantee is
+upheld, but spurious errors are raised between leases, as a
+consequence of the clock skew being greater than max_clock_skew. In
+the following diagram 'r' shows reception time, and the timeline along
+the top shows the standby's local clock time.
+
+2 3 4 5 6 7 8 9 10 11
+r-----|
+ r-----|
+ r-----|
+ r-----|
+ r-----|
+
+If there were no network latency, a standby whose clock is exactly 1
+second ahead of the primary's clock would perceive the stream of
+leases as being replaced just in time, so there is no gap. Since in
+reality the time of receipt is some time after the time of
+transmission due to network latency, if the standby's clock is exactly
+1 second ahead, then there will be small network-latency-sized gaps
+before the next lease arrives, but still no correctness problem with
+respect to the causal reads guarantee.
+
+1 2 3 4 5 6 7 8 9 10
+r-----------|
+ r-----------|
+ r-----------|
+ r-----------|
+ r------...
+
+A standby whose clock is perfectly in sync with the primary's
+perceives the stream of leases overlapping (this matches the primary's
+perception of the leases it sent):
+
+0 1 2 3 4 5 6 7 8 9
+r-----------------|
+ r-----------------|
+ r-----------------|
+ r-----------------|
+ r------...
+
+A standby whose clock is exactly 1 second behind the primary's
+perceives the stream of leases as overlapping even more, but the time
+of expiry as judged by the standby is no later than the time the
+primary will wait for if required ('x'). That is, if contact is lost
+with the standby, the primary can still reliably hold up causal reads
+commits until the standby has started raising the error in
+causal_reads transactions.
+
+-1 0 1 2 3 4 5 6 7 8
+r-----------------------|
+ r-----------------------|
+ r-----------------------|
+ r------------------...
+ r------...
+
+
+A standby whose clock is 2 seconds behind the primary's would perceive
+the stream of leases overlapping even more, and the primary would no
+longer be able to wait for a lease to expire if it wanted to revoke
+it. But because the expiry time is after local_clock_time +
+lease_time, the standby can immediately see that its own clock must be
+more than 1 second behind the primary's, so it ignores the lease and
+logs a clock skew warning. In the following diagram a lease expiry
+time that is obviously generated by a primary with a clock set too far
+in the future compared to the local clock is shown with a '!'.
+
+-2 -1 0 1 2 3 4 5 6 7
+r-----------------------------!
+ r-----------------------------!
+ r-----------------------------!
+ r------------------...
+ r------...
+
+A danger window exists when the standby's clock is more than
+max_clock_skew behind the primary's clock, but not more than
+max_clock_skew + network latency time behind. If the clock difference
+is in that range, then the algorithm presented above which is based on
+time of receipt cannot detect that the local clock is too far behind.
+The consequence of this problem could be as follows:
+
+1. The standby loses contact with the primary due to a network fault.
+
+2. The primary decides to drop the standby from the set of available
+ causal reads standbys due to lack of keepalive responses or
+ excessive lag, which necessitates holding up commits of causal
+ reads transactions until the most recently sent lease expires, in
+ the belief that the standby will definitely have started raising
+ the 'causal reads unavailable' error in causal reads transactions
+ by that time, if it is still alive and servicing requests.
+
+3. The standby still has clients connected and running queries.
+
+4. Due to clock skew in the problematic range, in the standby's
+ opinion the lease lasts slightly longer than the primary waits.
+
+5. For a short window at most the duration of the network latency
+ time, clients running causal reads transactions are allowed to see
+ potentially stale data.
+
+For this reason we say that the causal reads guarantee only holds as
+long as the absolute difference between the system clocks of the
+machines is no more than max_clock_skew. The theory is that NTP makes
+it possible to reason about the maximum possible clock difference
+between machines and choose a value that allows for a much larger
+difference. However, we do make a best effort attempt to detect
+wildly divergent systems as described above, to catch the case of
+servers not running a correctly configured ntp daemon, or with a clock
+so far out of whack that ntp refuses to fix it.
\ No newline at end of file
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 4ab47a2..7c96c5b 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -82,6 +82,11 @@
#include "utils/builtins.h"
#include "utils/ps_status.h"
+/* GUC variables */
+int causal_reads_timeout;
+bool causal_reads;
+char *causal_reads_standby_names;
+
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
@@ -95,7 +100,7 @@ static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
-static int SyncRepWakeQueue(bool all, int mode);
+static int SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
static bool SyncRepCheckForEarlyExit(void);
@@ -127,6 +132,200 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
+ * Check if we can stop waiting for causal consistency. We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1. All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2. Any stall periods caused by standbys dropping out of 'available' state
+ * have passed, so that we can be sure that their leases have expired and they
+ * have started rejecting causal reads transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for. The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for to observe any current
+ * commit stall.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting causal_reads transactions.
+ */
+static bool
+CausalReadsCommitCanReturn(XLogRecPtr XactCommitLSN,
+ int *waitingFor,
+ long *stallTimeMillis)
+{
+ int i;
+ TimestampTz now;
+
+ /* Count how many joining/available nodes we are waiting for. */
+ *waitingFor = 0;
+ for (i = 0; i < max_wal_senders; ++i)
+ {
+ WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ /*
+ * Assuming atomic read of pid_t, we can check walsnd->pid without
+ * acquiring the spinlock to avoid memory synchronization costs for
+ * unused walsender slots. We see a value that existed sometime at
+ * least as recently as the last memory barrier.
+ */
+ if (walsnd->pid != 0)
+ {
+ /*
+ * We need to hold the spinlock to read LSNs, because we can't be
+ * sure they can be read atomically.
+ */
+ SpinLockAcquire(&walsnd->mutex);
+ if (walsnd->pid != 0 && walsnd->causal_reads_state >= WALSNDCRSTATE_JOINING)
+ {
+ if (walsnd->apply < XactCommitLSN)
+ ++*waitingFor;
+ }
+ SpinLockRelease(&walsnd->mutex);
+ }
+ }
+
+ /* Check if there is a stall in progress that we need to observe. */
+ now = GetCurrentTimestamp();
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ if (WalSndCtl->stall_causal_reads_until > now)
+ {
+ long seconds;
+ int usecs;
+
+ /* Compute how long we have to wait, rounded up to nearest ms. */
+ TimestampDifference(now, WalSndCtl->stall_causal_reads_until,
+ &seconds, &usecs);
+ *stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+ }
+ else
+ *stallTimeMillis = 0;
+ LWLockRelease(SyncRepLock);
+
+ /* We are done if we are not waiting for any nodes or stalls. */
+ return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for causal consistency in causal_reads mode, if requested by user.
+ */
+void
+CausalReadsWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ long stallTimeMillis;
+ int waitingFor;
+ char *ps_display_buffer = NULL;
+
+ /* Leave if we aren't in causal_reads mode. */
+ if (!causal_reads)
+ return;
+
+ for (;;)
+ {
+ /* Reset latch before checking state. */
+ ResetLatch(MyLatch);
+
+ /*
+ * Join the queue to be woken up if any causal reads joining/available
+ * standby applies XactCommitLSN or the set of causal reads standbys
+ * changes (if we aren't already in the queue). We don't actually know
+ * if we need to wait for any peers to reach the target LSN yet, but
+ * we have to register just in case before checking the walsenders'
+ * state to avoid a race condition that could occur if we did it after
+ * calling CausalReadsCommitCanReturn. (SyncRepWaitForLSN doesn't
+ * have to do this because it can check the highest-seen LSN in
+ * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+ * lock as the queues. We can't do that here, because there is no
+ * single highest-seen LSN that is useful. We must check
+ * walsnd->apply for all relevant walsenders. Therefore we must
+ * register for notifications first, so that we can be notified via
+ * our latch of any standby applying the LSN we're interested in after
+ * we check but before we start waiting, or we could wait forever for
+ * something that has already happened.)
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ if (MyProc->syncRepState != SYNC_REP_WAITING)
+ {
+ MyProc->waitLSN = XactCommitLSN;
+ MyProc->syncRepState = SYNC_REP_WAITING;
+ SyncRepQueueInsert(SYNC_REP_WAIT_CAUSAL_READS);
+ Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_CAUSAL_READS));
+ }
+ LWLockRelease(SyncRepLock);
+
+ /* Check if we're done. */
+ if (CausalReadsCommitCanReturn(XactCommitLSN, &waitingFor, &stallTimeMillis))
+ {
+ SyncRepCancelWait();
+ break;
+ }
+
+ Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+ /* If we aren't actually waiting for any standbys, leave the queue. */
+ if (waitingFor == 0)
+ SyncRepCancelWait();
+
+ /* Update the ps title. */
+ if (update_process_title)
+ {
+ char buffer[80];
+
+ /* Remember the old value if this is our first update. */
+ if (ps_display_buffer == NULL)
+ {
+ int len;
+ const char *ps_display = get_ps_display(&len);
+
+ ps_display_buffer = palloc(len + 1);
+ memcpy(ps_display_buffer, ps_display, len);
+ ps_display_buffer[len] = '\0';
+ }
+
+ snprintf(buffer, sizeof(buffer),
+ "waiting for %d peer(s) to apply %X/%X%s",
+ waitingFor,
+ (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+ stallTimeMillis > 0 ? " (revoking)" : "");
+ set_ps_display(buffer, false);
+ }
+
+ /* Check if we need to exit early due to postmaster death etc. */
+ if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+ break;
+
+ /*
+ * If we are still waiting for peers, then we wait for any joining or
+ * available peer to reach the LSN (or possibly stop being in one of
+ * those states or go away).
+ *
+ * If not, there must be a non-zero stall time, so we wait for that to
+ * elapse.
+ */
+ if (waitingFor > 0)
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+ WAIT_EVENT_CAUSAL_READS_APPLY);
+ else
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+ stallTimeMillis,
+ WAIT_EVENT_CAUSAL_READS_REVOKE);
+ }
+
+ /* There is no way out of the loop that could leave us in the queue. */
+ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+ MyProc->waitLSN = 0;
+
+ /* Restore the ps display. */
+ if (ps_display_buffer != NULL)
+ {
+ set_ps_display(ps_display_buffer, false);
+ pfree(ps_display_buffer);
+ }
+}
+
+/*
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
@@ -349,6 +548,53 @@ SyncRepInitConfig(void)
}
/*
+ * Check if the current WALSender process's application_name matches a name in
+ * causal_reads_standby_names (including '*' for wildcard).
+ */
+bool
+CausalReadsPotentialStandby(void)
+{
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ bool found = false;
+
+ /* If the feature is disabled, then no. */
+ if (causal_reads_timeout == 0)
+ return false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(causal_reads_standby_names);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ /* GUC machinery will have already complained - no need to do again */
+ return false;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return found;
+}
+
+/*
* Update the LSNs on each queue based upon our latest state. This
* implements a simple policy of first-valid-sync-standby-releases-waiter.
*
@@ -356,7 +602,7 @@ SyncRepInitConfig(void)
* perhaps also which information we store as well.
*/
void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool walsender_cr_available_or_joining)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
XLogRecPtr writePtr;
@@ -370,13 +616,15 @@ SyncRepReleaseWaiters(void)
/*
* If this WALSender is serving a standby that is not on the list of
- * potential sync standbys then we have nothing to do. If we are still
- * starting up, still running base backup or the current flush position is
- * still invalid, then leave quickly also.
+ * potential sync standbys and not in a state that causal_reads waits for,
+ * then we have nothing to do. If we are still starting up, still running
+ * base backup or the current flush position is still invalid, then leave
+ * quickly also.
*/
- if (MyWalSnd->sync_standby_priority == 0 ||
- MyWalSnd->state < WALSNDSTATE_STREAMING ||
- XLogRecPtrIsInvalid(MyWalSnd->flush))
+ if (!walsender_cr_available_or_joining &&
+ (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING ||
+ XLogRecPtrIsInvalid(MyWalSnd->flush)))
{
announce_next_takeover = true;
return;
@@ -414,9 +662,10 @@ SyncRepReleaseWaiters(void)
/*
* If the number of sync standbys is less than requested or we aren't
- * managing a sync standby then just leave.
+ * managing a sync standby or a standby in causal reads 'joining' or
+ * 'available' state, then just leave.
*/
- if (!got_recptr || !am_sync)
+ if ((!got_recptr || !am_sync) && !walsender_cr_available_or_joining)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = !am_sync;
@@ -425,24 +674,35 @@ SyncRepReleaseWaiters(void)
/*
* Set the lsn first so that when we wake backends they will release up to
- * this location.
+ * this location, for backends waiting for synchronous commit.
*/
- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ if (got_recptr && am_sync)
{
- walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
- numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
- numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
- numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+ if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+ numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+ numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+ numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+ }
}
+ /*
+ * Wake backends that are waiting for causal_reads, if this walsender
+ * manages a standby that is in causal reads 'available' or 'joining'
+ * state.
+ */
+ if (walsender_cr_available_or_joining)
+ SyncRepWakeQueue(false, SYNC_REP_WAIT_CAUSAL_READS, MyWalSnd->apply);
+
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -912,9 +1172,8 @@ SyncRepGetStandbyPriority(void)
* Must hold SyncRepLock.
*/
static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
{
- volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
@@ -931,7 +1190,7 @@ SyncRepWakeQueue(bool all, int mode)
/*
* Assume the queue is ordered by LSN
*/
- if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+ if (!all && lsn < proc->waitLSN)
return numprocs;
/*
@@ -991,7 +1250,7 @@ SyncRepUpdateSyncStandbysDefined(void)
int i;
for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
- SyncRepWakeQueue(true, i);
+ SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
}
/*
@@ -1101,13 +1360,31 @@ SyncRepCheckForEarlyExit(void)
}
/*
+ * Make sure that CausalReadsWaitForLSN can't return until after the given
+ * lease expiry time has been reached. In other words, revoke the lease.
+ *
+ * Wake up all backends waiting in CausalReadsWaitForLSN, because the set of
+ * available/joining peers has changed, and there is a new stall time they
+ * need to observe.
+ */
+void
+CausalReadsBeginStall(TimestampTz lease_expiry_time)
+{
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ WalSndCtl->stall_causal_reads_until =
+ Max(WalSndCtl->stall_causal_reads_until, lease_expiry_time);
+ SyncRepWakeQueue(true, SYNC_REP_WAIT_CAUSAL_READS, InvalidXLogRecPtr);
+ LWLockRelease(SyncRepLock);
+}
+
+/*
* ===========================================================
* Synchronous Replication functions executed by any process
* ===========================================================
*/
bool
-check_synchronous_standby_names(char **newval, void **extra, GucSource source)
+check_standby_names(char **newval, void **extra, GucSource source)
{
if (*newval != NULL && (*newval)[0] != '\0')
{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 621aa24..6ad1c42 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -56,6 +56,7 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/ipc.h"
@@ -145,7 +146,8 @@ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
static void XLogWalRcvSendReply(bool force, bool requestReply, int timestamps);
static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static bool ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *causalReadsUntil);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -895,6 +897,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
XLogRecPtr walEnd;
TimestampTz sendTime;
bool replyRequested;
+ TimestampTz causalReadsLease;
+ bool applied_end;
resetStringInfo(&incoming_message);
@@ -915,7 +919,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
walEnd = pq_getmsgint64(&incoming_message);
sendTime = IntegerTimestampToTimestampTz(
pq_getmsgint64(&incoming_message));
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, NULL);
buf += hdrlen;
len -= hdrlen;
@@ -925,7 +929,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
case 'k': /* Keepalive */
{
/* copy message to StringInfo */
- hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+ hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char) + sizeof(int64);
if (len != hdrlen)
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -937,12 +941,16 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
sendTime = IntegerTimestampToTimestampTz(
pq_getmsgint64(&incoming_message));
replyRequested = pq_getmsgbyte(&incoming_message);
+ causalReadsLease = IntegerTimestampToTimestampTz(
+ pq_getmsgint64(&incoming_message));
- ProcessWalSndrMessage(walEnd, sendTime);
+ applied_end = ProcessWalSndrMessage(walEnd, sendTime,
+ &causalReadsLease);
- /* If the primary requested a reply, send one immediately */
- if (replyRequested)
+ /* If the primary requested a reply, send one immediately. */
+ if (replyRequested || applied_end)
XLogWalRcvSendReply(true, false, 0);
+
break;
}
default:
@@ -1323,13 +1331,54 @@ XLogWalRcvSendHSFeedback(bool immed)
* Update shared memory status upon receiving a message from primary.
*
* 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary. 'causalReadsLease' points to the time until
+ * which the primary promises that this standby can safely claim to be
+ * causally consistent, points to 0 if it cannot, or is NULL for no change.
+ *
+ * Returns true if this standby has already replayed 'walEnd'.
*/
-static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+static bool
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *causalReadsLease)
{
WalRcvData *walrcv = WalRcv;
TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+ bool applied_end = false;
+
+ /* Sanity check for the causalReadsLease time. */
+ if (causalReadsLease != NULL && *causalReadsLease != 0)
+ {
+ /* Deduce max_clock_skew from the causalReadsLease and sendTime. */
+#ifdef HAVE_INT64_TIMESTAMP
+ int64 diffMillis = (*causalReadsLease - sendTime) / 1000;
+#else
+ int64 diffMillis = (*causalReadsLease - sendTime) * 1000;
+#endif
+ int64 max_clock_skew = diffMillis / (CAUSAL_READS_CLOCK_SKEW_RATIO - 1);
+
+ if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+ max_clock_skew))
+ {
+ /*
+ * The primary's clock is more than max_clock_skew + network
+ * latency ahead of the standby's clock. (If the primary's clock
+ * is more than max_clock_skew ahead of the standby's clock, but
+ * by less than the network latency, then there isn't much we can
+ * do to detect that; but it still seems useful to have this basic
+ * sanity check for wildly misconfigured servers.)
+ */
+ elog(LOG, "the primary server's clock time is too far ahead");
+ causalReadsLease = NULL;
+ }
+ /*
+ * We could also try to detect cases where sendTime is more than
+ * max_clock_skew in the past according to the standby's clock, but
+ * that is indistinguishable from network latency/buffering, so we
+ * could produce misleading error messages; if we do nothing, the
+ * consequence is 'standby is not available for causal reads' errors
+ * which should cause the user to investigate.
+ */
+ }
/* Update shared-memory status */
SpinLockAcquire(&walrcv->mutex);
@@ -1338,6 +1387,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
walrcv->latestWalEnd = walEnd;
walrcv->lastMsgSendTime = sendTime;
walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+ if (causalReadsLease != NULL)
+ walrcv->causalReadsLease = *causalReadsLease;
SpinLockRelease(&walrcv->mutex);
/*
@@ -1348,7 +1399,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
* purposes.
*/
if (replication_lag_sample_interval != -1)
- SetXLogTimestampAtLsn(sendTime, walEnd);
+ applied_end = SetXLogTimestampAtLsn(sendTime, walEnd);
if (log_min_messages <= DEBUG2)
{
@@ -1377,6 +1428,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
pfree(sendtime);
pfree(receipttime);
}
+
+ return applied_end;
}
/*
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 01111a4..f09f81d 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -28,6 +28,7 @@
#include "replication/walreceiver.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/guc.h"
#include "utils/timestamp.h"
WalRcvData *WalRcv = NULL;
@@ -374,3 +375,21 @@ GetReplicationTransferLatency(void)
return ms;
}
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for causal reads.
+ */
+bool
+WalRcvCausalReadsAvailable(void)
+{
+ WalRcvData *walrcv = WalRcv;
+ TimestampTz now = GetCurrentTimestamp();
+ bool result;
+
+ SpinLockAcquire(&walrcv->mutex);
+ result = walrcv->causalReadsLease != 0 && now <= walrcv->causalReadsLease;
+ SpinLockRelease(&walrcv->mutex);
+
+ return result;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3fbca0c..d26c950 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -155,9 +155,20 @@ static StringInfoData tmpbuf;
*/
static TimestampTz last_reply_timestamp = 0;
+static TimestampTz last_keepalive_timestamp = 0;
+
/* Have we sent a heartbeat message asking for reply, since last reply? */
static bool waiting_for_ping_response = false;
+/* Up to which LSN do we need to stay in JOINING state? */
+static XLogRecPtr causal_reads_joining_until = 0;
+
+/* The last causal reads lease sent to the standby. */
+static TimestampTz causal_reads_last_lease = 0;
+
+/* Is this WALSender listed in causal_reads_standby_names? */
+static bool am_potential_causal_reads_standby = false;
+
/*
* While streaming WAL in Copy mode, streamingDoneSending is set to true
* after we have sent CopyDone. We should not send any more CopyData messages
@@ -243,6 +254,57 @@ InitWalSender(void)
SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);
}
+ /*
+ * If we are exiting unexpectedly, we may need to communicate with concurrent
+ * causal_reads commits to maintain the causal consistency guarantee.
+ */
+static void
+PrepareUncleanExit(void)
+{
+ if (MyWalSnd->causal_reads_state == WALSNDCRSTATE_AVAILABLE)
+ {
+ /*
+ * We've lost contact with the standby, but it may still be alive. We
+ * can't let any causal_reads transactions return until we've stalled
+ * for long enough for a zombie standby to start raising errors
+ * because its lease has expired.
+ */
+ elog(LOG, "standby \"%s\" is lost (no longer available for causal reads)", application_name);
+ CausalReadsBeginStall(causal_reads_last_lease);
+
+ /*
+ * We set the state to a lower level _after_ beginning the stall,
+ * otherwise there would be a tiny window where commits could return
+ * without observing the stall.
+ */
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->causal_reads_state = WALSNDCRSTATE_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+ if (MyWalSnd->causal_reads_state == WALSNDCRSTATE_AVAILABLE)
+ {
+ /*
+ * The standby is shutting down, so it won't be running any more
+ * transactions. It is therefore safe to stop waiting for it, and no
+ * stall is necessary.
+ */
+ elog(LOG, "standby \"%s\" is leaving (no longer available for causal reads)", application_name);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->causal_reads_state = WALSNDCRSTATE_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
/*
* Clean up after an error.
*
@@ -270,7 +332,10 @@ WalSndErrorCleanup(void)
replication_active = false;
if (walsender_ready_to_stop)
+ {
+ PrepareUncleanExit();
proc_exit(0);
+ }
/* Revert back to startup state */
WalSndSetState(WALSNDSTATE_STARTUP);
@@ -282,6 +347,8 @@ WalSndErrorCleanup(void)
static void
WalSndShutdown(void)
{
+ PrepareUncleanExit();
+
/*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -1396,6 +1463,7 @@ ProcessRepliesIfAny(void)
if (r < 0)
{
/* unexpected error or EOF */
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1412,6 +1480,7 @@ ProcessRepliesIfAny(void)
resetStringInfo(&reply_message);
if (pq_getmessage(&reply_message, 0))
{
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1461,6 +1530,7 @@ ProcessRepliesIfAny(void)
* 'X' means that the standby is closing down the socket.
*/
case 'X':
+ PrepareCleanExit();
proc_exit(0);
default:
@@ -1612,6 +1682,88 @@ ProcessStandbyReplyMessage(void)
*/
{
WalSnd *walsnd = MyWalSnd;
+ WalSndCausalReadsState causal_reads_state = walsnd->causal_reads_state;
+ bool causal_reads_state_changed = false;
+ bool causal_reads_set_joining_until = false;
+
+ /*
+ * Handle causal reads state transitions if a causal_reads_timeout is
+ * configured, this standby is listed in causal_reads_standby_names,
+ * and we are a primary database (not a cascading standby).
+ */
+ if (am_potential_causal_reads_standby && !am_cascading_walsender)
+ {
+ if ((walsnd->applyLagUs != -1 && applyPtr == GetFlushRecPtr()) ||
+ (applyLagUs >= 0 && applyLagUs / 1000 < causal_reads_timeout))
+ {
+ /*
+ * Either the standby has replayed completely (but is
+ * definitely configured to send replication samples), or it
+ * hasn't replayed completely and its lag time is acceptable.
+ */
+ if (causal_reads_state == WALSNDCRSTATE_UNAVAILABLE)
+ {
+ /*
+ * The standby is applying fast enough. We can't grant a
+ * lease yet though, we need to wait for everything that
+ * was committed while this standby was unavailable to be
+ * applied first. We move to joining state while we wait
+ * for the standby to catch up.
+ */
+ causal_reads_state = WALSNDCRSTATE_JOINING;
+ causal_reads_set_joining_until = true;
+ causal_reads_state_changed = true;
+ }
+ else if (causal_reads_state == WALSNDCRSTATE_JOINING &&
+ applyPtr >= causal_reads_joining_until)
+ {
+ /*
+ * The standby has applied everything committed before we
+ * reached joining state, and has been waiting for remote
+ * apply on this standby while it's been in joining state,
+ * so it is safe to move to available state and send a
+ * lease.
+ */
+ causal_reads_state = WALSNDCRSTATE_AVAILABLE;
+ causal_reads_state_changed = true;
+ }
+ }
+ else if (applyLagUs >= 0)
+ {
+ /* Not replaying fast enough. */
+ if (causal_reads_state == WALSNDCRSTATE_AVAILABLE)
+ {
+ causal_reads_state = WALSNDCRSTATE_UNAVAILABLE;
+ causal_reads_state_changed = true;
+ /*
+ * We are dropping a causal reads available standby, so we
+ * mustn't let any commit command that is waiting in
+ * CausalReadsWaitForLSN return until we are sure that the
+ * standby definitely knows that it's not available and
+ * starts raising errors for causal_reads transactions.
+ * TODO: We could just wait until the standby acks that
+ * its lease has been revoked, and start numbering
+ * keepalives and sending the number back in replies, so
+ * we know it's acking the right message; then lagging
+ * standbys would be less disruptive, but for now we just
+ * wait for the lease to expire, as we do when we lose
+ * contact with a standby, for the sake of simplicity.
+ */
+ CausalReadsBeginStall(causal_reads_last_lease);
+ }
+ else if (causal_reads_state == WALSNDCRSTATE_JOINING)
+ {
+ /*
+ * Dropping a joining standby doesn't require a stall,
+ * because the standby doesn't think it's available, so
+ * it's already raising the error for causal_reads
+ * transactions.
+ */
+ causal_reads_state = WALSNDCRSTATE_UNAVAILABLE;
+ causal_reads_state_changed = true;
+ }
+ }
+ }
SpinLockAcquire(&walsnd->mutex);
walsnd->write = writePtr;
@@ -1623,11 +1775,33 @@ ProcessStandbyReplyMessage(void)
walsnd->flushLagUs = flushLagUs;
if (applyLagUs >= 0)
walsnd->applyLagUs = applyLagUs;
+ walsnd->causal_reads_state = causal_reads_state;
SpinLockRelease(&walsnd->mutex);
+
+ if (causal_reads_set_joining_until)
+ {
+ /*
+ * Record the end of the primary's WAL at some arbitrary point
+ * observed _after_ we moved to joining state (so that causal
+ * reads commits start waiting, closing a race). The standby
+ * won't become available until it has replayed up to here.
+ */
+ causal_reads_joining_until = GetFlushRecPtr();
+ }
+
+ if (causal_reads_state_changed)
+ {
+ WalSndKeepalive(true);
+ elog(LOG, "standby \"%s\" is %s", application_name,
+ causal_reads_state == WALSNDCRSTATE_UNAVAILABLE ? "unavailable for causal reads" :
+ causal_reads_state == WALSNDCRSTATE_JOINING ? "joining as a causal reads standby..." :
+ causal_reads_state == WALSNDCRSTATE_AVAILABLE ? "available for causal reads" :
+ "UNKNOWN");
+ }
}
if (!am_cascading_walsender)
- SyncRepReleaseWaiters();
+ SyncRepReleaseWaiters(MyWalSnd->causal_reads_state >= WALSNDCRSTATE_JOINING);
/*
* Advance our local xmin horizon when the client confirmed a flush.
@@ -1768,33 +1942,53 @@ ProcessStandbyHSFeedbackMessage(void)
* If wal_sender_timeout is enabled we want to wake up in time to send
* keepalives and to abort the connection if wal_sender_timeout has been
* reached.
+ *
+ * But if causal_reads_timeout is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
*/
static long
WalSndComputeSleeptime(TimestampTz now)
{
long sleeptime = 10000; /* 10 s */
- if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+ if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+ am_potential_causal_reads_standby)
{
TimestampTz wakeup_time;
long sec_to_timeout;
int microsec_to_timeout;
- /*
- * At the latest stop sleeping once wal_sender_timeout has been
- * reached.
- */
- wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
-
- /*
- * If no ping has been sent yet, wakeup when it's time to do so.
- * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
- * the timeout passed without a response.
- */
- if (!waiting_for_ping_response)
+ if (am_potential_causal_reads_standby)
+ {
+ /*
+ * Leases last for a period of between 50% and 100% of
+ * causal_reads_timeout, depending on clock skew, assuming clock
+ * skew is under 25% of causal_reads_timeout. We send new
+ * leases every half a lease, so that there are no gaps between
+ * leases.
+ */
+ wakeup_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ causal_reads_timeout /
+ CAUSAL_READS_KEEPALIVE_RATIO);
+ }
+ else
+ {
+ /*
+ * At the latest stop sleeping once wal_sender_timeout has been
+ * reached.
+ */
wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ wal_sender_timeout);
+
+ /*
+ * If no ping has been sent yet, wakeup when it's time to do so.
+ * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+ * half of the timeout passed without a response.
+ */
+ if (!waiting_for_ping_response)
+ wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
+ }
/* Compute relative time until wakeup. */
TimestampDifference(now, wakeup_time,
@@ -1810,20 +2004,33 @@ WalSndComputeSleeptime(TimestampTz now)
/*
* Check whether there have been responses by the client within
* wal_sender_timeout and shutdown if not.
+ *
+ * If causal_reads_timeout is configured we override that, so that
+ * unresponsive standbys are detected sooner.
*/
static void
WalSndCheckTimeOut(TimestampTz now)
{
TimestampTz timeout;
+ int allowed_time;
/* don't bail out if we're doing something that doesn't require timeouts */
if (last_reply_timestamp <= 0)
return;
- timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
+ /*
+ * If a causal_reads_timeout is configured, it is used instead of
+ * wal_sender_timeout, to limit the time before an unresponsive causal
+ * reads standby is dropped.
+ */
+ if (am_potential_causal_reads_standby)
+ allowed_time = causal_reads_timeout;
+ else
+ allowed_time = wal_sender_timeout;
- if (wal_sender_timeout > 0 && now >= timeout)
+ timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ allowed_time);
+ if (allowed_time > 0 && now >= timeout)
{
/*
* Since typically expiration of replication timeout means
@@ -1859,6 +2066,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
/* Report to pgstat that this process is a WAL sender */
pgstat_report_activity(STATE_RUNNING, "walsender");
+ /* Check if we are managing potential causal_reads standby. */
+ am_potential_causal_reads_standby = CausalReadsPotentialStandby();
+
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
@@ -2023,6 +2233,7 @@ InitWalSenderSlot(void)
walsnd->flushLagUs = -1;
walsnd->applyLagUs = -1;
walsnd->state = WALSNDSTATE_STARTUP;
+ walsnd->causal_reads_state = WALSNDCRSTATE_UNAVAILABLE;
walsnd->latch = &MyProc->procLatch;
SpinLockRelease(&walsnd->mutex);
/* don't need the lock anymore */
@@ -2796,6 +3007,25 @@ WalSndGetStateString(WalSndState state)
return "UNKNOWN";
}
+/*
+ * Return a string constant representing the causal reads state. This is used
+ * in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetCausalReadsStateString(WalSndCausalReadsState causal_reads_state)
+{
+ switch (causal_reads_state)
+ {
+ case WALSNDCRSTATE_UNAVAILABLE:
+ return "unavailable";
+ case WALSNDCRSTATE_JOINING:
+ return "joining";
+ case WALSNDCRSTATE_AVAILABLE:
+ return "available";
+ }
+ return "UNKNOWN";
+}
+
static Interval *
lag_as_interval(uint64 lag_us)
{
@@ -2819,7 +3049,7 @@ lag_as_interval(uint64 lag_us)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 11
+#define PG_STAT_GET_WAL_SENDERS_COLS 12
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -2872,6 +3102,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
int64 applyLagUs;
int priority;
WalSndState state;
+ WalSndCausalReadsState causalReadsState;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -2881,6 +3112,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
SpinLockAcquire(&walsnd->mutex);
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ causalReadsState = walsnd->causal_reads_state;
write = walsnd->write;
flush = walsnd->flush;
apply = walsnd->apply;
@@ -2963,6 +3195,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[10] = CStringGetTextDatum("potential");
+
+ values[11] =
+ CStringGetTextDatum(WalSndGetCausalReadsStateString(causalReadsState));
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -2982,14 +3217,52 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
static void
WalSndKeepalive(bool requestReply)
{
+ TimestampTz now;
+ TimestampTz causal_reads_lease;
+
elog(DEBUG2, "sending replication keepalive");
+ /*
+ * If the walsender currently deems the standby to be available for causal
+ * reads, then it grants a causal reads lease. The lease authorizes the
+ * standby to consider itself available for causal reads until a short
+ * time in the future. The primary promises to uphold the causal reads
+ * guarantee until that time, by stalling commits until the lease has
+ * expired if necessary.
+ */
+ now = GetCurrentTimestamp();
+ if (MyWalSnd->causal_reads_state < WALSNDCRSTATE_AVAILABLE)
+ causal_reads_lease = 0; /* Not available, no lease granted. */
+ else
+ {
+ /*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew. First, we decide on a maximum tolerable
+ * difference between system clocks. If the primary's clock is ahead
+ * of the standby's by more than this, then all bets are off (the
+ * standby could falsely believe it has a valid lease). If the
+ * primary's clock is behind the standby's by more than this, then the
+ * standby will err the other way and generate spurious errors in
+ * causal_reads mode. Rather than having a separate GUC for this, we
+ * derive it from causal_reads_timeout.
+ */
+ int max_clock_skew = causal_reads_timeout / CAUSAL_READS_CLOCK_SKEW_RATIO;
+
+ /* Compute and remember the expiry time of the lease we're granting. */
+ causal_reads_last_lease = TimestampTzPlusMilliseconds(now, causal_reads_timeout);
+ /* The version we'll send to the standby is adjusted to tolerate clock skew. */
+ causal_reads_lease =
+ TimestampTzPlusMilliseconds(causal_reads_last_lease, -max_clock_skew);
+ }
+
/* construct the message... */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'k');
pq_sendint64(&output_message, sentPtr);
- pq_sendint64(&output_message, GetCurrentIntegerTimestamp());
+ pq_sendint64(&output_message, TimestampTzToIntegerTimestamp(now));
pq_sendbyte(&output_message, requestReply ? 1 : 0);
+ pq_sendint64(&output_message, TimestampTzToIntegerTimestamp(causal_reads_lease));
/* ... and send it wrapped in CopyData */
pq_putmessage_noblock('d', output_message.data, output_message.len);
@@ -3007,23 +3280,35 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
*/
- if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
- return;
-
- if (waiting_for_ping_response)
- return;
+ if (!am_potential_causal_reads_standby)
+ {
+ if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+ return;
+ if (waiting_for_ping_response)
+ return;
+ }
/*
* If half of wal_sender_timeout has lapsed without receiving any reply
* from the standby, send a keep-alive message to the standby requesting
* an immediate reply.
+ *
+ * If causal_reads_timeout has been configured, use it to control
+ * keepalive intervals rather than wal_sender_timeout, so that we can keep
+ * replacing leases at the right frequency.
*/
- ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ if (am_potential_causal_reads_standby)
+ ping_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ causal_reads_timeout /
+ CAUSAL_READS_KEEPALIVE_RATIO);
+ else
+ ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
if (now >= ping_time)
{
WalSndKeepalive(true);
waiting_for_ping_response = true;
+ last_keepalive_timestamp = now;
/* Try to flush pending output to the client */
if (pq_flush_if_writable() != 0)
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index e7bdb92..8f6331f 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -306,6 +306,7 @@ Section: Class 40 - Transaction Rollback
40001 E ERRCODE_T_R_SERIALIZATION_FAILURE serialization_failure
40003 E ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN statement_completion_unknown
40P01 E ERRCODE_T_R_DEADLOCK_DETECTED deadlock_detected
+40P02 E ERRCODE_T_R_CAUSAL_READS_NOT_AVAILABLE causal_reads_not_available
Section: Class 42 - Syntax Error or Access Rule Violation
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1adb598..9c206b9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1634,6 +1634,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"causal_reads", PGC_USERSET, REPLICATION_STANDBY,
+ gettext_noop("Enables causal reads."),
+ NULL
+ },
+ &causal_reads,
+ false,
+ NULL, NULL, NULL
+ },
+
+ {
{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
NULL
@@ -1822,6 +1832,17 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"causal_reads_timeout", PGC_SIGHUP, REPLICATION_STANDBY,
+ gettext_noop("Sets the maximum apply lag before causal reads standbys are no longer available."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &causal_reads_timeout,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
{"max_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS,
gettext_noop("Sets the maximum number of concurrent connections."),
NULL
@@ -3504,7 +3525,18 @@ static struct config_string ConfigureNamesString[] =
},
&SyncRepStandbyNames,
"",
- check_synchronous_standby_names, assign_synchronous_standby_names, NULL
+ check_standby_names, assign_synchronous_standby_names, NULL
+ },
+
+ {
+ {"causal_reads_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("List of names of potential causal reads standbys."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &causal_reads_standby_names,
+ "*",
+ check_standby_names, NULL, NULL
},
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f703e25..c799cb7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -250,6 +250,15 @@
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
+#causal_reads_timeout = 0s # maximum replication delay to tolerate from
+ # standbys before dropping them from the set of
+ # available causal reads peers; 0 to disable
+ # causal reads
+
+#causal_reads_standby_names = '*'
+ # standby servers that can potentially become
+ # available for causal reads; '*' = all
+
# - Standby Servers -
# These settings are ignored on a master server.
@@ -274,6 +283,14 @@
#replication_lag_sample_interval = 1s # min time between timestamps recorded
# to estimate lag; -1 disables lag sampling
+# - All Servers -
+
+#causal_reads = off # "on" in any pair of consecutive
+ # transactions guarantees that the second
+ # can see the first (even if the second
+ # is run on a standby), or will raise an
+ # error to report that the standby is
+ # unavailable for causal reads
#------------------------------------------------------------------------------
# QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6cf3829..565af92 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -328,6 +330,16 @@ GetTransactionSnapshot(void)
"cannot take query snapshot during a parallel operation");
/*
+ * In causal_reads mode on a standby, check if we have definitely
+ * applied WAL for any COMMIT that returned successfully on the
+ * primary.
+ */
+ if (causal_reads && RecoveryInProgress() && !WalRcvCausalReadsAvailable())
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_CAUSAL_READS_NOT_AVAILABLE),
+ errmsg("standby is not available for causal reads")));
+
+ /*
* In transaction-snapshot mode, the first snapshot must live until
* end of xact regardless of what the caller does with it, so we must
* make a copy of it rather than returning CurrentSnapshotData
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ee11cf5..74ac404 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -246,7 +246,7 @@ extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
extern XLogRecPtr GetXLogInsertRecPtr(void);
extern XLogRecPtr GetXLogWriteRecPtr(void);
-extern void SetXLogTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn);
+extern bool SetXLogTimestampAtLsn(TimestampTz timestamp, XLogRecPtr lsn);
extern bool CheckForWrittenTimestampedLsn(XLogRecPtr lsn,
TimestampTz *timestamp);
extern bool CheckForFlushedTimestampedLsn(XLogRecPtr lsn,
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 80267b4..a14bd50 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2768,7 +2768,7 @@ DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f
DESCR("statistics: information about currently active backends");
DATA(insert OID = 3318 ( pg_stat_get_progress_info PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_location,write_location,flush_location,replay_location,write_lag,flush_lag,replay_lag,sync_priority,sync_state,causal_reads_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 3317 ( pg_stat_get_wal_receiver PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
DESCR("statistics: information about WAL receiver");
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 282f8ae..05d5b08 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -778,6 +778,8 @@ typedef enum
{
WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_STARTUP,
+ WAIT_EVENT_CAUSAL_READS_APPLY,
+ WAIT_EVENT_CAUSAL_READS_REVOKE,
WAIT_EVENT_EXECUTE_GATHER,
WAIT_EVENT_MQ_INTERNAL,
WAIT_EVENT_MQ_PUT_MESSAGE,
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 9614b31..99ef4ca 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "utils/guc.h"
+#include "utils/timestamp.h"
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
#define SYNC_REP_WAIT_WRITE 0
#define SYNC_REP_WAIT_FLUSH 1
#define SYNC_REP_WAIT_APPLY 2
+#define SYNC_REP_WAIT_CAUSAL_READS 3
-#define NUM_SYNC_REP_WAIT_MODE 3
+#define NUM_SYNC_REP_WAIT_MODE 4
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
@@ -37,6 +39,24 @@
#define SYNC_REP_QUORUM 1
/*
+ * ratio of causal_reads_timeout to max_clock_skew (4 means that the maximum
+ * tolerated clock difference between primary and standbys using causal_reads
+ * is 1/4 of causal_reads_timeout)
+ */
+#define CAUSAL_READS_CLOCK_SKEW_RATIO 4
+
+/*
+ * ratio of causal_reads_timeout to keepalive time (2 means that the effective
+ * keepalive time is 1/2 of the causal_reads_timeout GUC when it is non-zero)
+ */
+#define CAUSAL_READS_KEEPALIVE_RATIO 2
+
+/* GUC variables */
+extern int causal_reads_timeout;
+extern bool causal_reads;
+extern char *causal_reads_standby_names;
+
+/*
* Struct for the configuration of synchronous replication.
*
* Note: this must be a flat representation that can be held in a single
@@ -71,7 +91,7 @@ extern void SyncRepCleanupAtProcExit(void);
/* called by wal sender */
extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
/* called by wal sender and user backend */
extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +99,15 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
/* called by checkpointer */
extern void SyncRepUpdateSyncStandbysDefined(void);
+/* called by user backend (xact.c) */
+extern void CausalReadsWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* called by wal sender */
+extern void CausalReadsBeginStall(TimestampTz lease_expiry_time);
+extern bool CausalReadsPotentialStandby(void);
+
/* GUC infrastructure */
-extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 41b248f..0c828e1 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -81,6 +81,13 @@ typedef struct
TimeLineID receivedTLI;
/*
+ * causalReadsLease is the time until which the primary has authorized
+ * this standby to consider itself available for causal_reads mode, or 0
+ * for not authorized.
+ */
+ TimestampTz causalReadsLease;
+
+ /*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
* receivedUpto before the last flush to disk. Startup process can use
@@ -214,4 +221,6 @@ extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(bool sendApplyTimestamp);
+extern bool WalRcvCausalReadsAvailable(void);
+
#endif /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index fb3a03f..13fd294 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -27,6 +27,13 @@ typedef enum WalSndState
WALSNDSTATE_STREAMING
} WalSndState;
+typedef enum WalSndCausalReadsState
+{
+ WALSNDCRSTATE_UNAVAILABLE = 0,
+ WALSNDCRSTATE_JOINING,
+ WALSNDCRSTATE_AVAILABLE
+} WalSndCausalReadsState;
+
/*
* Each walsender has a WalSnd struct in shared memory.
*/
@@ -34,6 +41,7 @@ typedef struct WalSnd
{
pid_t pid; /* this walsender's process id, or 0 */
WalSndState state; /* this walsender's state */
+ WalSndCausalReadsState causal_reads_state; /* the walsender's causal reads state */
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
bool needreload; /* does currently-open file need to be
* reloaded? */
@@ -91,6 +99,12 @@ typedef struct
*/
bool sync_standbys_defined;
+ /*
+ * Until when must commits in causal_reads stall? This is used to wait
+ * for causal reads leases to expire.
+ */
+ TimestampTz stall_causal_reads_until;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
Hi
I'm wondering about the status of this patch, and how can I try it out?
On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
The replay lag tracking patch this depends on is in the current commitfest
I assume you're talking about this patch [1] https://commitfest.postgresql.org/12/920/ (at least it's the only thread
where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
returned with feedback, so what's the status of this one (`causal reads`)?
First apply replay-lag-v16.patch, then refactor-syncrep-exit-v16.patch, then
causal-reads-v16.patch.
It would be nice to have all three of them attached (for some reason I see only
the last two of them in this thread). But anyway there are a lot of failed hunks
when I'm trying to apply `replay-lag-v16.patch` and
`refactor-syncrep-exit-v16.patch`,
`causal-reads-v16.patch` (or the last two of them separately).
On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I'm wondering about the status of this patch, and how can I try it out?
Hi Dmitry, thanks for your interest.
On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
The replay lag tracking patch this depends on is in the current commitfest
I assume you're talking about this patch [1] (at least it's the only thread
where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
returned with feedback, so what's the status of this one (`causal reads`)?
Right, replay lag tracking was committed. I'll post a rebased causal
reads patch later today.
--
Thomas Munro
http://www.enterprisedb.com
On Mon, May 22, 2017 at 6:32 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I'm wondering about the status of this patch, and how can I try it out?
Hi Dmitry, thanks for your interest.
On 3 January 2017 at 02:43, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
The replay lag tracking patch this depends on is in the current commitfest
I assume you're talking about this patch [1] (at least it's the only thread
where I could find a `replay-lag-v16.patch`)? But `replay lag tracking` was
returned with feedback, so what's the status of this one (`causal reads`)?
Right, replay lag tracking was committed. I'll post a rebased causal
reads patch later today.
I ran into a problem while doing this, and it may take a couple more
days to fix it since I am at pgcon this week. More soon.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, May 24, 2017 at 3:58 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I'm wondering about the status of this patch, and how can I try it out?
I ran into a problem while doing this, and it may take a couple more
days to fix it since I am at pgcon this week. More soon.
Apologies for the extended delay. Here is the rebased patch, now with
a couple of improvements (see below). To recap, this is the third
part of the original patch series[1], which had these components:
1. synchronous_commit = remote_apply, committed in PostgreSQL 9.6
2. replication lag tracking, committed in PostgreSQL 10
3. causal_reads, the remaining part, hereby proposed for PostgreSQL 11
The goal is to allow applications to move arbitrary read-only
transactions to physical replica databases and still know that they
can see all preceding write transactions or get an error. It's
something like regular synchronous replication with synchronous_commit
= remote_apply, except that it limits the impact on the primary and
handles failure transitions with defined semantics.
The inspiration for this kind of distributed read-follows-write
consistency using read leases was a system called Comdb2[2][3], whose
designer encouraged me to try to extend Postgres's streaming
replication to do something similar. Read leases can also be found in
some consensus systems like Google Megastore, albeit in more ambitious
form IIUC. The name is inspired by a MySQL Galera feature
(approximately the same feature but the approach is completely
different; Galera adds read latency, whereas this patch does not).
Maybe it needs a better name.
Is this a feature that people want to see in PostgreSQL?
IMPROVEMENTS IN V17
The GUC to enable the feature is now called
"causal_reads_max_replay_lag". Standbys listed in
causal_reads_standby_names whose pg_stat_replication.replay_lag
doesn't exceed that time are "available" for causal reads and will be
waited for by the primary when committing. When they exceed that
threshold they are briefly in "revoking" state and then "unavailable",
and when they return to an acceptable level they are briefly in
"joining" state before reaching "available". CR states appear in
pg_stat_replication and transitions are logged at LOG level.
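For example, the transitions can be watched from any client with a query like
this (just a sketch; causal_reads_state is the pg_stat_replication column added
by this patch, the other columns already exist):

  SELECT application_name, state, replay_lag, causal_reads_state
  FROM pg_stat_replication;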
A new GUC called "causal_reads_lease_time" controls the lifetime of
read leases sent from the primary to the standby. This affects the
frequency of lease replacement messages, and more importantly affects
the worst case of commit stall that can be introduced if connectivity
to a standby is lost and we have to wait for the last sent lease to
expire. In the previous version, one single GUC controlled both
maximum tolerated replay lag and lease lifetime, which was good from
the point of view that fewer GUCs are better, but bad because it had
to be set fairly high when doing both jobs to be conservative about
clock skew. The lease lifetime must be at least 4 x maximum tolerable
clock skew. After the recent botching of a leap-second transition on
a popular public NTP network (TL;DR OpenNTP is not a good choice of
implementation to add to a public time server pool) I came to the
conclusion that I wouldn't want to recommend a default max clock skew
under 1.25s, to allow for some servers to be confused about leap
seconds for a while or to be running different smearing algorithms. A
reasonable causal_reads_lease_time recommendation for people who don't
know much about the quality of their time source might therefore be
5s. I think it's reasonable to want to set the maximum tolerable
replay lag to lower time than that, or in fact as low as you like,
depending on your workload and hardware. Therefore I decided to split
the old "causal_reads_timeout" GUC into "causal_reads_max_replay_lag"
and "causal_reads_lease_time".
This new version introduces fast lease revocation. Whenever the
primary decides that a standby is not keeping up, it kicks it out of
the set of CR-available standbys and revokes its lease, so that anyone
trying to run causal reads transactions there will start receiving a
new error. In the previous version, it always did that by blocking
commits while waiting for the most recently sent lease to expire,
which I now call "slow revocation" because it could take several
seconds. Now it blocks commits only until the standby acknowledges
that it is no longer available for causal reads OR the lease expires:
ideally that takes only the time of a network round trip. Slow
revocation is still needed in various failure cases such as lost
connectivity.
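Commits that are held up this way should show up in pg_stat_activity under the
wait events added by this patch (CausalReadsApply, CausalReadsRevoke), which
makes stalls easy to spot, e.g.:

  SELECT pid, wait_event_type, wait_event
  FROM pg_stat_activity
  WHERE wait_event IN ('CausalReadsApply', 'CausalReadsRevoke');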
TESTING
Apply the patch after first applying a small bug fix for replication
lag tracking[4]. Then:
1. Set up some streaming replicas.
2. Stick causal_reads_max_replay_lag = 2s (or any time you like) in
the primary's postgresql.conf.
3. Set causal_reads = on in some transactions on various nodes.
4. Try to break it!
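As a minimal sketch of steps 3 and 4 (the table is made up; the error text is
the one raised by this patch):

  -- on the primary
  SET causal_reads = on;
  CREATE TABLE t (x int);
  INSERT INTO t VALUES (1);   -- each commit waits for the available standbys to apply it

  -- then on any standby
  SET causal_reads = on;
  SELECT x FROM t;            -- sees the row, or fails with
                              -- ERROR:  standby is not available for causal reads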
As long as your system clocks don't disagree by more than 1.25s
(causal_reads_lease_time / 4), the causal reads guarantee will be
upheld: standbys will either see transactions that have completed on
the primary or raise an error to indicate that they are not available
for causal reads transactions. You should not be able to break this
guarantee, no matter what you do: unplug the network, kill arbitrary
processes, etc.
If you mess with your system clocks so they differ by more than
causal_reads_lease_time / 4, you should see that a reasonable effort
is made to detect that so it's still very unlikely you can break it
(you'd need clocks to differ by more than causal_reads_lease_time / 4
but less than causal_reads_lease_time / 4 + network latency so that
the excessive skew is not detected, and then you'd need a very well
timed pair of transactions and loss of connectivity).
[1]: /messages/by-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2]: https://github.com/bloomberg/comdb2
[3]: http://www.vldb.org/pvldb/vol9/p1377-scotti.pdf
[4]: /messages/by-id/CAEepm=3tJX_0kSeDi8OYTMp8NogrqPxgP1+2uzsdePz9i0-V0Q@mail.gmail.com
Attachments:
causal-reads-v17.patch
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3aca6479b1f..1845f6250c6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2904,6 +2904,35 @@ include_dir 'conf.d'
across the cluster without problems if that is required.
</para>
+ <sect2 id="runtime-config-replication-all">
+ <title>All Servers</title>
+ <para>
+ These parameters can be set on the primary or any standby.
+ </para>
+ <variablelist>
+ <varlistentry id="guc-causal-reads" xreflabel="causal_reads">
+ <term><varname>causal_reads</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>causal_reads</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables causal consistency between transactions run on different
+ servers. A transaction that is run on a standby
+ with <varname>causal_reads</> set to <literal>on</> is guaranteed
+ either to see the effects of all completed transactions run on the
+ primary with the setting on, or to receive an error "standby is not
+ available for causal reads". Note that both transactions involved in
+ a causal dependency (a write on the primary followed by a read on any
+ server which must see the write) must be run with the setting on.
+ See <xref linkend="causal-reads"> for more details.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-replication-sender">
<title>Sending Server(s)</title>
@@ -3205,6 +3234,65 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><varname>causal_reads_max_replay_lag</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>causal_reads_max_replay_lag</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the maximum replay lag the primary will tolerate from a
+ standby before dropping it from the set of standbys available for
+ causal reads.
+ </para>
+ <para>
+ Note that it is <varname>causal_reads_lease_time</>, described below,
+ that must be set to at least 4 times the maximum possible difference in
+ system clocks between the primary and standby servers, as described in
+ <xref linkend="causal-reads">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><varname>causal_reads_lease_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>causal_reads_lease_time</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the duration of 'leases' sent by the primary server to
+ standbys granting them the right to run causal reads queries for a
+ limited time. This affects the rate at which replacement leases must
+ be sent and the wait time if contact is lost with a primary, as
+ described in <xref linkend="causal-reads">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-causal-reads-standby-names" xreflabel="causal-reads-standby-names">
+ <term><varname>causal_reads_standby_names</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>causal_reads_standby_names</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a comma-separated list of standby names that can support
+ <firstterm>causal reads</>, as described in
+ <xref linkend="causal-reads">. Follows the same convention
+ as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</></>.
+ The default is <literal>*</>, matching all standbys.
+ </para>
+ <para>
+ This setting has no effect if <varname>causal_reads_max_replay_lag</> is not set.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 72eb073621f..ff2f14a5c38 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1115,7 +1115,7 @@ primary_slot_name = 'node_a_slot'
cause each commit to wait until the current synchronous standbys report
that they have replayed the transaction, making it visible to user
queries. In simple cases, this allows for load balancing with causal
- consistency.
+ consistency. See also <xref linkend="causal-reads">.
</para>
<para>
@@ -1313,6 +1313,119 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="causal-reads">
+ <title>Causal reads</title>
+ <indexterm>
+ <primary>causal reads</primary>
+ <secondary>in standby</secondary>
+ </indexterm>
+
+ <para>
+ The causal reads feature allows read-only queries to run on hot standby
+ servers without exposing stale data to the client, providing a form of
+ causal consistency. Transactions can run on any standby with the
+ following guarantee about the visibility of preceding transactions: If you
+ set <varname>causal_reads</> to <literal>on</> in any pair of consecutive
+ transactions tx1, tx2 where tx2 begins after tx1 successfully returns,
+ then tx2 will either see tx1 or fail with a new error "standby is not
+ available for causal reads", no matter which server it runs on. Although
+ the guarantee is expressed in terms of two individual transactions, the
+ GUC can also be set at session, role or system level to make the guarantee
+ generally, allowing for load balancing of applications that were not
+ designed with load balancing in mind.
+ </para>
+
+ <para>
+ In order to enable the feature, <varname>causal_reads_max_replay_lag</>
+ must be set to a non-zero value on the primary server. The
+ GUC <varname>causal_reads_standby_names</> can be used to limit the set of
+ standbys that can join the dynamic set of causal reads standbys by
+ providing a comma-separated list of application names. By default, all
+ standbys are candidates, if the feature is enabled.
+ </para>
+
+ <para>
+ The current set of servers that the primary considers to be available for
+ causal reads can be seen in
+ the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</></>
+ view. Administrators, applications and load balancing middleware can use
+ this view to discover standbys that can currently handle causal reads
+ transactions without raising the error. Since that information is only an
+ instantaneous snapshot, clients should still be prepared for the error
+ to be raised at any time, and consider redirecting transactions to another
+ standby.
+ </para>
+
+ <para>
+ The advantages of the causal reads feature over simply
+ setting <varname>synchronous_commit</> to <literal>remote_apply</> are:
+ <orderedlist>
+ <listitem>
+ <para>
+ It provides certainty about exactly which standbys can see a
+ transaction.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It places a configurable limit on how much replay lag (and therefore
+ delay at commit time) the primary tolerates from standbys before it
+ drops them from the dynamic set of standbys it waits for.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It upholds the causal reads guarantee during the transitions that
+ occur when new standbys are added or removed from the set of standbys,
+ including scenarios where contact has been lost between the primary
+ and standbys but the standby is still alive and running client
+ queries.
+ </para>
+ </listitem>
+ </orderedlist>
+ </para>
+
+ <para>
+ The protocol used to uphold the guarantee even in the case of network
+ failure depends on the system clocks of the primary and standby servers
+ being synchronized, with an allowance for a difference up to one quarter
+ of <varname>causal_reads_lease_time</>. For example,
+ if <varname>causal_reads_lease_time</> is set to <literal>5s</>, then the
+ clocks must not be more than 1.25 seconds apart for the guarantee to be
+ upheld reliably during transitions. The ubiquity of the Network Time
+ Protocol (NTP) on modern operating systems and availability of high
+ quality time servers makes it possible to choose a tolerance significantly
+ higher than the maximum expected clock difference. An effort is
+ nevertheless made to detect and report misconfigured and faulty systems
+ with clock differences greater than the configured tolerance.
+ </para>
+
+ <note>
+ <para>
+ Current hardware clocks, NTP implementations and public time servers are
+ unlikely to allow the system clocks to differ by more than tens or hundreds
+ of milliseconds, and systems synchronized with dedicated local time
+ servers may be considerably more accurate, but you should only consider
+ setting <varname>causal_reads_lease_time</> below the default of 5
+ seconds (allowing up to 1.25 seconds of clock difference) after
+ researching your time synchronization infrastructure thoroughly.
+ </para>
+ </note>
+
+ <note>
+ <para>
+ While similar to synchronous replication in the sense that both involve
+ the primary server waiting for responses from standby servers, the
+ causal reads feature is not concerned with avoiding data loss. A
+ primary configured for causal reads will drop all standbys that stop
+ responding or replay too slowly from the dynamic set that it waits for,
+ so you should consider configuring both synchronous replication and
+ causal reads if you need data loss avoidance guarantees and causal
+ consistency guarantees for load balancing.
+ </para>
+ </note>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous archiving in standby</title>
@@ -1661,7 +1774,16 @@ if (!triggered)
so there will be a measurable delay between primary and standby. Running the
same query nearly simultaneously on both primary and standby might therefore
return differing results. We say that data on the standby is
- <firstterm>eventually consistent</firstterm> with the primary. Once the
+ <firstterm>eventually consistent</firstterm> with the primary by default.
+ The data visible to a transaction running on a standby can be
+ made <firstterm>causally consistent</> with respect to a transaction that
+ has completed on the primary by setting <varname>causal_reads</>
+ to <literal>on</> in both transactions. For more details,
+ see <xref linkend="causal-reads">.
+ </para>
+
+ <para>
+ Once the
commit record for a transaction is replayed on the standby, the changes
made by that transaction will be visible to any new snapshots taken on
the standby. Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index be3dc672bcc..515064e8764 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1790,6 +1790,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</itemizedlist>
</entry>
</row>
+ <row>
+ <entry><structfield>causal_reads_state</></entry>
+ <entry><type>text</></entry>
+ <entry>Causal reads state of this standby server. This field will be
+ non-null only if <varname>causal_reads_max_replay_lag</> is set. If a standby is
+ in <literal>available</> state, then it can currently serve causal reads
+ queries. If it is not replaying fast enough or not responding to
+ keepalive messages, it will be in <literal>unavailable</> state, and if
+ it is currently transitioning to availability it will be
+ in <literal>joining</> state for a short time.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ba03d9687e5..1440b399bda 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2234,11 +2234,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for causal reads and synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
+ CausalReadsWaitForLSN(recptr);
SyncRepWaitForLSN(recptr, true);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b0aa69fe4b4..5aae6908647 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1342,7 +1342,10 @@ RecordTransactionCommit(void)
* in the procarray and continue to hold locks.
*/
if (wrote_xlog && markXidCommitted)
+ {
+ CausalReadsWaitForLSN(XactLastRecEnd);
SyncRepWaitForLSN(XactLastRecEnd, true);
+ }
/* remember end of last commit record */
XactLastCommitEnd = XactLastRecEnd;
@@ -5149,7 +5152,7 @@ XactLogCommitRecord(TimestampTz commit_time,
* Check if the caller would like to ask standbys for immediate feedback
* once this commit is applied.
*/
- if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+ if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || causal_reads)
xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0fdad0c1197..f037f0fe349 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -732,7 +732,8 @@ CREATE VIEW pg_stat_replication AS
W.flush_lag,
W.replay_lag,
W.sync_priority,
- W.sync_state
+ W.sync_state,
+ W.causal_reads_state
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 65b7b328f1f..c9eb152892a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3576,6 +3576,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_BTREE_PAGE:
event_name = "BtreePage";
break;
+ case WAIT_EVENT_CAUSAL_READS_APPLY:
+ event_name = "CausalReadsApply";
+ break;
case WAIT_EVENT_EXECUTE_GATHER:
event_name = "ExecuteGather";
break;
@@ -3634,6 +3637,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
case WAIT_EVENT_BASE_BACKUP_THROTTLE:
event_name = "BaseBackupThrottle";
break;
+ case WAIT_EVENT_CAUSAL_READS_REVOKE:
+ event_name = "CausalReadsRevoke";
+ break;
case WAIT_EVENT_PG_SLEEP:
event_name = "PgSleep";
break;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 898c497d12c..3eb79a0fd2b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1295,6 +1295,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
pq_sendint64(reply_message, writepos); /* apply */
pq_sendint64(reply_message, now); /* sendTime */
pq_sendbyte(reply_message, requestReply); /* replyRequested */
+ pq_sendint64(reply_message, -1); /* replyTo */
elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 5fd47689dd2..25e56397eb0 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int causal_reads_max_replay_lag;
+int causal_reads_lease_time;
+bool causal_reads;
+char *causal_reads_standby_names;
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
-static int SyncRepWakeQueue(bool all, int mode);
+static int SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
XLogRecPtr *flushPtr,
@@ -129,6 +138,227 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
+ * Check if we can stop waiting for causal consistency. We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1. All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2. All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting causal reads
+ * transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for. The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting causal_reads transactions.
+ */
+static bool
+CausalReadsCommitCanReturn(XLogRecPtr XactCommitLSN,
+ int *waitingFor,
+ long *stallTimeMillis)
+{
+ TimestampTz now = GetCurrentTimestamp();
+ TimestampTz stallTime = 0;
+ int i;
+
+ /* Count how many joining/available nodes we are waiting for. */
+ *waitingFor = 0;
+
+ for (i = 0; i < max_wal_senders; ++i)
+ {
+ WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0)
+ {
+ /*
+ * We need to hold the spinlock to read LSNs, because we can't be
+ * sure they can be read atomically.
+ */
+ SpinLockAcquire(&walsnd->mutex);
+ if (walsnd->pid != 0)
+ {
+ switch (walsnd->causalReadsState)
+ {
+ case WALSNDCRSTATE_UNAVAILABLE:
+ /* Nothing to wait for. */
+ break;
+ case WALSNDCRSTATE_JOINING:
+ case WALSNDCRSTATE_AVAILABLE:
+ /*
+ * We have to wait until this standby tells us that it has
+ * replayed the commit record.
+ */
+ if (walsnd->apply < XactCommitLSN)
+ ++*waitingFor;
+ break;
+ case WALSNDCRSTATE_REVOKING:
+ /*
+ * We have to hold up commits until this standby
+ * acknowledges that its lease was revoked, or we know the
+ * most recently sent lease has expired anyway, whichever
+ * comes first. One way or the other, we don't release
+ * until this standby has started raising an error for
+ * causal reads transactions.
+ */
+ if (walsnd->revokingUntil > now)
+ {
+ ++*waitingFor;
+ stallTime = Max(stallTime, walsnd->revokingUntil);
+ }
+ break;
+ }
+ }
+ SpinLockRelease(&walsnd->mutex);
+ }
+ }
+
+ /*
+ * If a walsender has exited uncleanly, then it writes its revoking wait
+ * time into a shared space before it gives up its WalSnd slot. So we
+ * have to wait for that too.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ if (WalSndCtl->revokingUntil > now)
+ {
+ long seconds;
+ int usecs;
+
+ /* Compute how long we have to wait, rounded up to nearest ms. */
+ TimestampDifference(now, WalSndCtl->revokingUntil,
+ &seconds, &usecs);
+ *stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+ }
+ else
+ *stallTimeMillis = 0;
+ LWLockRelease(SyncRepLock);
+
+ /* We are done if we are not waiting for any nodes or stalls. */
+ return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for causal consistency in causal_reads mode, if requested by user.
+ */
+void
+CausalReadsWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ long stallTimeMillis;
+ int waitingFor;
+ char *ps_display_buffer = NULL;
+
+ /* Leave if we aren't in causal_reads mode. */
+ if (!causal_reads)
+ return;
+
+ for (;;)
+ {
+ /* Reset latch before checking state. */
+ ResetLatch(MyLatch);
+
+ /*
+ * Join the queue to be woken up if any causal reads joining/available
+ * standby applies XactCommitLSN or the set of causal reads standbys
+ * changes (if we aren't already in the queue). We don't actually know
+ * if we need to wait for any peers to reach the target LSN yet, but
+ * we have to register just in case before checking the walsenders'
+ * state to avoid a race condition that could occur if we did it after
+ * calling CausalReadsCommitCanReturn. (SyncRepWaitForLSN doesn't
+ * have to do this because it can check the highest-seen LSN in
+ * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+ * lock as the queues. We can't do that here, because there is no
+ * single highest-seen LSN that is useful. We must check
+ * walsnd->apply for all relevant walsenders. Therefore we must
+ * register for notifications first, so that we can be notified via
+ * our latch of any standby applying the LSN we're interested in after
+ * we check but before we start waiting, or we could wait forever for
+ * something that has already happened.)
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ if (MyProc->syncRepState != SYNC_REP_WAITING)
+ {
+ MyProc->waitLSN = XactCommitLSN;
+ MyProc->syncRepState = SYNC_REP_WAITING;
+ SyncRepQueueInsert(SYNC_REP_WAIT_CAUSAL_READS);
+ Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_CAUSAL_READS));
+ }
+ LWLockRelease(SyncRepLock);
+
+ /* Check if we're done. */
+ if (CausalReadsCommitCanReturn(XactCommitLSN, &waitingFor, &stallTimeMillis))
+ {
+ SyncRepCancelWait();
+ break;
+ }
+
+ Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+ /* If we aren't actually waiting for any standbys, leave the queue. */
+ if (waitingFor == 0)
+ SyncRepCancelWait();
+
+ /* Update the ps title. */
+ if (update_process_title)
+ {
+ char buffer[80];
+
+ /* Remember the old value if this is our first update. */
+ if (ps_display_buffer == NULL)
+ {
+ int len;
+ const char *ps_display = get_ps_display(&len);
+
+ ps_display_buffer = palloc(len + 1);
+ memcpy(ps_display_buffer, ps_display, len);
+ ps_display_buffer[len] = '\0';
+ }
+
+ snprintf(buffer, sizeof(buffer),
+ "waiting for %d peer(s) to apply %X/%X%s",
+ waitingFor,
+ (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+ stallTimeMillis > 0 ? " (revoking)" : "");
+ set_ps_display(buffer, false);
+ }
+
+ /* Check if we need to exit early due to postmaster death etc. */
+ if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+ break;
+
+ /*
+ * If we are still waiting for peers, then we wait for any joining or
+ * available peer to reach the LSN (or possibly stop being in one of
+ * those states or go away).
+ *
+ * If not, there must be a non-zero stall time, so we wait for that to
+ * elapse.
+ */
+ if (waitingFor > 0)
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+ WAIT_EVENT_CAUSAL_READS_APPLY);
+ else
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+ stallTimeMillis,
+ WAIT_EVENT_CAUSAL_READS_REVOKE);
+ }
+
+ /* There is no way out of the loop that could leave us in the queue. */
+ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+ MyProc->waitLSN = 0;
+
+ /* Restore the ps display. */
+ if (ps_display_buffer != NULL)
+ {
+ set_ps_display(ps_display_buffer, false);
+ pfree(ps_display_buffer);
+ }
+}
+
+/*
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
@@ -229,57 +459,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
break;
- /*
- * If a wait for synchronous replication is pending, we can neither
- * acknowledge the commit nor raise ERROR or FATAL. The latter would
- * lead the client to believe that the transaction aborted, which is
- * not true: it's already committed locally. The former is no good
- * either: the client has requested synchronous replication, and is
- * entitled to assume that an acknowledged commit is also replicated,
- * which might not be true. So in this case we issue a WARNING (which
- * some clients may be able to interpret) and shut off further output.
- * We do NOT reset ProcDiePending, so that the process will die after
- * the commit is cleaned up.
- */
- if (ProcDiePending)
- {
- ereport(WARNING,
- (errcode(ERRCODE_ADMIN_SHUTDOWN),
- errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
- break;
- }
-
- /*
- * It's unclear what to do if a query cancel interrupt arrives. We
- * can't actually abort at this point, but ignoring the interrupt
- * altogether is not helpful, so we just terminate the wait with a
- * suitable warning.
- */
- if (QueryCancelPending)
- {
- QueryCancelPending = false;
- ereport(WARNING,
- (errmsg("canceling wait for synchronous replication due to user request"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- SyncRepCancelWait();
- break;
- }
-
- /*
- * If the postmaster dies, we'll probably never get an
- * acknowledgement, because all the wal sender processes will exit. So
- * just bail out.
- */
- if (!PostmasterIsAlive())
- {
- ProcDiePending = true;
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
+ /* Check if we need to break early due to cancel/shutdown/death. */
+ if (SyncRepCheckForEarlyExit())
break;
- }
/*
* Wait on latch. Any condition that should wake us up will set the
@@ -399,6 +581,53 @@ SyncRepInitConfig(void)
}
/*
+ * Check if the current WALSender process's application_name matches a name in
+ * causal_reads_standby_names (including '*' for wildcard).
+ */
+bool
+CausalReadsPotentialStandby(void)
+{
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ bool found = false;
+
+ /* If the feature is disabled, then no. */
+ if (causal_reads_max_replay_lag == 0)
+ return false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(causal_reads_standby_names);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ /* GUC machinery will have already complained - no need to do again */
+ return false;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return found;
+}
+
+/*
* Update the LSNs on each queue based upon our latest state. This
* implements a simple policy of first-valid-sync-standby-releases-waiter.
*
@@ -406,7 +635,7 @@ SyncRepInitConfig(void)
* perhaps also which information we store as well.
*/
void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool walsender_cr_blocker)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
XLogRecPtr writePtr;
@@ -420,13 +649,15 @@ SyncRepReleaseWaiters(void)
/*
* If this WALSender is serving a standby that is not on the list of
- * potential sync standbys then we have nothing to do. If we are still
- * starting up, still running base backup or the current flush position is
- * still invalid, then leave quickly also.
+ * potential sync standbys and not in a state that causal_reads waits for,
+ * then we have nothing to do. If we are still starting up, still running
+ * base backup or the current flush position is still invalid, then leave
+ * quickly also.
*/
- if (MyWalSnd->sync_standby_priority == 0 ||
- MyWalSnd->state < WALSNDSTATE_STREAMING ||
- XLogRecPtrIsInvalid(MyWalSnd->flush))
+ if (!walsender_cr_blocker &&
+ (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING ||
+ XLogRecPtrIsInvalid(MyWalSnd->flush)))
{
announce_next_takeover = true;
return;
@@ -464,9 +695,10 @@ SyncRepReleaseWaiters(void)
/*
* If the number of sync standbys is less than requested or we aren't
- * managing a sync standby then just leave.
+ * managing a sync standby or a standby in causal reads blocking state,
+ * then just leave.
*/
- if (!got_recptr || !am_sync)
+ if ((!got_recptr || !am_sync) && !walsender_cr_blocker)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = !am_sync;
@@ -475,24 +707,35 @@ SyncRepReleaseWaiters(void)
/*
* Set the lsn first so that when we wake backends they will release up to
- * this location.
+ * this location, for backends waiting for synchronous commit.
*/
- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
- numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
- numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+ if (got_recptr && am_sync)
{
- walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
- numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+ if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+ numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+ numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+ numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+ }
}
+ /*
+ * Wake backends that are waiting for causal_reads, if this walsender
+ * manages a standby that is in causal reads 'available' or 'joining'
+ * state.
+ */
+ if (walsender_cr_blocker)
+ SyncRepWakeQueue(false, SYNC_REP_WAIT_CAUSAL_READS, MyWalSnd->apply);
+
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -970,9 +1213,8 @@ SyncRepGetStandbyPriority(void)
* Must hold SyncRepLock.
*/
static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
{
- volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
@@ -989,7 +1231,7 @@ SyncRepWakeQueue(bool all, int mode)
/*
* Assume the queue is ordered by LSN
*/
- if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+ if (!all && lsn < proc->waitLSN)
return numprocs;
/*
@@ -1049,7 +1291,7 @@ SyncRepUpdateSyncStandbysDefined(void)
int i;
for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
- SyncRepWakeQueue(true, i);
+ SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
}
/*
@@ -1100,6 +1342,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
}
#endif
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+ /*
+ * If a wait for synchronous replication is pending, we can neither
+ * acknowledge the commit nor raise ERROR or FATAL. The latter would
+ * lead the client to believe that the transaction aborted, which is
+ * not true: it's already committed locally. The former is no good
+ * either: the client has requested synchronous replication, and is
+ * entitled to assume that an acknowledged commit is also replicated,
+ * which might not be true. So in this case we issue a WARNING (which
+ * some clients may be able to interpret) and shut off further output.
+ * We do NOT reset ProcDiePending, so that the process will die after
+ * the commit is cleaned up.
+ */
+ if (ProcDiePending)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * It's unclear what to do if a query cancel interrupt arrives. We
+ * can't actually abort at this point, but ignoring the interrupt
+ * altogether is not helpful, so we just terminate the wait with a
+ * suitable warning.
+ */
+ if (QueryCancelPending)
+ {
+ QueryCancelPending = false;
+ ereport(WARNING,
+ (errmsg("canceling wait for synchronous replication due to user request"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * If the postmaster dies, we'll probably never get an
+ * acknowledgement, because all the wal sender processes will exit. So
+ * just bail out.
+ */
+ if (!PostmasterIsAlive())
+ {
+ ProcDiePending = true;
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ return false;
+}
+
/*
* ===========================================================
* Synchronous Replication functions executed by any process
@@ -1107,7 +1407,7 @@ SyncRepQueueIsOrderedByLSN(int mode)
*/
bool
-check_synchronous_standby_names(char **newval, void **extra, GucSource source)
+check_standby_names(char **newval, void **extra, GucSource source)
{
if (*newval != NULL && (*newval)[0] != '\0')
{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8a249e22b9f..9f1113470c7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/ipc.h"
@@ -139,9 +140,10 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *causalReadsLease);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -466,7 +468,7 @@ WalReceiverMain(void)
}
/* Let the master know that we received some data. */
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
/*
* If we've written some records, flush them to disk and
@@ -511,7 +513,7 @@ WalReceiverMain(void)
*/
walrcv->force_reply = false;
pg_memory_barrier();
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, -1);
}
}
if (rc & WL_POSTMASTER_DEATH)
@@ -569,7 +571,7 @@ WalReceiverMain(void)
}
}
- XLogWalRcvSendReply(requestReply, requestReply);
+ XLogWalRcvSendReply(requestReply, requestReply, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -874,6 +876,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
XLogRecPtr walEnd;
TimestampTz sendTime;
bool replyRequested;
+ TimestampTz causalReadsLease;
+ int64 messageNumber;
resetStringInfo(&incoming_message);
@@ -893,7 +897,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
dataStart = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, NULL);
buf += hdrlen;
len -= hdrlen;
@@ -903,7 +907,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
case 'k': /* Keepalive */
{
/* copy message to StringInfo */
- hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+ hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+ sizeof(char) + sizeof(int64);
if (len != hdrlen)
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -911,15 +916,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
appendBinaryStringInfo(&incoming_message, buf, hdrlen);
/* read the fields */
+ messageNumber = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
replyRequested = pq_getmsgbyte(&incoming_message);
+ causalReadsLease = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, &causalReadsLease);
/* If the primary requested a reply, send one immediately */
if (replyRequested)
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, messageNumber);
break;
}
default:
@@ -1082,7 +1089,7 @@ XLogWalRcvFlush(bool dying)
/* Also let the master know that we made some progress */
if (!dying)
{
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -1100,9 +1107,12 @@ XLogWalRcvFlush(bool dying)
* If 'requestReply' is true, requests the server to reply immediately upon
* receiving this message. This is used for heartbeats, when approaching
* wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should include the message number, otherwise -1.
*/
static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
{
static XLogRecPtr writePtr = 0;
static XLogRecPtr flushPtr = 0;
@@ -1149,6 +1159,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
pq_sendint64(&reply_message, applyPtr);
pq_sendint64(&reply_message, GetCurrentTimestamp());
pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+ pq_sendint64(&reply_message, replyTo);
/* Send it */
elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1281,10 +1292,13 @@ XLogWalRcvSendHSFeedback(bool immed)
* Update shared memory status upon receiving a message from primary.
*
* 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary. 'causalReadsLease' points to the time
+ * until which the primary promises that this standby can safely claim to
+ * be causally consistent, to 0 if it cannot, or is NULL for no change.
*/
static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *causalReadsLease)
{
WalRcvData *walrcv = WalRcv;
@@ -1297,6 +1311,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
walrcv->latestWalEnd = walEnd;
walrcv->lastMsgSendTime = sendTime;
walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+ if (causalReadsLease != NULL)
+ walrcv->causalReadsLease = *causalReadsLease;
SpinLockRelease(&walrcv->mutex);
if (log_min_messages <= DEBUG2)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 8ed7254b5c6..7d557234c47 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
#include "replication/walreceiver.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/guc.h"
#include "utils/timestamp.h"
WalRcvData *WalRcv = NULL;
@@ -373,3 +374,21 @@ GetReplicationTransferLatency(void)
return ms;
}
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for causal reads.
+ */
+bool
+WalRcvCausalReadsAvailable(void)
+{
+ WalRcvData *walrcv = WalRcv;
+ TimestampTz now = GetCurrentTimestamp();
+ bool result;
+
+ SpinLockAcquire(&walrcv->mutex);
+ result = walrcv->causalReadsLease != 0 && now <= walrcv->causalReadsLease;
+ SpinLockRelease(&walrcv->mutex);
+
+ return result;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f845180873e..a7d6ec5233d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -167,9 +167,23 @@ static StringInfoData tmpbuf;
*/
static TimestampTz last_reply_timestamp = 0;
+static TimestampTz last_keepalive_timestamp = 0;
+
/* Have we sent a heartbeat message asking for reply, since last reply? */
static bool waiting_for_ping_response = false;
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr causal_reads_joining_until = 0;
+
+/* The last causal reads lease sent to the standby. */
+static TimestampTz causal_reads_last_lease = 0;
+
+/* The last causal reads lease revocation message's number. */
+static int64 causal_reads_revoke_msgno = 0;
+
+/* Is this WALSender listed in causal_reads_standby_names? */
+static bool am_potential_causal_reads_standby = false;
+
/*
* While streaming WAL in Copy mode, streamingDoneSending is set to true
* after we have sent CopyDone. We should not send any more CopyData messages
@@ -239,7 +253,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
static void WalSndKeepaliveIfNecessary(TimestampTz now);
static void WalSndCheckTimeOut(TimestampTz now);
static long WalSndComputeSleeptime(TimestampTz now);
@@ -281,6 +295,60 @@ InitWalSender(void)
}
/*
+ * If we are exiting unexpectedly, we may need to communicate with concurrent
+ * causal_reads commits to maintain the causal consistency guarantee.
+ */
+static void
+PrepareUncleanExit(void)
+{
+ if (MyWalSnd->causalReadsState == WALSNDCRSTATE_AVAILABLE)
+ {
+ /*
+ * We've lost contact with the standby, but it may still be alive. We
+ * can't let any committing causal_reads transactions return control
+ * until we've stalled for long enough for a zombie standby to start
+ * raising errors because its lease has expired. Because our WalSnd
+ * slot is going away, we need to use the shared
+ * WalSndCtl->revokingUntil variable.
+ */
+ elog(LOG,
+ "contact lost with standby \"%s\", revoking causal reads lease by stalling",
+ application_name);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+ causal_reads_last_lease);
+ LWLockRelease(SyncRepLock);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->causalReadsState = WALSNDCRSTATE_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+ if (MyWalSnd->causalReadsState == WALSNDCRSTATE_AVAILABLE)
+ {
+ /*
+ * The standby is shutting down, so it won't be running any more
+ * transactions. It is therefore safe to stop waiting for it without
+ * any kind of lease revocation protocol.
+ */
+ elog(LOG, "standby \"%s\" is leaving causal reads set", application_name);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->causalReadsState = WALSNDCRSTATE_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
* Clean up after an error.
*
* WAL sender processes don't use transactions like regular backends do.
@@ -308,7 +376,10 @@ WalSndErrorCleanup(void)
replication_active = false;
if (got_STOPPING || got_SIGUSR2)
+ {
+ PrepareUncleanExit();
proc_exit(0);
+ }
/* Revert back to startup state */
WalSndSetState(WALSNDSTATE_STARTUP);
@@ -320,6 +391,8 @@ WalSndErrorCleanup(void)
static void
WalSndShutdown(void)
{
+ PrepareUncleanExit();
+
/*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -1578,6 +1651,7 @@ ProcessRepliesIfAny(void)
if (r < 0)
{
/* unexpected error or EOF */
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1594,6 +1668,7 @@ ProcessRepliesIfAny(void)
resetStringInfo(&reply_message);
if (pq_getmessage(&reply_message, 0))
{
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1643,6 +1718,7 @@ ProcessRepliesIfAny(void)
* 'X' means that the standby is closing down the socket.
*/
case 'X':
+ PrepareCleanExit();
proc_exit(0);
default:
@@ -1740,9 +1816,11 @@ ProcessStandbyReplyMessage(void)
flushLag,
applyLag;
bool clearLagTimes;
+ int64 replyTo;
TimestampTz now;
static bool fullyAppliedLastTime = false;
+ static TimestampTz fullyAppliedSince = 0;
/* the caller already consumed the msgtype byte */
writePtr = pq_getmsgint64(&reply_message);
@@ -1750,6 +1828,7 @@ ProcessStandbyReplyMessage(void)
applyPtr = pq_getmsgint64(&reply_message);
(void) pq_getmsgint64(&reply_message); /* sendTime; not used ATM */
replyRequested = pq_getmsgbyte(&reply_message);
+ replyTo = pq_getmsgint64(&reply_message);
elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
(uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1764,17 +1843,17 @@ ProcessStandbyReplyMessage(void)
applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
/*
- * If the standby reports that it has fully replayed the WAL in two
- * consecutive reply messages, then the second such message must result
- * from wal_receiver_status_interval expiring on the standby. This is a
- * convenient time to forget the lag times measured when it last
- * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
- * until more WAL traffic arrives.
+ * If the standby reports that it has fully replayed the WAL for at least
+ * 10 seconds, then let's clear the lag times that were measured when it
+ * last wrote/flushed/applied a WAL record. This way we avoid displaying
+ * stale lag data until more WAL traffic arrives.
*/
clearLagTimes = false;
if (applyPtr == sentPtr)
{
- if (fullyAppliedLastTime)
+ if (!fullyAppliedLastTime)
+ fullyAppliedSince = now;
+ else if (now - fullyAppliedSince >= 10000000) /* 10 seconds */
clearLagTimes = true;
fullyAppliedLastTime = true;
}
@@ -1790,8 +1869,53 @@ ProcessStandbyReplyMessage(void)
* standby.
*/
{
+ int next_cr_state = -1;
WalSnd *walsnd = MyWalSnd;
+ /* Handle causal reads state machine. */
+ if (am_potential_causal_reads_standby && !am_cascading_walsender)
+ {
+ bool replay_lag_acceptable;
+
+ /* Check if the lag is acceptable (includes -1 for caught up). */
+ if (applyLag < causal_reads_max_replay_lag * 1000)
+ replay_lag_acceptable = true;
+ else
+ replay_lag_acceptable = false;
+
+ /* Figure out the next state, if it needs to change. */
+ switch (walsnd->causalReadsState)
+ {
+ case WALSNDCRSTATE_UNAVAILABLE:
+ /* Can we join? */
+ if (replay_lag_acceptable)
+ next_cr_state = WALSNDCRSTATE_JOINING;
+ break;
+ case WALSNDCRSTATE_JOINING:
+ /* Are we still applying fast enough? */
+ if (replay_lag_acceptable)
+ {
+ /* Have we reached the join point yet? */
+ if (applyPtr >= causal_reads_joining_until)
+ next_cr_state = WALSNDCRSTATE_AVAILABLE;
+ }
+ else
+ next_cr_state = WALSNDCRSTATE_UNAVAILABLE;
+ break;
+ case WALSNDCRSTATE_AVAILABLE:
+ /* Are we still applying fast enough? */
+ if (!replay_lag_acceptable)
+ next_cr_state = WALSNDCRSTATE_REVOKING;
+ break;
+ case WALSNDCRSTATE_REVOKING:
+ /* Has the revocation been acknowledged or timed out? */
+ if (replyTo == causal_reads_revoke_msgno ||
+ now >= walsnd->revokingUntil)
+ next_cr_state = WALSNDCRSTATE_UNAVAILABLE;
+ break;
+ }
+ }
+
SpinLockAcquire(&walsnd->mutex);
walsnd->write = writePtr;
walsnd->flush = flushPtr;
@@ -1802,11 +1926,53 @@ ProcessStandbyReplyMessage(void)
walsnd->flushLag = flushLag;
if (applyLag != -1 || clearLagTimes)
walsnd->applyLag = applyLag;
+ if (next_cr_state != -1)
+ walsnd->causalReadsState = next_cr_state;
+ if (next_cr_state == WALSNDCRSTATE_REVOKING)
+ walsnd->revokingUntil = causal_reads_last_lease;
SpinLockRelease(&walsnd->mutex);
+
+ /* Post shmem-update actions for causal read state transitions. */
+ switch (next_cr_state)
+ {
+ case WALSNDCRSTATE_JOINING:
+ /*
+ * Now that we've started waiting for this standby, we need to
+ * make sure that everything flushed before now has been applied
+ * before we move to available and issue a lease.
+ */
+ causal_reads_joining_until = GetFlushRecPtr();
+ ereport(LOG,
+ (errmsg("standby \"%s\" joining causal reads set...",
+ application_name)));
+ break;
+ case WALSNDCRSTATE_AVAILABLE:
+ /* Issue a new lease to the standby. */
+ WalSndKeepalive(false);
+ ereport(LOG,
+ (errmsg("standby \"%s\" is available for causal reads",
+ application_name)));
+ break;
+ case WALSNDCRSTATE_REVOKING:
+ /* Revoke the standby's lease, and note the message number. */
+ causal_reads_revoke_msgno = WalSndKeepalive(true);
+ ereport(LOG,
+ (errmsg("revoking causal reads lease for standby \"%s\"...",
+ application_name)));
+ break;
+ case WALSNDCRSTATE_UNAVAILABLE:
+ ereport(LOG,
+ (errmsg("standby \"%s\" is no longer available for causal reads",
+ application_name)));
+ break;
+ default:
+ /* No change. */
+ break;
+ }
}
if (!am_cascading_walsender)
- SyncRepReleaseWaiters();
+ SyncRepReleaseWaiters(MyWalSnd->causalReadsState >= WALSNDCRSTATE_JOINING);
/*
* Advance our local xmin horizon when the client confirmed a flush.
@@ -1996,33 +2162,51 @@ ProcessStandbyHSFeedbackMessage(void)
* If wal_sender_timeout is enabled we want to wake up in time to send
* keepalives and to abort the connection if wal_sender_timeout has been
* reached.
+ *
+ * But if causal_reads_max_replay_lag is configured, we override that and
+ * send keepalives at a constant rate to replace expiring leases.
*/
static long
WalSndComputeSleeptime(TimestampTz now)
{
long sleeptime = 10000; /* 10 s */
- if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+ if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+ am_potential_causal_reads_standby)
{
TimestampTz wakeup_time;
long sec_to_timeout;
int microsec_to_timeout;
- /*
- * At the latest stop sleeping once wal_sender_timeout has been
- * reached.
- */
- wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
-
- /*
- * If no ping has been sent yet, wakeup when it's time to do so.
- * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
- * the timeout passed without a response.
- */
- if (!waiting_for_ping_response)
+ if (am_potential_causal_reads_standby)
+ {
+ /*
+ * We need to keep replacing leases before they expire. We'll do
+ * that halfway through the lease time according to our clock, to
+ * allow for the standby's clock to be ahead of the primary's by
+ * 25% of causal_reads_lease_time.
+ */
+ wakeup_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ causal_reads_lease_time / 2);
+ }
+ else
+ {
+ /*
+ * At the latest stop sleeping once wal_sender_timeout has been
+ * reached.
+ */
wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ wal_sender_timeout);
+
+ /*
+ * If no ping has been sent yet, wakeup when it's time to do so.
+ * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+ * half of the timeout passed without a response.
+ */
+ if (!waiting_for_ping_response)
+ wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
+ }
/* Compute relative time until wakeup. */
TimestampDifference(now, wakeup_time,
@@ -2038,20 +2222,33 @@ WalSndComputeSleeptime(TimestampTz now)
/*
* Check whether there have been responses by the client within
* wal_sender_timeout and shutdown if not.
+ *
+ * If causal_reads_max_replay_lag is configured, we override that, so that
+ * unresponsive standbys are detected sooner.
*/
static void
WalSndCheckTimeOut(TimestampTz now)
{
TimestampTz timeout;
+ int allowed_time;
/* don't bail out if we're doing something that doesn't require timeouts */
if (last_reply_timestamp <= 0)
return;
- timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
+ /*
+ * If causal reads support is configured, we use causal_reads_lease_time
+ * instead of wal_sender_timeout, to limit the time before an unresponsive
+ * causal reads standby is dropped.
+ */
+ if (am_potential_causal_reads_standby)
+ allowed_time = causal_reads_lease_time;
+ else
+ allowed_time = wal_sender_timeout;
- if (wal_sender_timeout > 0 && now >= timeout)
+ timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ allowed_time);
+ if (allowed_time > 0 && now >= timeout)
{
/*
* Since typically expiration of replication timeout means
@@ -2079,6 +2276,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
/* Report to pgstat that this process is running */
pgstat_report_activity(STATE_RUNNING, NULL);
+ /* Check if we are managing potential causal_reads standby. */
+ am_potential_causal_reads_standby = CausalReadsPotentialStandby();
+
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
@@ -2243,6 +2443,7 @@ InitWalSenderSlot(void)
walsnd->flushLag = -1;
walsnd->applyLag = -1;
walsnd->state = WALSNDSTATE_STARTUP;
+ walsnd->causalReadsState = WALSNDCRSTATE_UNAVAILABLE;
walsnd->latch = &MyProc->procLatch;
SpinLockRelease(&walsnd->mutex);
/* don't need the lock anymore */
@@ -3125,6 +3326,27 @@ WalSndGetStateString(WalSndState state)
return "UNKNOWN";
}
+/*
+ * Return a string constant representing the causal reads state. This is used
+ * in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetCausalReadsStateString(WalSndCausalReadsState causal_reads_state)
+{
+ switch (causal_reads_state)
+ {
+ case WALSNDCRSTATE_UNAVAILABLE:
+ return "unavailable";
+ case WALSNDCRSTATE_JOINING:
+ return "joining";
+ case WALSNDCRSTATE_AVAILABLE:
+ return "available";
+ case WALSNDCRSTATE_REVOKING:
+ return "revoking";
+ }
+ return "UNKNOWN";
+}
+
static Interval *
offset_to_interval(TimeOffset offset)
{
@@ -3144,7 +3366,7 @@ offset_to_interval(TimeOffset offset)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 11
+#define PG_STAT_GET_WAL_SENDERS_COLS 12
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -3197,6 +3419,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
TimeOffset applyLag;
int priority;
WalSndState state;
+ WalSndCausalReadsState causalReadsState;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -3206,6 +3429,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
SpinLockAcquire(&walsnd->mutex);
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ causalReadsState = walsnd->causalReadsState;
write = walsnd->write;
flush = walsnd->flush;
apply = walsnd->apply;
@@ -3288,6 +3512,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[10] = CStringGetTextDatum("potential");
+
+ values[11] =
+ CStringGetTextDatum(WalSndGetCausalReadsStateString(causalReadsState));
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3303,21 +3530,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* This function is used to send a keepalive message to standby.
* If requestReply is set, sets a flag in the message requesting the standby
* to send a message back to us, for heartbeat purposes.
+ * Return the serial number of the message that was sent.
*/
-static void
+static int64
WalSndKeepalive(bool requestReply)
{
+ TimestampTz causal_reads_lease;
+ TimestampTz now;
+
+ static int64 message_number = 0;
+
elog(DEBUG2, "sending replication keepalive");
+ /* Grant a causal reads lease if appropriate. */
+ now = GetCurrentTimestamp();
+ if (MyWalSnd->causalReadsState != WALSNDCRSTATE_AVAILABLE)
+ {
+ /* No lease granted, and any earlier lease is revoked. */
+ causal_reads_lease = 0;
+ }
+ else
+ {
+ /*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew. We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ *
+ * 1. If the standby's clock is slow (i.e. behind the primary's) by up
+ * to that much, then subtracting this amount makes sure the lease
+ * doesn't survive past that time according to the primary's clock.
+ *
+ * 2. If the standby's clock is fast (i.e. ahead of the primary's) by
+ * up to that much, then even after subtracting this amount there won't
+ * be any gaps between leases, since leases are reissued every time 50%
+ * of the lease time elapses (see WalSndKeepaliveIfNecessary and
+ * WalSndComputeSleeptime).
+ */
+ int max_clock_skew = causal_reads_lease_time / 4;
+
+ /* Compute and remember the expiry time of the lease we're granting. */
+ causal_reads_last_lease =
+ TimestampTzPlusMilliseconds(now, causal_reads_lease_time);
+ /* Adjust the version we send for clock skew. */
+ causal_reads_lease =
+ TimestampTzPlusMilliseconds(causal_reads_last_lease,
+ -max_clock_skew);
+ }
+
/* construct the message... */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'k');
+ pq_sendint64(&output_message, ++message_number);
pq_sendint64(&output_message, sentPtr);
- pq_sendint64(&output_message, GetCurrentTimestamp());
+ pq_sendint64(&output_message, now);
pq_sendbyte(&output_message, requestReply ? 1 : 0);
+ pq_sendint64(&output_message, causal_reads_lease);
/* ... and send it wrapped in CopyData */
pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+ return message_number;
}
/*
@@ -3332,23 +3607,35 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
*/
- if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
- return;
-
- if (waiting_for_ping_response)
- return;
+ if (!am_potential_causal_reads_standby)
+ {
+ if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+ return;
+ if (waiting_for_ping_response)
+ return;
+ }
/*
* If half of wal_sender_timeout has lapsed without receiving any reply
* from the standby, send a keep-alive message to the standby requesting
* an immediate reply.
+ *
+ * If causal_reads_max_replay_lag has been configured, use
+ * causal_reads_lease_time to control keepalive intervals rather than
+ * wal_sender_timeout, so that we can keep replacing leases at the right
+ * frequency.
*/
- ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ if (am_potential_causal_reads_standby)
+ ping_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ causal_reads_lease_time / 2);
+ else
+ ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
if (now >= ping_time)
{
WalSndKeepalive(true);
waiting_for_ping_response = true;
+ last_keepalive_timestamp = now;
/* Try to flush pending output to the client */
if (pq_flush_if_writable() != 0)
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 4f354717628..89e49e2c42e 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -307,6 +307,7 @@ Section: Class 40 - Transaction Rollback
40001 E ERRCODE_T_R_SERIALIZATION_FAILURE serialization_failure
40003 E ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN statement_completion_unknown
40P01 E ERRCODE_T_R_DEADLOCK_DETECTED deadlock_detected
+40P02 E ERRCODE_T_R_CAUSAL_READS_NOT_AVAILABLE causal_reads_not_available
Section: Class 42 - Syntax Error or Access Rule Violation
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 82e54c084b8..d746e6eb0bf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1647,6 +1647,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"causal_reads", PGC_USERSET, REPLICATION_STANDBY,
+ gettext_noop("Enables causal reads."),
+ NULL
+ },
+ &causal_reads,
+ false,
+ NULL, NULL, NULL
+ },
+
+ {
{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
NULL
@@ -2885,6 +2895,28 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"causal_reads_max_replay_lag", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the maximum allowed replay lag before causal reads standbys are no longer available."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &causal_reads_max_replay_lag,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"causal_reads_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the duration of read leases used to implement causal reads."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &causal_reads_lease_time,
+ 5000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3563,7 +3595,18 @@ static struct config_string ConfigureNamesString[] =
},
&SyncRepStandbyNames,
"",
- check_synchronous_standby_names, assign_synchronous_standby_names, NULL
+ check_standby_names, assign_synchronous_standby_names, NULL
+ },
+
+ {
+ {"causal_reads_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("List of names of potential causal reads standbys."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &causal_reads_standby_names,
+ "*",
+ check_standby_names, NULL, NULL
},
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2b1ebb797ec..91cdc65f218 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -250,6 +250,18 @@
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
+#causal_reads_max_replay_lag = 0s # maximum replication delay to tolerate from
+ # standbys before dropping them from the set of
+ # available causal reads peers; 0 to disable
+ # causal reads
+
+#causal_reads_lease_time = 5s # how long individual leases granted to causal
+ # reads standbys should last; should be 4 times
+ # the max possible clock skew
+
+#causal_reads_standby_names = '*' # standby servers that can potentially become
+ # available for causal reads; '*' = all
+
# - Standby Servers -
# These settings are ignored on a master server.
@@ -279,6 +291,14 @@
#max_logical_replication_workers = 4 # taken from max_worker_processes
#max_sync_workers_per_subscription = 2 # taken from max_logical_replication_workers
+# - All Servers -
+
+#causal_reads = off # "on" guarantees that, in any pair of
+ # consecutive causal_reads transactions,
+ # the second can see the first (even if
+ # it is run on a standby), or an error is
+ # raised to report that the standby is
+ # unavailable for causal reads
#------------------------------------------------------------------------------
# QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6369be78a31..9ed1aa66042 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -332,6 +334,16 @@ GetTransactionSnapshot(void)
"cannot take query snapshot during a parallel operation");
/*
+ * In causal_reads mode on a standby, check if we have definitely
+ * applied WAL for any COMMIT that returned successfully on the
+ * primary.
+ */
+ if (causal_reads && RecoveryInProgress() && !WalRcvCausalReadsAvailable())
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_CAUSAL_READS_NOT_AVAILABLE),
+ errmsg("standby is not available for causal reads")));
+
+ /*
* In transaction-snapshot mode, the first snapshot must live until
* end of xact regardless of what the caller does with it, so we must
* make a copy of it rather than returning CurrentSnapshotData
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 6811a55e764..02eaf97247f 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -117,7 +117,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
/*
@@ -150,6 +150,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
startpos = output_written_lsn;
last_written_lsn = output_written_lsn;
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 15932c60b5a..501ecc849d1 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -325,7 +325,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
static bool
sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
{
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
replybuf[len] = 'r';
@@ -343,6 +343,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
{
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 1191b4ab1bd..bd00f374f53 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2832,7 +2832,7 @@ DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f
DESCR("statistics: information about currently active backends");
DATA(insert OID = 3318 ( pg_stat_get_progress_info PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,causal_reads_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 3317 ( pg_stat_get_wal_receiver PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
DESCR("statistics: information about WAL receiver");
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6bffe63ad6b..8f383c128ae 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -801,6 +801,7 @@ typedef enum
WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
WAIT_EVENT_BGWORKER_STARTUP,
WAIT_EVENT_BTREE_PAGE,
+ WAIT_EVENT_CAUSAL_READS_APPLY,
WAIT_EVENT_EXECUTE_GATHER,
WAIT_EVENT_MQ_INTERNAL,
WAIT_EVENT_MQ_PUT_MESSAGE,
@@ -824,6 +825,7 @@ typedef enum
typedef enum
{
WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+ WAIT_EVENT_CAUSAL_READS_REVOKE,
WAIT_EVENT_PG_SLEEP,
WAIT_EVENT_RECOVERY_APPLY_DELAY
} WaitEventTimeout;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ceafe2cbea1..3d8c254ad46 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "utils/guc.h"
+#include "utils/timestamp.h"
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
#define SYNC_REP_WAIT_WRITE 0
#define SYNC_REP_WAIT_FLUSH 1
#define SYNC_REP_WAIT_APPLY 2
+#define SYNC_REP_WAIT_CAUSAL_READS 3
-#define NUM_SYNC_REP_WAIT_MODE 3
+#define NUM_SYNC_REP_WAIT_MODE 4
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
@@ -36,6 +38,12 @@
#define SYNC_REP_PRIORITY 0
#define SYNC_REP_QUORUM 1
+/* GUC variables */
+extern int causal_reads_max_replay_lag;
+extern int causal_reads_lease_time;
+extern bool causal_reads;
+extern char *causal_reads_standby_names;
+
/*
* Struct for the configuration of synchronous replication.
*
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
/* called by wal sender */
extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
/* called by wal sender and user backend */
extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,15 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
/* called by checkpointer */
extern void SyncRepUpdateSyncStandbysDefined(void);
+/* called by user backend (xact.c) */
+extern void CausalReadsWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* called by wal sender */
+extern void CausalReadsRevokeLease(TimestampTz lease_expiry_time);
+extern bool CausalReadsPotentialStandby(void);
+
/* GUC infrastructure */
-extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c8652dbd489..6eb188c88c1 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
TimeLineID receivedTLI;
/*
+ * causalReadsLease is the time until which the primary has authorized
+ * this standby to consider itself available for causal_reads mode, or 0
+ * for not authorized.
+ */
+ TimestampTz causalReadsLease;
+
+ /*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
* receivedUpto before the last flush to disk. Startup process can use
@@ -298,4 +305,6 @@ extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
+extern bool WalRcvCausalReadsAvailable(void);
+
#endif /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0aa80d5c3e2..b5b4d392f8c 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
WALSNDSTATE_STOPPING
} WalSndState;
+typedef enum WalSndCausalReadsState
+{
+ WALSNDCRSTATE_UNAVAILABLE = 0,
+ WALSNDCRSTATE_JOINING,
+ WALSNDCRSTATE_AVAILABLE,
+ WALSNDCRSTATE_REVOKING
+} WalSndCausalReadsState;
+
/*
* Each walsender has a WalSnd struct in shared memory.
*/
@@ -53,6 +61,10 @@ typedef struct WalSnd
TimeOffset flushLag;
TimeOffset applyLag;
+ /* Causal reads state for this walsender. */
+ WalSndCausalReadsState causalReadsState;
+ TimestampTz revokingUntil;
+
/* Protects shared variables shown above. */
slock_t mutex;
@@ -94,6 +106,14 @@ typedef struct
*/
bool sync_standbys_defined;
+ /*
+ * Until when must commits in causal_reads stall? This is used to wait
+ * for causal reads leases to expire when a walsender exits uncleanly,
+ * and we must stall causal reads commits until we're sure that the remote
+ * server's lease has expired.
+ */
+ TimestampTz revokingUntil;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2e42b9ec05f..1f5801bc450 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1859,9 +1859,10 @@ pg_stat_replication| SELECT s.pid,
w.flush_lag,
w.replay_lag,
w.sync_priority,
- w.sync_state
+ w.sync_state,
+ w.causal_reads_state
FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
- JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+ JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, causal_reads_state) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_ssl| SELECT s.pid,
s.ssl,
On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Apply the patch after first applying a small bug fix for replication
> lag tracking[4]. Then:
That bug fix was committed, so now causal-reads-v17.patch can be
applied directly on top of master.
> 1. Set up some streaming replicas.
> 2. Stick causal_reads_max_replay_lag = 2s (or any time you like) in
> the primary's postgresql.conf.
> 3. Set causal_reads = on in some transactions on various nodes.
> 4. Try to break it!
Someone asked me off-list how to set this up quickly and easily for
testing. Here is a shell script that will start up a primary server
(port 5432) and 3 replicas (ports 5441 to 5443). Set the two paths at
the top of the file before running it. Log in with psql postgres [-p
<port>], then SET causal_reads = on to test its effect.
causal_reads_max_replay_lag is set to 2s and depending on your
hardware you might find that stuff like CREATE TABLE big_table AS
SELECT generate_series(1, 10000000) or a large COPY data load causes
replicas to be kicked out of the set after a while; you can also pause
replay on the replicas with SELECT pg_wal_replay_pause() and
pg_wal_replay_resume(), kill -STOP/-CONT or -9 the walreceiver
processes to simulate various failure modes, or run the replicas
remotely and unplug the network. SELECT application_name, replay_lag,
causal_reads_state FROM pg_stat_replication to see the current
situation, and also monitor the primary's LOG messages about
transitions. You should find that the
"read-your-writes-or-fail-explicitly" guarantee is upheld, no matter
what you do, and furthermore that failing or lagging replicas don't
hold the primary up for very long: in the worst case
causal_reads_lease_time for lost contact, and in the best case the
time to exchange a couple of messages with the standby to tell it its
lease is revoked and it should start raising an error. You might find
test-causal-reads.c[1] useful for testing.
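To make that concrete, a manual test session looks roughly like this
(the table name is just for illustration; ports match the script above):

  -- on the primary (psql postgres -p 5432)
  SET causal_reads = on;
  CREATE TABLE causal_demo AS SELECT 42 AS answer;  -- commit waits for all
                                                    -- joining/available standbys

  -- on any replica (psql postgres -p 5441)
  SET causal_reads = on;
  SELECT answer FROM causal_demo;
  -- either sees the row just committed on the primary, or fails with
  -- ERROR:  standby is not available for causal reads

  -- back on the primary: watch standbys move between states
  SELECT application_name, replay_lag, causal_reads_state
  FROM pg_stat_replication;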
> Maybe it needs a better name.
Ok, how about this: the feature could be called "synchronous replay".
The new column in pg_stat_replication could be called sync_replay
(like the other sync_XXX columns). The GUCs could be called
synchronous_replay, synchronous_replay_max_lag and
synchronous_replay_lease_time. The language in log messages could
refer to standbys "joining the synchronous replay set".
Restating the purpose of the feature with that terminology: If
synchronous_replay is set to on, then you see the effects of all
synchronous_replay = on transactions that committed before your
transaction began, or an error is raised if that is not possible on
the current node. This allows applications to direct read-only
queries to read-only replicas for load balancing without seeing stale
data. Is that clearer?
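To illustrate the guarantee as two sessions (a sketch, using the
proposed names and an imaginary accounts table):

    -- Session 1, connected to the primary:
    SET synchronous_replay = on;
    UPDATE accounts SET balance = balance - 100 WHERE id = 42;
    -- control returns only after every standby currently in the
    -- synchronous replay set has applied the commit (or has had its
    -- lease revoked)

    -- Session 2, connected to any standby, begun after session 1 returned:
    SET synchronous_replay = on;
    SELECT balance FROM accounts WHERE id = 42;
    -- either sees the new balance, or fails with "standby is not
    -- available for synchronous replay"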
Restating the relationship with synchronous replication with that
terminology: while synchronous_commit and synchronous_standby_names
are concerned with distributed durability, synchronous_replay is
concerned with distributed visibility. While the former prevents
commits from returning if the configured level of durability isn't met
(for example "must be flushed on master + any 2 standbys"), the latter
will simply drop any standbys from the synchronous replay set if they
fail or lag more than synchronous_replay_max_lag. It is reasonable to
want to use both features at once: my policy on distributed
durability might be that I want all transactions to be flushed to disk
on master + any of three servers before I report information to users,
and my policy on distributed visibility might be that I want to be
able to run read-only queries on any of my six read-only replicas, but
don't want to wait for any that lag by more than 1 second.
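For example, a primary configured for both policies might look like
this (a sketch; the standby names and values are invented, the 'ANY 2'
form echoes the example above, and the settings could equally go in
postgresql.conf):

    -- Distributed durability: don't return from COMMIT until the record
    -- is flushed on the primary plus any two of these standbys.
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (s1, s2, s3)';
    ALTER SYSTEM SET synchronous_commit = on;

    -- Distributed visibility: any standby may serve synchronous_replay
    -- queries, but is dropped from the set if it lags by more than 1s.
    ALTER SYSTEM SET synchronous_replay_max_lag = '1s';
    SELECT pg_reload_conf();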
Thoughts?
[1]: /messages/by-id/CAEepm=3NF=7eLkVR2fefVF9bg6RxpZXoQFmOP3RWE4r4iuO7vg@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com
On 3 January 2017 at 01:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Here is a new version of my "causal reads" patch (see the earlier
thread from the 9.6 development cycle[1]), which provides a way to
avoid stale reads when load balancing with streaming replication.
I'm very happy that you are addressing this topic.
I noticed you didn't put in links to my earlier doubts about this
specific scheme, though I can see doubts from myself and Heikki at
least in the URLs. I maintain those doubts as to whether this is the
right way forwards.
This patch presumes we will load balance writes to a master and reads
to a pool of standbys. How will we achieve that?
1. We decorate the application with additional info to indicate
routing/write concerns.
2. We get middleware to do routing for us, e.g. pgpool style read/write routing
The explicit premise of the patch is that neither of the above options
are practical, so I'm unclear how this makes sense. Is there some use
case that you have in mind that has not been fully described? If so,
lets get it on the table.
What I think we need is a joined up plan for load balancing, so that
we can understand how it will work. i.e. explain the whole use case
and how the solution works.
I'm especially uncomfortable with any approaches that treat all
sessions as one pool. For me, a server should support multiple pools.
Causality seems to be a property of a particular set of pools. e.g.
PoolS1 supports causal reads against writes to PoolM1 but not PoolM2,
yet PoolS2 does not provide causal reads against PoolM1 or PoolM2.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I'm very happy that you are addressing this topic.
I noticed you didn't put in links to my earlier doubts about this
specific scheme, though I can see doubts from myself and Heikki at
least in the URLs. I maintain those doubts as to whether this is the
right way forwards.
One technical problem was raised in the earlier thread by Ants Aasma
that I concede may be fatal to this design (see the note about
read-follows-read below), but I'm not sure. All the other discussion
seemed to be about trade-offs between writer-waits and reader-waits
schemes, both of which I still view as reasonable options for an end
user to have in the toolbox. Your initial reaction was:
What we want is to isolate the wait only to people performing a write-read
sequence, so I think it should be readers that wait.
I agree with you 100%, up to the comma. The difficulty is identifying
which transactions are part of a write-read sequence. An
application-managed LSN tracking system allows for waits to occur
strictly in reads that are part of a write-read sequence because the
application links them explicitly, and nobody is arguing that we
shouldn't support that for hard working expert users. But to support
applications that don't deal with LSNs (or some kind of "causality
tokens") explicitly I think we'll finish up having to make readers
wait for incidental later transactions too, not just the write that
your read is dependent on, as I'll show below. When you can't
identify write-read sequences perfectly, it comes down to a choice
between taxing writers or taxing readers, and I'm not sure that one
approach is universally better. Let me summarise my understanding of
that trade-off.
I'm going to use this terminology:
synchronous replay = my proposal: ask the primary server to wait until
standbys have applied tx1, a bit like 9.6 synchronous_commit =
remote_apply, but with a system of leases for graceful failure.
causality tokens = the approach Heikki mentioned: provide a way for
tx1 to report its commit LSN to the client, then provide a way for a
client to wait for the LSN to be replayed before taking a snapshot for
tx2.
tx1, tx2 = a pair of transactions with a causal dependency; we want
tx2 to see tx1 because tx1 caused tx2 in some sense so must be seen to
precede it.
A poor man's causality token system can be cobbled together today with
pg_current_wal_lsn() and a polling loop that checks
pg_last_wal_replay_lsn(). It's a fairly obvious thing to want to do.
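For reference, that do-it-yourself version looks something like this
(a sketch; a real client would poll pg_last_wal_replay_lsn() in a loop
with a timeout, and the orders table is imaginary):

    -- tx1, on the primary: do the write, then capture a token.
    INSERT INTO orders (id) VALUES (42);
    SELECT pg_current_wal_lsn();             -- returns e.g. '0/3000060'

    -- tx2, on a standby: don't take a snapshot until the token is replayed.
    SELECT pg_last_wal_replay_lsn() >= '0/3000060'::pg_lsn;  -- poll until true
    SELECT * FROM orders WHERE id = 42;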
Several people including Heikki Linnakangas, Craig Ringer, Ants Aasma,
Ivan Kartyshov and probably many others have discussed better ways to
do that[1], and a patch for the wait-for-LSN piece appeared in a
recent commitfest[2]. I reviewed Ivan's patch and voted -1 only
because it didn't work for higher isolation levels. If he continues
to develop that I will be happy to review and help get it into
committable shape, and if he doesn't I may try to develop it myself.
In short, I think this is a good tool to have in the toolbox and
PostgreSQL 11 should have it! But I don't think it necessarily
invalidates my synchronous replay proposal: they represent different
sets of trade-offs and might appeal to different users. Here's why:
To actually use a causality token system you either need a carefully
designed application that keeps track of causal dependencies and
tokens, in which case the developer works harder but can benefit
from an asynchronous pipelining effect (by the time tx2 runs we hope
that tx1 has been applied, so neither transaction had to wait). Let's
call that "application-managed causality tokens". That's great for
those users -- let's make that possible -- but most users don't want
to write code like that. So I see approximately three choices for
transparent middleware (or support built into standbys), which I'll
name and describe as follows:
1. "Panoptic" middleware: Sees all queries so that it can observe
commit LSNs and inject wait directives into all following read-only
transactions. Since all queries now need to pass through a single
process, you have a new bottleneck, an extra network hop, and a single
point of failure so you'll probably want a failover system with
split-brain defences.
2. "Hybrid" middleware: The middleware (or standby itself) listens to
the replication stream so that it knows about commit LSNs (rather than
spying on commits). The primary server waits until all connected
middleware instances acknowledge commit LSNs before it releases
commits, and then the middleware inserts wait-for-LSN directives into
read-only transactions. Now there is no single point problem, but
writers are impacted too. I mention this design because I believe
this is conceptually similar to how Galera wsrep_sync_wait (AKA
wsrep_causal_reads) works. (I call this "hybrid" because it splits
the waiting between tx1 and tx2. Since it's synchronous, dealing with
failure gracefully is tricky, probably needing a lease system like SR.
I acknowledge that comparisons between our streaming replication and
Galera are slightly bogus because Galera is a synchronous multi-master
system.)
3. "Forward-only" middleware (insert better name): The middleware (or
standby itself) asks the primary server for the latest committed LSN
at the start of every transaction, and then tells the standby to wait
for that LSN to be applied.
There are probably some other schemes involving communicating
middleware instances, but I don't think they'll be better in ways that
matter -- am I wrong?
Here's a trade-off table:
                                           SR   AT   PT   HT   FT
tx1 commit waits?                         yes   no   no  yes   no
tx2 snapshot waits?                        no  yes  yes  yes  yes
tx2 waits for incidental transactions?     no   no  yes  yes  yes
tx2 has round-trip to primary?             no   no   no   no  yes
can tx2's wait be pipelined?              n/a  yes  no*  no*  no*
SR = synchronous replay
AT = application-managed causality tokens
PT = panoptic middleware-managed causality tokens
HT = hybrid middleware-managed or standby-managed causality tokens
FT = forward-only middleware-managed causality tokens
*Note that only synchronous replay and application-managed causality
tokens track the actual causal dependency tx2->tx1. SR does it by
making tx1 wait for replay so that tx2 doesn't have to wait at all and
AT does it by making tx2 wait specifically for tx1 to be applied. PT,
HT and FT don't actually know anything about tx1, so they make every
read query wait until *all known transactions* are applied
("incidental transactions" above), throwing away the pipelining
benefits of causality token systems (hence "no*" above). I haven't
used it myself but I have heard that that is why read latency is a
problem on Galera with causal reads mode enabled: due to lack of
better information you have to wait for the replication system to
drain its current apply queue before every query is processed, even if
tx1 in your causally dependent transaction pair was already visible on
the current node.
So far I think that SR and AT are sweet spots. AT for people who are
prepared to juggle causality tokens in their applications and SR for
people who want to remain oblivious to all this stuff and who can
tolerate a reduction in single-client write TPS. I also think AT's
pipelining advantage over SR and SR's single-client TPS impact are
diminished if you also choose to enable syncrep for durability, which
isn't a crazy thing to want to do if you're doing anything important
with your data. The other models where all readers wait for
incidental transactions don't seem terribly attractive to me,
especially if the motivating premise of load balancing with read-only
replicas is (to steal a line) "ye [readers] are many, they are few".
One significant blow my proposal received in the last thread was a
comment from Ants about read-follows-read[3]. What do you think? I
suspect the same problem applies to causality token based systems as
discussed so far (except perhaps FT, the slowest and probably least
acceptable to anyone). On the other hand, I think it's at least
possible to fix that problem with causality tokens. You'd have to
expose and capture the last-commit-LSN for every snapshot used in
every single read query, and wait for it at the moment every following
read query takes a new snapshot. This would make AT even harder work
for developers, make PT even slower, and make HT unworkable (it only
knows about commits, not reads). I also suspect that a sizeable class
of applications cares a lot less about read-follows-read than
read-follows-write, but I could be wrong about that.
This patch presumes we will load balance writes to a master and reads
to a pool of standbys. How will we achieve that?
1. We decorate the application with additional info to indicate
routing/write concerns.
2. We get middleware to do routing for us, e.g. pgpool style read/write routing
The explicit premise of the patch is that neither of the above options
are practical, so I'm unclear how this makes sense. Is there some use
case that you have in mind that has not been fully described? If so,
lets get it on the table.
I don't think that pgpool routing is impractical, just that it's not a
great place to put transparent causality token tracking for the
reasons I've explained above -- you'll introduce a ton of latency for
all readers because you can't tell which earlier transactions they
might be causally dependent on. I think it's also nice to be able to
support the in-process connection pooling and routing that many
application developers use to avoid extra hops, so it'd be nice to
avoid making pooling/routing/proxy servers strictly necessary.
What I think we need is a joined up plan for load balancing, so that
we can understand how it will work. i.e. explain the whole use case
and how the solution works.
Here are some ways you could set a system up:
1. Use middleware like pgpool or pgbouncer-rr to route queries
automatically; this is probably limited to single-statement queries,
since multi-statement queries can't be judged by their first statement
alone. (Those types of systems could be taught to understand a
request for a connection with causal reads enabled, and look at the
current set of usable standbys by looking at the pg_stat_replication
table; see the query sketch just after this list.)
2. Use the connection pooling inside your application server or
application framework/library: for example Hibernate[4], Django[5] and
many other libraries offer ways to configure multiple database
connection pools and route queries appropriately at a fairly high
level. Such systems could probably be improved to handle 'synchronous
replay not available' errors by throwing away the connection and
retrying automatically on another connection, much as they do for
serialization failures and deadlocks.
3. Modify your application to deal with separate connection pools
directly wherever it runs database transactions.
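As a concrete illustration of option 1, the routing layer could
periodically refresh its list of usable standbys with something like
this on the primary (a sketch; the column is sync_replay in the v1
patch attached later in this thread, causal_reads_state in v17):

    SELECT application_name, replay_lag
    FROM pg_stat_replication
    WHERE sync_replay = 'available';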
Perhaps I'm not thinking big enough: I tried to come up with an
incremental improvement to PostgreSQL that would fix a problem that I
know people have with their current hot standby deployment. I
deliberately avoided proposing radical architectural projects such as
moving cluster management, discovery, proxying, pooling and routing
responsibilities into PostgreSQL. Perhaps those working on GTM type
systems which effectively present a seamless single system find this
whole discussion to be aiming too low and dealing with the wrong
problems.
I'm especially uncomfortable with any approaches that treat all
sessions as one pool. For me, a server should support multiple pools.
Causality seems to be a property of a particular set of pools. e.g.
PoolS1 supports causal reads against writes to PoolM1 but not PoolM2,
yet PoolS2 does not provide causal reads against PoolM1 or PoolM2.
Interesting, but I don't immediately see any fundamental difficulty
for any of the designs discussed. For example, maybe tx1 should be
able to set synchronous_replay = <group name>, rather than just 'on',
to refer to a group of standbys defined in some GUC.
Just by the way, while looking for references I found
PinningMasterSlaveRouter which provides a cute example of demand for
causal reads (however implemented) in the Django community:
https://github.com/jbalogh/django-multidb-router
It usually sends read-only transactions to standbys, but keeps your
web session temporarily pinned to the primary database to give you 15
seconds' worth of read-your-writes after each write transaction.
[1]: /messages/by-id/53E2D346.9030806@2ndquadrant.com
[2]: /messages/by-id/0240c26c-9f84-30ea-fca9-93ab2df5f305@postgrespro.ru
[3]: /messages/by-id/CAEepm=15WC7A9Zdj2Qbw3CUDXWHe69d=nBpf+jXui7OYXXq11w@mail.gmail.com
[4]: https://stackoverflow.com/questions/25911359/read-write-splitting-hibernate
[5]: https://github.com/yandex/django_replicated (for example; several similar extensions exist)
--
Thomas Munro
http://www.enterprisedb.com
On Sat, Jun 24, 2017 at 4:05 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Fri, Jun 23, 2017 at 11:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Maybe it needs a better name.
Ok, how about this: the feature could be called "synchronous replay".
The new column in pg_stat_replication could be called sync_replay
(like the other sync_XXX columns). The GUCs could be called
synchronous_replay, synchronous_replay_max_lag and
synchronous_replay_lease_time. The language in log messages could
refer to standbys "joining the synchronous replay set".
Feature hereby renamed that way. It seems a lot more
self-explanatory. Please see attached.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
synchronous-replay-v1.patch
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2485e6190dc..9bf46f0ba13 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2905,6 +2905,36 @@ include_dir 'conf.d'
across the cluster without problems if that is required.
</para>
+ <sect2 id="runtime-config-replication-all">
+ <title>All Servers</title>
+ <para>
+ These parameters can be set on the primary or any standby.
+ </para>
+ <variablelist>
+ <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+ <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables causal consistency between transactions run on different
+ servers. A transaction that is run on a standby
+ with <varname>synchronous_replay</> set to <literal>on</> is
+ guaranteed either to see the effects of all completed transactions
+ run on the primary with the setting on, or to receive an error
+ "standby is not available for synchronous replay". Note that both
+ transactions involved in a causal dependency (a write on the primary
+ followed by a read on any server which must see the write) must be
+ run with the setting on. See <xref linkend="synchronous-replay"> for
+ more details.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-replication-sender">
<title>Sending Server(s)</title>
@@ -3206,6 +3236,66 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><varname>synchronous_replay_max_lag</varname>
+ (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_max_lag</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the maximum replay lag the primary will tolerate from a
+ standby before dropping it from the synchronous replay set.
+ </para>
+ <para>
+ This must be set to a value which is at least 4 times the maximum
+ possible difference in system clocks between the primary and standby
+ servers, as described in <xref linkend="synchronous-replay">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_lease_time</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the duration of 'leases' sent by the primary server to
+ standbys granting them the right to run synchronous replay queries for
+ a limited time. This affects the rate at which replacement leases
+ must be sent and the wait time if contact is lost with a standby, as
+ described in <xref linkend="synchronous-replay">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous-replay-standby-names">
+ <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_standby_names</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a comma-separated list of standby names that can support
+ <firstterm>synchronous replay</>, as described in
+ <xref linkend="synchronous-replay">. Follows the same convention
+ as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</></>.
+ The default is <literal>*</>, matching all standbys.
+ </para>
+ <para>
+ This setting has no effect if <varname>synchronous_replay_max_lag</>
+ is not set.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index e41df791b76..b8ff329e1ea 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1115,7 +1115,7 @@ primary_slot_name = 'node_a_slot'
cause each commit to wait until the current synchronous standbys report
that they have replayed the transaction, making it visible to user
queries. In simple cases, this allows for load balancing with causal
- consistency.
+ consistency. See also <xref linkend="synchronous-replay">.
</para>
<para>
@@ -1313,6 +1313,119 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="synchronous-replay">
+ <title>Synchronous replay</title>
+ <indexterm>
+ <primary>synchronous replay</primary>
+ <secondary>in standby</secondary>
+ </indexterm>
+
+ <para>
+ The synchronous replay feature allows read-only queries to run on hot
+ standby servers without exposing stale data to the client, providing a
+ form of causal consistency. Transactions can run on any standby with the
+ following guarantee about the visibility of preceding transactions: If you
+ set <varname>synchronous_replay</> to <literal>on</> in any pair of
+ consecutive transactions tx1, tx2 where tx2 begins after tx1 successfully
+ returns, then tx2 will either see tx1 or fail with a new error "standby is
+ not available for synchronous replay", no matter which server it runs on.
+ Although the guarantee is expressed in terms of two individual
+ transactions, the GUC can also be set at session, role or system level to
+ make the guarantee generally, allowing for load balancing of applications
+ that were not designed with load balancing in mind.
+ </para>
+
+ <para>
+ In order to enable the feature, <varname>synchronous_replay_max_lag</>
+ must be set to a non-zero value on the primary server. The
+ GUC <varname>synchronous_replay_standby_names</> can be used to limit the
+ set of standbys that can join the dynamic set of synchronous replay
+ standbys by providing a comma-separated list of application names. By
+ default, all standbys are candidates, if the feature is enabled.
+ </para>
+
+ <para>
+ The current set of servers that the primary considers to be available for
+ synchronous replay can be seen in
+ the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</></>
+ view. Administrators, applications and load balancing middleware can use
+ this view to discover standbys that can currently handle synchronous
+ replay transactions without raising the error. Since that information is
+ only an instantaneous snapshot, clients should still be prepared for
+ the error to be raised at any time, and consider redirecting transactions
+ to another standby.
+ </para>
+
+ <para>
+ The advantages of the synchronous replay feature over simply
+ setting <varname>synchronous_commit</> to <literal>remote_apply</> are:
+ <orderedlist>
+ <listitem>
+ <para>
+ It provides certainty about exactly which standbys can see a
+ transaction.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It places a configurable limit on how much replay lag (and therefore
+ delay at commit time) the primary tolerates from standbys before it
+ drops them from the dynamic set of standbys it waits for.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It upholds the synchronous replay guarantee during the transitions that
+ occur when new standbys are added or removed from the set of standbys,
+ including scenarios where contact has been lost between the primary
+ and standbys but the standby is still alive and running client
+ queries.
+ </para>
+ </listitem>
+ </orderedlist>
+ </para>
+
+ <para>
+ The protocol used to uphold the guarantee even in the case of network
+ failure depends on the system clocks of the primary and standby servers
+ being synchronized, with an allowance for a difference up to one quarter
+ of <varname>synchronous_replay_lease_time</>. For example,
+ if <varname>synchronous_replay_lease_time</> is set to <literal>5s</>,
+ then the clocks must not be more than 1.25 seconds apart for the guarantee
+ to be upheld reliably during transitions. The ubiquity of the Network
+ Time Protocol (NTP) on modern operating systems and availability of high
+ quality time servers makes it possible to choose a tolerance significantly
+ higher than the maximum expected clock difference. An effort is
+ nevertheless made to detect and report misconfigured and faulty systems
+ with clock differences greater than the configured tolerance.
+ </para>
+
+ <note>
+ <para>
+ Current hardware clocks, NTP implementations and public time servers are
+ unlikely to allow the system clocks to differ more than tens or hundreds
+ of milliseconds, and systems synchronized with dedicated local time
+ servers may be considerably more accurate, but you should only consider
+ setting <varname>synchronous_replay_lease_time</> below the default of 5
+ seconds (allowing up to 1.25 seconds of clock difference) after
+ researching your time synchronization infrastructure thoroughly.
+ </para>
+ </note>
+
+ <note>
+ <para>
+ While similar to synchronous commit in the sense that both involve the
+ primary server waiting for responses from standby servers, the
+ synchronous replay feature is not concerned with avoiding data loss. A
+ primary configured for synchronous replay will drop all standbys that
+ stop responding or replay too slowly from the dynamic set that it waits
+ for, so you should consider configuring both synchronous replication and
+ synchronous replay if you need data loss avoidance guarantees and causal
+ consistency guarantees for load balancing.
+ </para>
+ </note>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous archiving in standby</title>
@@ -1661,7 +1774,16 @@ if (!triggered)
so there will be a measurable delay between primary and standby. Running the
same query nearly simultaneously on both primary and standby might therefore
return differing results. We say that data on the standby is
- <firstterm>eventually consistent</firstterm> with the primary. Once the
+ <firstterm>eventually consistent</firstterm> with the primary by default.
+ The data visible to a transaction running on a standby can be
+ made <firstterm>causally consistent</> with respect to a transaction that
+ has completed on the primary by setting <varname>synchronous_replay</>
+ to <literal>on</> in both transactions. For more details,
+ see <xref linkend="synchronous-replay">.
+ </para>
+
+ <para>
+ Once the
commit record for a transaction is replayed on the standby, the changes
made by that transaction will be visible to any new snapshots taken on
the standby. Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index be3dc672bcc..c48243362c2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1790,6 +1790,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</itemizedlist>
</entry>
</row>
+ <row>
+ <entry><structfield>sync_replay</></entry>
+ <entry><type>text</></entry>
+ <entry>Synchronous replay state of this standby server. This field will be
+ non-null only if <varname>synchronous_replay_max_lag</> is set. If a standby is
+ in <literal>available</> state, then it can currently serve synchronous replay
+ queries. If it is not replaying fast enough or not responding to
+ keepalive messages, it will be in <literal>unavailable</> state, and if
+ it is currently transitioning to availability it will be
+ in <literal>joining</> state for a short time.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b0aa69fe4b4..deb14e346a5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5149,7 +5149,7 @@ XactLogCommitRecord(TimestampTz commit_time,
* Check if the caller would like to ask standbys for immediate feedback
* once this commit is applied.
*/
- if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+ if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0fdad0c1197..cc8b565386f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -732,7 +732,8 @@ CREATE VIEW pg_stat_replication AS
W.flush_lag,
W.replay_lag,
W.sync_priority,
- W.sync_state
+ W.sync_state,
+ W.sync_replay
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a0b0eecbd5e..b3074a6578e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3606,6 +3606,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_SYNC_REPLAY:
+ event_name = "SyncReplay";
+ break;
case WAIT_EVENT_LOGICAL_SYNC_DATA:
event_name = "LogicalSyncData";
break;
@@ -3640,6 +3643,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
case WAIT_EVENT_RECOVERY_APPLY_DELAY:
event_name = "RecoveryApplyDelay";
break;
+ case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+ event_name = "SyncReplayLeaseRevoke";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 898c497d12c..3eb79a0fd2b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1295,6 +1295,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
pq_sendint64(reply_message, writepos); /* apply */
pq_sendint64(reply_message, now); /* sendTime */
pq_sendbyte(reply_message, requestReply); /* replyRequested */
+ pq_sendint64(reply_message, -1); /* replyTo */
elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 5fd47689dd2..d794bef1d54 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
-static int SyncRepWakeQueue(bool all, int mode);
+static int SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
XLogRecPtr *flushPtr,
@@ -129,6 +138,229 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
+ * Check if we can stop waiting for synchronous replay. We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1. All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2. All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for. The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+ int *waitingFor,
+ long *stallTimeMillis)
+{
+ TimestampTz now = GetCurrentTimestamp();
+ TimestampTz stallTime = 0;
+ int i;
+
+ /* Count how many joining/available nodes we are waiting for. */
+ *waitingFor = 0;
+
+ for (i = 0; i < max_wal_senders; ++i)
+ {
+ WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0)
+ {
+ /*
+ * We need to hold the spinlock to read LSNs, because we can't be
+ * sure they can be read atomically.
+ */
+ SpinLockAcquire(&walsnd->mutex);
+ if (walsnd->pid != 0)
+ {
+ switch (walsnd->syncReplayState)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ /* Nothing to wait for. */
+ break;
+ case SYNC_REPLAY_JOINING:
+ case SYNC_REPLAY_AVAILABLE:
+ /*
+ * We have to wait until this standby tells us that it has
+ * replayed the commit record.
+ */
+ if (walsnd->apply < XactCommitLSN)
+ ++*waitingFor;
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /*
+ * We have to hold up commits until this standby
+ * acknowledges that its lease was revoked, or we know the
+ * most recently sent lease has expired anyway, whichever
+ * comes first. One way or the other, we don't release
+ * until this standby has started raising an error for
+ * synchronous replay transactions.
+ */
+ if (walsnd->revokingUntil > now)
+ {
+ ++*waitingFor;
+ stallTime = Max(stallTime, walsnd->revokingUntil);
+ }
+ break;
+ }
+ }
+ SpinLockRelease(&walsnd->mutex);
+ }
+ }
+
+ /*
+ * If a walsender has exited uncleanly, then it writes its revoking wait
+ * time into a shared space before it gives up its WalSnd slot. So we
+ * have to wait for that too.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ if (WalSndCtl->revokingUntil > now)
+ {
+ long seconds;
+ int usecs;
+
+ /* Compute how long we have to wait, rounded up to nearest ms. */
+ TimestampDifference(now, WalSndCtl->revokingUntil,
+ &seconds, &usecs);
+ *stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+ }
+ else
+ *stallTimeMillis = 0;
+ LWLockRelease(SyncRepLock);
+
+ /* We are done if we are not waiting for any nodes or stalls. */
+ return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all standbys in "available" and "joining" states to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked. By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ long stallTimeMillis;
+ int waitingFor;
+ char *ps_display_buffer = NULL;
+
+ for (;;)
+ {
+ /* Reset latch before checking state. */
+ ResetLatch(MyLatch);
+
+ /*
+ * Join the queue to be woken up if any synchronous replay
+ * joining/available standby applies XactCommitLSN or the set of
+ * synchronous replay standbys changes (if we aren't already in the
+ * queue). We don't actually know if we need to wait for any peers to
+ * reach the target LSN yet, but we have to register just in case
+ * before checking the walsenders' state to avoid a race condition
+ * that could occur if we did it after calling
+ * SyncReplayCommitCanReturn. (SyncRepWaitForLSN doesn't have
+ * to do this because it can check the highest-seen LSN in
+ * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+ * lock as the queues. We can't do that here, because there is no
+ * single highest-seen LSN that is useful. We must check
+ * walsnd->apply for all relevant walsenders. Therefore we must
+ * register for notifications first, so that we can be notified via
+ * our latch of any standby applying the LSN we're interested in after
+ * we check but before we start waiting, or we could wait forever for
+ * something that has already happened.)
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ if (MyProc->syncRepState != SYNC_REP_WAITING)
+ {
+ MyProc->waitLSN = XactCommitLSN;
+ MyProc->syncRepState = SYNC_REP_WAITING;
+ SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+ Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+ }
+ LWLockRelease(SyncRepLock);
+
+ /* Check if we're done. */
+ if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+ &stallTimeMillis))
+ {
+ SyncRepCancelWait();
+ break;
+ }
+
+ Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+ /* If we aren't actually waiting for any standbys, leave the queue. */
+ if (waitingFor == 0)
+ SyncRepCancelWait();
+
+ /* Update the ps title. */
+ if (update_process_title)
+ {
+ char buffer[80];
+
+ /* Remember the old value if this is our first update. */
+ if (ps_display_buffer == NULL)
+ {
+ int len;
+ const char *ps_display = get_ps_display(&len);
+
+ ps_display_buffer = palloc(len + 1);
+ memcpy(ps_display_buffer, ps_display, len);
+ ps_display_buffer[len] = '\0';
+ }
+
+ snprintf(buffer, sizeof(buffer),
+ "waiting for %d peer(s) to apply %X/%X%s",
+ waitingFor,
+ (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+ stallTimeMillis > 0 ? " (revoking)" : "");
+ set_ps_display(buffer, false);
+ }
+
+ /* Check if we need to exit early due to postmaster death etc. */
+ if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+ break;
+
+ /*
+ * If we are still waiting for peers, then we wait for any joining or
+ * available peer to reach the LSN (or possibly stop being in one of
+ * those states or go away).
+ *
+ * If not, there must be a non-zero stall time, so we wait for that to
+ * elapse.
+ */
+ if (waitingFor > 0)
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+ WAIT_EVENT_SYNC_REPLAY);
+ else
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+ stallTimeMillis,
+ WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+ }
+
+ /* There is no way out of the loop that could leave us in the queue. */
+ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+ MyProc->waitLSN = 0;
+
+ /* Restore the ps display. */
+ if (ps_display_buffer != NULL)
+ {
+ set_ps_display(ps_display_buffer, false);
+ pfree(ps_display_buffer);
+ }
+}
+
+/*
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
@@ -149,11 +381,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
const char *old_status;
int mode;
- /* Cap the level for anything other than commit to remote flush only. */
- if (commit)
- mode = SyncRepWaitMode;
- else
- mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+ /* Wait for synchronous replay, if configured. */
+ if (synchronous_replay)
+ SyncReplayWaitForLSN(lsn);
/*
* Fast exit if user has not requested sync replication, or there are no
@@ -169,6 +399,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+ /* Cap the level for anything other than commit to remote flush only. */
+ if (commit)
+ mode = SyncRepWaitMode;
+ else
+ mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
/*
* We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
* set. See SyncRepUpdateSyncStandbysDefined.
@@ -229,57 +465,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
break;
- /*
- * If a wait for synchronous replication is pending, we can neither
- * acknowledge the commit nor raise ERROR or FATAL. The latter would
- * lead the client to believe that the transaction aborted, which is
- * not true: it's already committed locally. The former is no good
- * either: the client has requested synchronous replication, and is
- * entitled to assume that an acknowledged commit is also replicated,
- * which might not be true. So in this case we issue a WARNING (which
- * some clients may be able to interpret) and shut off further output.
- * We do NOT reset ProcDiePending, so that the process will die after
- * the commit is cleaned up.
- */
- if (ProcDiePending)
- {
- ereport(WARNING,
- (errcode(ERRCODE_ADMIN_SHUTDOWN),
- errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
+ /* Check if we need to break early due to cancel/shutdown/death. */
+ if (SyncRepCheckForEarlyExit())
break;
- }
-
- /*
- * It's unclear what to do if a query cancel interrupt arrives. We
- * can't actually abort at this point, but ignoring the interrupt
- * altogether is not helpful, so we just terminate the wait with a
- * suitable warning.
- */
- if (QueryCancelPending)
- {
- QueryCancelPending = false;
- ereport(WARNING,
- (errmsg("canceling wait for synchronous replication due to user request"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- SyncRepCancelWait();
- break;
- }
-
- /*
- * If the postmaster dies, we'll probably never get an
- * acknowledgement, because all the wal sender processes will exit. So
- * just bail out.
- */
- if (!PostmasterIsAlive())
- {
- ProcDiePending = true;
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
- break;
- }
/*
* Wait on latch. Any condition that should wake us up will set the
@@ -399,6 +587,53 @@ SyncRepInitConfig(void)
}
/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ bool found = false;
+
+ /* If the feature is disabled, then no. */
+ if (synchronous_replay_max_lag == 0)
+ return false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(synchronous_replay_standby_names);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ /* GUC machinery will have already complained - no need to do again */
+ return false;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return found;
+}
+
+/*
* Update the LSNs on each queue based upon our latest state. This
* implements a simple policy of first-valid-sync-standby-releases-waiter.
*
@@ -406,7 +641,7 @@ SyncRepInitConfig(void)
* perhaps also which information we store as well.
*/
void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool walsender_sr_blocker)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
XLogRecPtr writePtr;
@@ -420,13 +655,15 @@ SyncRepReleaseWaiters(void)
/*
* If this WALSender is serving a standby that is not on the list of
- * potential sync standbys then we have nothing to do. If we are still
- * starting up, still running base backup or the current flush position is
- * still invalid, then leave quickly also.
+ * potential sync standbys and not in a state that synchronous_replay waits
+ * for, then we have nothing to do. If we are still starting up, still
+ * running base backup or the current flush position is still invalid,
+ * then leave quickly also.
*/
- if (MyWalSnd->sync_standby_priority == 0 ||
- MyWalSnd->state < WALSNDSTATE_STREAMING ||
- XLogRecPtrIsInvalid(MyWalSnd->flush))
+ if (!walsender_sr_blocker &&
+ (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING ||
+ XLogRecPtrIsInvalid(MyWalSnd->flush)))
{
announce_next_takeover = true;
return;
@@ -464,9 +701,10 @@ SyncRepReleaseWaiters(void)
/*
* If the number of sync standbys is less than requested or we aren't
- * managing a sync standby then just leave.
+ * managing a sync standby or a standby in synchronous replay state that
+ * blocks then just leave.
*/
- if (!got_recptr || !am_sync)
+ if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = !am_sync;
@@ -475,24 +713,36 @@ SyncRepReleaseWaiters(void)
/*
* Set the lsn first so that when we wake backends they will release up to
- * this location.
+ * this location, for backends waiting for synchronous commit.
*/
- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ if (got_recptr && am_sync)
{
- walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
- numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
- numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
- numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+ if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+ numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+ numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+ numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+ }
}
+ /*
+ * Wake backends that are waiting for synchronous_replay, if this walsender
+ * manages a standby that is in synchronous replay 'available' or 'joining'
+ * state.
+ */
+ if (walsender_sr_blocker)
+ SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+ MyWalSnd->apply);
+
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -970,9 +1220,8 @@ SyncRepGetStandbyPriority(void)
* Must hold SyncRepLock.
*/
static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
{
- volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
@@ -989,7 +1238,7 @@ SyncRepWakeQueue(bool all, int mode)
/*
* Assume the queue is ordered by LSN
*/
- if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+ if (!all && lsn < proc->waitLSN)
return numprocs;
/*
@@ -1049,7 +1298,7 @@ SyncRepUpdateSyncStandbysDefined(void)
int i;
for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
- SyncRepWakeQueue(true, i);
+ SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
}
/*
@@ -1100,6 +1349,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
}
#endif
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+ /*
+ * If a wait for synchronous replication is pending, we can neither
+ * acknowledge the commit nor raise ERROR or FATAL. The latter would
+ * lead the client to believe that the transaction aborted, which is
+ * not true: it's already committed locally. The former is no good
+ * either: the client has requested synchronous replication, and is
+ * entitled to assume that an acknowledged commit is also replicated,
+ * which might not be true. So in this case we issue a WARNING (which
+ * some clients may be able to interpret) and shut off further output.
+ * We do NOT reset ProcDiePending, so that the process will die after
+ * the commit is cleaned up.
+ */
+ if (ProcDiePending)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * It's unclear what to do if a query cancel interrupt arrives. We
+ * can't actually abort at this point, but ignoring the interrupt
+ * altogether is not helpful, so we just terminate the wait with a
+ * suitable warning.
+ */
+ if (QueryCancelPending)
+ {
+ QueryCancelPending = false;
+ ereport(WARNING,
+ (errmsg("canceling wait for synchronous replication due to user request"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * If the postmaster dies, we'll probably never get an
+ * acknowledgement, because all the wal sender processes will exit. So
+ * just bail out.
+ */
+ if (!PostmasterIsAlive())
+ {
+ ProcDiePending = true;
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ return false;
+}
+
/*
* ===========================================================
* Synchronous Replication functions executed by any process
@@ -1169,6 +1476,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
SyncRepConfig = (SyncRepConfigData *) extra;
}
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+ char *rawstring;
+ List *elemlist;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(*newval);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ GUC_check_errdetail("List syntax is invalid.");
+ pfree(rawstring);
+ list_free(elemlist);
+ return false;
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return true;
+}
+
void
assign_synchronous_commit(int newval, void *extra)
{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8a249e22b9f..c467a32d306 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/ipc.h"
@@ -139,9 +140,10 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *syncReplayLease);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -466,7 +468,7 @@ WalReceiverMain(void)
}
/* Let the master know that we received some data. */
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
/*
* If we've written some records, flush them to disk and
@@ -511,7 +513,7 @@ WalReceiverMain(void)
*/
walrcv->force_reply = false;
pg_memory_barrier();
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, -1);
}
}
if (rc & WL_POSTMASTER_DEATH)
@@ -569,7 +571,7 @@ WalReceiverMain(void)
}
}
- XLogWalRcvSendReply(requestReply, requestReply);
+ XLogWalRcvSendReply(requestReply, requestReply, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -874,6 +876,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
XLogRecPtr walEnd;
TimestampTz sendTime;
bool replyRequested;
+ TimestampTz syncReplayLease;
+ int64 messageNumber;
resetStringInfo(&incoming_message);
@@ -893,7 +897,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
dataStart = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, NULL);
buf += hdrlen;
len -= hdrlen;
@@ -903,7 +907,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
case 'k': /* Keepalive */
{
/* copy message to StringInfo */
- hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+ hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+ sizeof(char) + sizeof(int64);
if (len != hdrlen)
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -911,15 +916,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
appendBinaryStringInfo(&incoming_message, buf, hdrlen);
/* read the fields */
+ messageNumber = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
replyRequested = pq_getmsgbyte(&incoming_message);
+ syncReplayLease = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
/* If the primary requested a reply, send one immediately */
if (replyRequested)
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, messageNumber);
break;
}
default:
@@ -1082,7 +1089,7 @@ XLogWalRcvFlush(bool dying)
/* Also let the master know that we made some progress */
if (!dying)
{
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -1100,9 +1107,12 @@ XLogWalRcvFlush(bool dying)
* If 'requestReply' is true, requests the server to reply immediately upon
* receiving this message. This is used for heartbearts, when approaching
* wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should include the message number, otherwise -1.
*/
static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
{
static XLogRecPtr writePtr = 0;
static XLogRecPtr flushPtr = 0;
@@ -1149,6 +1159,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
pq_sendint64(&reply_message, applyPtr);
pq_sendint64(&reply_message, GetCurrentTimestamp());
pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+ pq_sendint64(&reply_message, replyTo);
/* Send it */
elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1281,10 +1292,13 @@ XLogWalRcvSendHSFeedback(bool immed)
* Update shared memory status upon receiving a message from primary.
*
* 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary. 'syncReplayLease' is a pointer to the time
+ * the primary promises that this standby can safely claim to be causally
+ * consistent, or 0 if it cannot, or a NULL pointer for no change.
*/
static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *syncReplayLease)
{
WalRcvData *walrcv = WalRcv;
@@ -1297,6 +1311,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
walrcv->latestWalEnd = walEnd;
walrcv->lastMsgSendTime = sendTime;
walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+ if (syncReplayLease != NULL)
+ walrcv->syncReplayLease = *syncReplayLease;
SpinLockRelease(&walrcv->mutex);
if (log_min_messages <= DEBUG2)
@@ -1334,7 +1350,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
* This is called by the startup process whenever interesting xlog records
* are applied, so that walreceiver can check if it needs to send an apply
* notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
*/
void
WalRcvForceReply(void)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 8ed7254b5c6..dec98eb48c8 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
#include "replication/walreceiver.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/guc.h"
#include "utils/timestamp.h"
WalRcvData *WalRcv = NULL;
@@ -373,3 +374,21 @@ GetReplicationTransferLatency(void)
return ms;
}
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+ WalRcvData *walrcv = WalRcv;
+ TimestampTz now = GetCurrentTimestamp();
+ bool result;
+
+ SpinLockAcquire(&walrcv->mutex);
+ result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+ SpinLockRelease(&walrcv->mutex);
+
+ return result;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f845180873e..9563a87b08d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -167,9 +167,23 @@ static StringInfoData tmpbuf;
*/
static TimestampTz last_reply_timestamp = 0;
+static TimestampTz last_keepalive_timestamp = 0;
+
/* Have we sent a heartbeat message asking for reply, since last reply? */
static bool waiting_for_ping_response = false;
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
/*
* While streaming WAL in Copy mode, streamingDoneSending is set to true
* after we have sent CopyDone. We should not send any more CopyData messages
@@ -239,7 +253,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
static void WalSndKeepaliveIfNecessary(TimestampTz now);
static void WalSndCheckTimeOut(TimestampTz now);
static long WalSndComputeSleeptime(TimestampTz now);
@@ -281,6 +295,61 @@ InitWalSender(void)
}
/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+ if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+ {
+ /*
+ * We've lost contact with the standby, but it may still be alive. We
+ * can't let any committing synchronous_replay transactions return
+ * control until we've stalled for long enough for a zombie standby to
+ * start raising errors because its lease has expired. Because our
+ * WalSnd slot is going away, we need to use the shared
+ * WalSndCtl->revokingUntil variable.
+ */
+ elog(LOG,
+ "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+ application_name);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+ synchronous_replay_last_lease);
+ LWLockRelease(SyncRepLock);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+ if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+ {
+ /*
+ * The standby is shutting down, so it won't be running any more
+ * transactions. It is therefore safe to stop waiting for it without
+ * any kind of lease revocation protocol.
+ */
+ elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
* Clean up after an error.
*
* WAL sender processes don't use transactions like regular backends do.
@@ -308,7 +377,10 @@ WalSndErrorCleanup(void)
replication_active = false;
if (got_STOPPING || got_SIGUSR2)
+ {
+ PrepareUncleanExit();
proc_exit(0);
+ }
/* Revert back to startup state */
WalSndSetState(WALSNDSTATE_STARTUP);
@@ -320,6 +392,8 @@ WalSndErrorCleanup(void)
static void
WalSndShutdown(void)
{
+ PrepareUncleanExit();
+
/*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -1578,6 +1652,7 @@ ProcessRepliesIfAny(void)
if (r < 0)
{
/* unexpected error or EOF */
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1594,6 +1669,7 @@ ProcessRepliesIfAny(void)
resetStringInfo(&reply_message);
if (pq_getmessage(&reply_message, 0))
{
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1643,6 +1719,7 @@ ProcessRepliesIfAny(void)
* 'X' means that the standby is closing down the socket.
*/
case 'X':
+ PrepareCleanExit();
proc_exit(0);
default:
@@ -1740,9 +1817,11 @@ ProcessStandbyReplyMessage(void)
flushLag,
applyLag;
bool clearLagTimes;
+ int64 replyTo;
TimestampTz now;
static bool fullyAppliedLastTime = false;
+ static TimestampTz fullyAppliedSince = 0;
/* the caller already consumed the msgtype byte */
writePtr = pq_getmsgint64(&reply_message);
@@ -1750,6 +1829,7 @@ ProcessStandbyReplyMessage(void)
applyPtr = pq_getmsgint64(&reply_message);
(void) pq_getmsgint64(&reply_message); /* sendTime; not used ATM */
replyRequested = pq_getmsgbyte(&reply_message);
+ replyTo = pq_getmsgint64(&reply_message);
elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
(uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1764,17 +1844,17 @@ ProcessStandbyReplyMessage(void)
applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
/*
- * If the standby reports that it has fully replayed the WAL in two
- * consecutive reply messages, then the second such message must result
- * from wal_receiver_status_interval expiring on the standby. This is a
- * convenient time to forget the lag times measured when it last
- * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
- * until more WAL traffic arrives.
+ * If the standby reports that it has fully replayed the WAL for at least
+ * 10 seconds, then let's clear the lag times that were measured when it
+ * last wrote/flushed/applied a WAL record. This way we avoid displaying
+ * stale lag data until more WAL traffic arrives.
*/
clearLagTimes = false;
if (applyPtr == sentPtr)
{
- if (fullyAppliedLastTime)
+ if (!fullyAppliedLastTime)
+ fullyAppliedSince = now;
+ else if (now - fullyAppliedSince >= 10000000) /* 10 seconds */
clearLagTimes = true;
fullyAppliedLastTime = true;
}
@@ -1790,8 +1870,53 @@ ProcessStandbyReplyMessage(void)
* standby.
*/
{
+ int next_sr_state = -1;
WalSnd *walsnd = MyWalSnd;
+ /* Handle synchronous replay state machine. */
+ if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+ {
+ bool replay_lag_acceptable;
+
+ /* Check if the lag is acceptable (includes -1 for caught up). */
+ if (applyLag < synchronous_replay_max_lag * 1000)
+ replay_lag_acceptable = true;
+ else
+ replay_lag_acceptable = false;
+
+ /* Figure out the next state, if it needs to change. */
+ switch (walsnd->syncReplayState)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ /* Can we join? */
+ if (replay_lag_acceptable)
+ next_sr_state = SYNC_REPLAY_JOINING;
+ break;
+ case SYNC_REPLAY_JOINING:
+ /* Are we still applying fast enough? */
+ if (replay_lag_acceptable)
+ {
+ /* Have we reached the join point yet? */
+ if (applyPtr >= synchronous_replay_joining_until)
+ next_sr_state = SYNC_REPLAY_AVAILABLE;
+ }
+ else
+ next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+ break;
+ case SYNC_REPLAY_AVAILABLE:
+ /* Are we still applying fast enough? */
+ if (!replay_lag_acceptable)
+ next_sr_state = SYNC_REPLAY_REVOKING;
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /* Has the revocation been acknowledged or timed out? */
+ if (replyTo == synchronous_replay_revoke_msgno ||
+ now >= walsnd->revokingUntil)
+ next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+ break;
+ }
+ }
+
SpinLockAcquire(&walsnd->mutex);
walsnd->write = writePtr;
walsnd->flush = flushPtr;
@@ -1802,11 +1927,55 @@ ProcessStandbyReplyMessage(void)
walsnd->flushLag = flushLag;
if (applyLag != -1 || clearLagTimes)
walsnd->applyLag = applyLag;
+ if (next_sr_state != -1)
+ walsnd->syncReplayState = next_sr_state;
+ if (next_sr_state == SYNC_REPLAY_REVOKING)
+ walsnd->revokingUntil = synchronous_replay_last_lease;
SpinLockRelease(&walsnd->mutex);
+
+ /*
+ * Post shmem-update actions for synchronous replay state transitions.
+ */
+ switch (next_sr_state)
+ {
+ case SYNC_REPLAY_JOINING:
+ /*
+ * Now that we've started waiting for this standby, we need to
+ * make sure that everything flushed before now has been applied
+ * before we move to available and issue a lease.
+ */
+ synchronous_replay_joining_until = GetFlushRecPtr();
+ ereport(LOG,
+ (errmsg("standby \"%s\" joining synchronous replay set...",
+ application_name)));
+ break;
+ case SYNC_REPLAY_AVAILABLE:
+ /* Issue a new lease to the standby. */
+ WalSndKeepalive(false);
+ ereport(LOG,
+ (errmsg("standby \"%s\" is available for synchronous replay",
+ application_name)));
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /* Revoke the standby's lease, and note the message number. */
+ synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+ ereport(LOG,
+ (errmsg("revoking synchronous replay lease for standby \"%s\"...",
+ application_name)));
+ break;
+ case SYNC_REPLAY_UNAVAILABLE:
+ ereport(LOG,
+ (errmsg("standby \"%s\" is no longer available for synchronous replay",
+ application_name)));
+ break;
+ default:
+ /* No change. */
+ break;
+ }
}
if (!am_cascading_walsender)
- SyncRepReleaseWaiters();
+ SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
/*
* Advance our local xmin horizon when the client confirmed a flush.
@@ -1996,33 +2165,52 @@ ProcessStandbyHSFeedbackMessage(void)
* If wal_sender_timeout is enabled we want to wake up in time to send
* keepalives and to abort the connection if wal_sender_timeout has been
* reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
*/
static long
WalSndComputeSleeptime(TimestampTz now)
{
long sleeptime = 10000; /* 10 s */
- if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+ if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+ am_potential_synchronous_replay_standby)
{
TimestampTz wakeup_time;
long sec_to_timeout;
int microsec_to_timeout;
- /*
- * At the latest stop sleeping once wal_sender_timeout has been
- * reached.
- */
- wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
-
- /*
- * If no ping has been sent yet, wakeup when it's time to do so.
- * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
- * the timeout passed without a response.
- */
- if (!waiting_for_ping_response)
+ if (am_potential_synchronous_replay_standby)
+ {
+ /*
+ * We need to keep replacing leases before they expire. We'll do
+ * that halfway through the lease time according to our clock, to
+ * allow for the standby's clock to be ahead of the primary's by
+ * 25% of synchronous_replay_lease_time.
+ */
+ wakeup_time =
+ TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ synchronous_replay_lease_time / 2);
+ }
+ else
+ {
+ /*
+ * At the latest stop sleeping once wal_sender_timeout has been
+ * reached.
+ */
wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ wal_sender_timeout);
+
+ /*
+ * If no ping has been sent yet, wakeup when it's time to do so.
+ * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+ * half of the timeout passed without a response.
+ */
+ if (!waiting_for_ping_response)
+ wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
+ }
/* Compute relative time until wakeup. */
TimestampDifference(now, wakeup_time,
@@ -2038,20 +2226,33 @@ WalSndComputeSleeptime(TimestampTz now)
/*
* Check whether there have been responses by the client within
* wal_sender_timeout and shutdown if not.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
*/
static void
WalSndCheckTimeOut(TimestampTz now)
{
TimestampTz timeout;
+ int allowed_time;
/* don't bail out if we're doing something that doesn't require timeouts */
if (last_reply_timestamp <= 0)
return;
- timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
+ /*
+ * If synchronous replay support is configured, we use
+ * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+ * the time before an unresponsive synchronous replay standby is dropped.
+ */
+ if (am_potential_synchronous_replay_standby)
+ allowed_time = synchronous_replay_lease_time;
+ else
+ allowed_time = wal_sender_timeout;
- if (wal_sender_timeout > 0 && now >= timeout)
+ timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ allowed_time);
+ if (allowed_time > 0 && now >= timeout)
{
/*
* Since typically expiration of replication timeout means
@@ -2079,6 +2280,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
/* Report to pgstat that this process is running */
pgstat_report_activity(STATE_RUNNING, NULL);
+ /* Check if we are managing a potential synchronous replay standby. */
+ am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
@@ -2243,6 +2447,7 @@ InitWalSenderSlot(void)
walsnd->flushLag = -1;
walsnd->applyLag = -1;
walsnd->state = WALSNDSTATE_STARTUP;
+ walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
walsnd->latch = &MyProc->procLatch;
SpinLockRelease(&walsnd->mutex);
/* don't need the lock anymore */
@@ -3125,6 +3330,27 @@ WalSndGetStateString(WalSndState state)
return "UNKNOWN";
}
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+ switch (state)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ return "unavailable";
+ case SYNC_REPLAY_JOINING:
+ return "joining";
+ case SYNC_REPLAY_AVAILABLE:
+ return "available";
+ case SYNC_REPLAY_REVOKING:
+ return "revoking";
+ }
+ return "UNKNOWN";
+}
+
static Interval *
offset_to_interval(TimeOffset offset)
{
@@ -3144,7 +3370,7 @@ offset_to_interval(TimeOffset offset)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 11
+#define PG_STAT_GET_WAL_SENDERS_COLS 12
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -3197,6 +3423,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
TimeOffset applyLag;
int priority;
WalSndState state;
+ SyncReplayState syncReplayState;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -3206,6 +3433,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
SpinLockAcquire(&walsnd->mutex);
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ syncReplayState = walsnd->syncReplayState;
write = walsnd->write;
flush = walsnd->flush;
apply = walsnd->apply;
@@ -3288,6 +3516,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[10] = CStringGetTextDatum("potential");
+
+ values[11] =
+ CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3303,21 +3534,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* This function is used to send a keepalive message to standby.
* If requestReply is set, sets a flag in the message requesting the standby
* to send a message back to us, for heartbeat purposes.
+ * Return the serial number of the message that was sent.
*/
-static void
+static int64
WalSndKeepalive(bool requestReply)
{
+ TimestampTz synchronous_replay_lease;
+ TimestampTz now;
+
+ static int64 message_number = 0;
+
elog(DEBUG2, "sending replication keepalive");
+ /* Grant a synchronous replay lease if appropriate. */
+ now = GetCurrentTimestamp();
+ if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+ {
+ /* No lease granted, and any earlier lease is revoked. */
+ synchronous_replay_lease = 0;
+ }
+ else
+ {
+ /*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew. We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ *
+ * 1. If the standby's clock is slow (ie behind the primary's) by up
+ * to that much, then subtracting this amount will make sure the
+ * lease doesn't survive past that time according to the primary's
+ * clock.
+ *
+ * 2. If the standby's clock is fast (ie ahead of the primary's) by
+ * up to that much, then by subtracting this amount there won't be any
+ * gaps between leases, since leases are reissued every time 50% of
+ * the lease time elapses (see WalSndKeepaliveIfNecessary and
+ * WalSndComputeSleeptime).
+ */
+ int max_clock_skew = synchronous_replay_lease_time / 4;
+
+ /* Compute and remember the expiry time of the lease we're granting. */
+ synchronous_replay_last_lease =
+ TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+ /* Adjust the version we send for clock skew. */
+ synchronous_replay_lease =
+ TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+ -max_clock_skew);
+ }
+
/* construct the message... */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'k');
+ pq_sendint64(&output_message, ++message_number);
pq_sendint64(&output_message, sentPtr);
- pq_sendint64(&output_message, GetCurrentTimestamp());
+ pq_sendint64(&output_message, now);
pq_sendbyte(&output_message, requestReply ? 1 : 0);
+ pq_sendint64(&output_message, synchronous_replay_lease);
/* ... and send it wrapped in CopyData */
pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+ return message_number;
}
/*
@@ -3332,23 +3611,35 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
*/
- if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
- return;
-
- if (waiting_for_ping_response)
- return;
+ if (!am_potential_synchronous_replay_standby)
+ {
+ if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+ return;
+ if (waiting_for_ping_response)
+ return;
+ }
/*
* If half of wal_sender_timeout has lapsed without receiving any reply
* from the standby, send a keep-alive message to the standby requesting
* an immediate reply.
+ *
+ * If synchronous replay has been configured, use
+ * synchronous_replay_lease_time to control keepalive intervals rather
+ * than wal_sender_timeout, so that we can keep replacing leases at the
+ * right frequency.
*/
- ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ if (am_potential_synchronous_replay_standby)
+ ping_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ synchronous_replay_lease_time / 2);
+ else
+ ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
if (now >= ping_time)
{
WalSndKeepalive(true);
waiting_for_ping_response = true;
+ last_keepalive_timestamp = now;
/* Try to flush pending output to the client */
if (pq_flush_if_writable() != 0)
@@ -3388,7 +3679,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
*/
new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
buffer_full = false;
- for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+ for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
{
if (new_write_head == LagTracker.read_heads[i])
buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 4f354717628..d1751f6e0c0 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -307,6 +307,7 @@ Section: Class 40 - Transaction Rollback
40001 E ERRCODE_T_R_SERIALIZATION_FAILURE serialization_failure
40003 E ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN statement_completion_unknown
40P01 E ERRCODE_T_R_DEADLOCK_DETECTED deadlock_detected
+40P02 E ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE synchronous_replay_not_available
Section: Class 42 - Syntax Error or Access Rule Violation
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 82e54c084b8..1832bdf4de0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1647,6 +1647,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+ gettext_noop("Enables synchronous replay."),
+ NULL
+ },
+ &synchronous_replay,
+ false,
+ NULL, NULL, NULL
+ },
+
+ {
{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
NULL
@@ -2885,6 +2895,28 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &synchronous_replay_max_lag,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &synchronous_replay_lease_time,
+ 5000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3567,6 +3599,17 @@ static struct config_string ConfigureNamesString[] =
},
{
+ {"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("List of names of potential synchronous replay standbys."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &synchronous_replay_standby_names,
+ "*",
+ check_synchronous_replay_standby_names, NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2b1ebb797ec..e6dbcb58bbd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -250,6 +250,17 @@
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
+#synchronous_replay_max_lag = 0s # maximum replication delay to tolerate from
+ # standbys before dropping them from the synchronous
+ # replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s # how long individual leases granted to
+ # synchronous replay standbys should last; should be 4 times
+ # the max possible clock skew
+
+#synchronous_replay_standby_names = '*' # standby servers that can join the
+ # synchronous replay set; '*' = all
+
# - Standby Servers -
# These settings are ignored on a master server.
@@ -279,6 +290,14 @@
#max_logical_replication_workers = 4 # taken from max_worker_processes
#max_sync_workers_per_subscription = 2 # taken from max_logical_replication_workers
+# - All Servers -
+
+#synchronous_replay = off # "on" in any pair of consecutive
+ # transactions guarantees that the second
+ # can see the first (even if the second
+ # is run on a standby), or will raise an
+ # error to report that the standby is
+ # unavailable for synchronous replay
#------------------------------------------------------------------------------
# QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 08a08c8e8fc..55aef58fcd2 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -332,6 +334,17 @@ GetTransactionSnapshot(void)
"cannot take query snapshot during a parallel operation");
/*
+ * In synchronous_replay mode on a standby, check if we have definitely
+ * applied WAL for any COMMIT that returned successfully on the
+ * primary.
+ */
+ if (synchronous_replay && RecoveryInProgress() &&
+ !WalRcvSyncReplayAvailable())
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+ errmsg("standby is not available for synchronous replay")));
+
+ /*
* In transaction-snapshot mode, the first snapshot must live until
* end of xact regardless of what the caller does with it, so we must
* make a copy of it rather than returning CurrentSnapshotData
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 6811a55e764..02eaf97247f 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -117,7 +117,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
/*
@@ -150,6 +150,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
startpos = output_written_lsn;
last_written_lsn = output_written_lsn;
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 15932c60b5a..501ecc849d1 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -325,7 +325,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
static bool
sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
{
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
replybuf[len] = 'r';
@@ -343,6 +343,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
{
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 8b33b4e0ea7..106f87989fb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2832,7 +2832,7 @@ DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f
DESCR("statistics: information about currently active backends");
DATA(insert OID = 3318 ( pg_stat_get_progress_info PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 3317 ( pg_stat_get_wal_receiver PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
DESCR("statistics: information about WAL receiver");
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6bffe63ad6b..69e8c5bbc1b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -811,6 +811,7 @@ typedef enum
WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_SYNC_REPLAY,
WAIT_EVENT_LOGICAL_SYNC_DATA,
WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
} WaitEventIPC;
@@ -825,7 +826,8 @@ typedef enum
{
WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
WAIT_EVENT_PG_SLEEP,
- WAIT_EVENT_RECOVERY_APPLY_DELAY
+ WAIT_EVENT_RECOVERY_APPLY_DELAY,
+ WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
} WaitEventTimeout;
/* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ceafe2cbea1..e2bc88f7c23 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "utils/guc.h"
+#include "utils/timestamp.h"
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
#define SYNC_REP_WAIT_WRITE 0
#define SYNC_REP_WAIT_FLUSH 1
#define SYNC_REP_WAIT_APPLY 2
+#define SYNC_REP_WAIT_SYNC_REPLAY 3
-#define NUM_SYNC_REP_WAIT_MODE 3
+#define NUM_SYNC_REP_WAIT_MODE 4
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
@@ -36,6 +38,12 @@
#define SYNC_REP_PRIORITY 0
#define SYNC_REP_QUORUM 1
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
/*
* Struct for the configuration of synchronous replication.
*
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
/* called by wal sender */
extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
/* called by wal sender and user backend */
extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
/* called by checkpointer */
extern void SyncRepUpdateSyncStandbysDefined(void);
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
/* GUC infrastructure */
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index c8652dbd489..0e396def022 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
TimeLineID receivedTLI;
/*
+ * syncReplayLease is the time until which the primary has authorized this
+ * standby to consider itself available for synchronous_replay mode, or 0
+ * for not authorized.
+ */
+ TimestampTz syncReplayLease;
+
+ /*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
* receivedUpto before the last flush to disk. Startup process can use
@@ -298,4 +305,6 @@ extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
+extern bool WalRcvSyncReplayAvailable(void);
+
#endif /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0aa80d5c3e2..ac025ad535b 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
WALSNDSTATE_STOPPING
} WalSndState;
+typedef enum SyncReplayState
+{
+ SYNC_REPLAY_UNAVAILABLE = 0,
+ SYNC_REPLAY_JOINING,
+ SYNC_REPLAY_AVAILABLE,
+ SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
/*
* Each walsender has a WalSnd struct in shared memory.
*/
@@ -53,6 +61,10 @@ typedef struct WalSnd
TimeOffset flushLag;
TimeOffset applyLag;
+ /* Synchronous replay state for this walsender. */
+ SyncReplayState syncReplayState;
+ TimestampTz revokingUntil;
+
/* Protects shared variables shown above. */
slock_t mutex;
@@ -94,6 +106,14 @@ typedef struct
*/
bool sync_standbys_defined;
+ /*
+ * Until when must commits in synchronous replay stall? This is used to
+ * wait for synchronous replay leases to expire when a walsender exits
+ * uncleanly, and we must stall synchronous replay commits until we're
+ * sure that the remote server's lease has expired.
+ */
+ TimestampTz revokingUntil;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2e42b9ec05f..9df755a72ad 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1859,9 +1859,10 @@ pg_stat_replication| SELECT s.pid,
w.flush_lag,
w.replay_lag,
w.sync_priority,
- w.sync_state
+ w.sync_state,
+ w.sync_replay
FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
- JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+ JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_ssl| SELECT s.pid,
s.ssl,
On Tue, Jun 27, 2017 at 12:20 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
What I think we need is a joined up plan for load balancing, so that
we can understand how it will work. i.e. explain the whole use case
and how the solution works.
Here's a proof-of-concept hack of the sort of routing and retry logic
that I think should be feasible with various modern application stacks
(given the right extensions):
https://github.com/macdice/py-pgsync/blob/master/DemoSyncPool.py
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 23 June 2017 at 13:48, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
Apologies for the extended delay. Here is the rebased patch, now with a
couple of improvements (see below).
Thank you. I started to play with it a little bit, since I think it's an
interesting idea. And there are already few notes:
* I don't see a CF item for that, where is it?
* Looks like there is a sort of sensitive typo in `postgresql.conf.sample`:
```
+#causal_reads_standy_names = '*' # standby servers that can potentially become
+                                 # available for causal reads; '*' = all
+
```
it should be `causal_reads_standby_names`. Also I hope in the nearest future
I can provide a full review.
On Thu, Jul 13, 2017 at 2:51 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Thank you. I started to play with it a little bit, since I think it's an
interesting idea. And there are already few notes:
Thanks Dmitry.
* I don't see a CF item for that, where is it?
https://commitfest.postgresql.org/14/951/
The latest version of the patch is here:
/messages/by-id/CAEepm=0YigNQczAF-=x_SxT6cJv77Yb0EO+cAFnqRyVu4+bKFw@mail.gmail.com
I renamed it to "synchronous replay", because "causal reads" seemed a
bit too arcane.
* Looks like there is a sort of sensitive typo in `postgresql.conf.sample`:
```
+#causal_reads_standy_names = '*' # standby servers that can potentially become
+                                 # available for causal reads; '*' = all
+
```
it should be `causal_reads_standby_names`.
Fixed in latest version (while renaming).
Also I hope in the nearest future I can provide a full review.
Great news, thanks!
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 3 January 2017 at 01:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Here is a new version of my "causal reads" patch (see the earlier
thread from the 9.6 development cycle[1]), which provides a way to
avoid stale reads when load balancing with streaming replication.

I'm very happy that you are addressing this topic.

I noticed you didn't put in links my earlier doubts about this
specific scheme, though I can see doubts from myself and Heikki at
least in the URLs. I maintain those doubts as to whether this is the
right way forwards.

This patch presumes we will load balance writes to a master and reads
to a pool of standbys. How will we achieve that?

1. We decorate the application with additional info to indicate
routing/write concerns.
2. We get middleware to do routing for us, e.g. pgpool style read/write routing.

The explicit premise of the patch is that neither of the above options
are practical, so I'm unclear how this makes sense. Is there some use
case that you have in mind that has not been fully described? If so,
let's get it on the table.

What I think we need is a joined up plan for load balancing, so that
we can understand how it will work. i.e. explain the whole use case
and how the solution works.
Simon,
Here's a simple proof-of-concept Java web service using Spring Boot
that demonstrates how load balancing could be done with this patch.
It show two different techniques for routing: an "adaptive" one that
learns which transactional methods need to run on the primary server
by intercepting errors, and a "declarative" one that respects Spring's
@Transactional(readOnly=true) annotations (inspired by the way people
use MySQL Connector/J with Spring to do load balancing). Whole
transactions are automatically retried at the service request level on
transient failures using existing techniques (Spring Retry, as used
for handling deadlocks and serialisation failures etc), and the
"TransactionRouter" avoids servers that have recently raised the
"synchronous_replay not available" error. Aside from the optional
annotations, the application code in KeyValueController.java is
unaware of any of this.
https://github.com/macdice/syncreplay-spring-demo
I suspect you could find ways to do similar things with basically any
application development stack that supports some kind of container
managed transactions.
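To make the same idea concrete outside any particular framework, here is a
minimal, purely illustrative libpq sketch (the connection strings and
function name are made up, and a real application would retry the whole
transaction rather than a single query): it turns synchronous_replay on for
the session and falls back to the next standby when the server raises
SQLSTATE 40P02, the synchronous_replay_not_available error code this patch
adds.
```
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Hypothetical pool of read-only standbys; the conninfo strings are made up. */
static const char *standbys[] = {
    "host=standby1 dbname=test",
    "host=standby2 dbname=test"
};

/*
 * Run a read-only query with synchronous_replay = on, falling back to the
 * next standby when the current one raises SQLSTATE 40P02
 * (synchronous_replay_not_available).  Returns NULL on failure.
 */
static PGresult *
run_read_only(const char *sql)
{
    for (size_t i = 0; i < sizeof(standbys) / sizeof(standbys[0]); i++)
    {
        PGconn     *conn = PQconnectdb(standbys[i]);
        PGresult   *res;
        const char *sqlstate;
        bool        lease_unavailable;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            PQfinish(conn);
            continue;           /* can't connect; try the next standby */
        }
        PQclear(PQexec(conn, "SET synchronous_replay = on"));

        res = PQexec(conn, sql);
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            PQfinish(conn);     /* a PGresult stays valid after PQfinish */
            return res;
        }
        sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
        lease_unavailable = (sqlstate != NULL && strcmp(sqlstate, "40P02") == 0);
        if (!lease_unavailable)
            fprintf(stderr, "giving up: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        if (!lease_unavailable)
            return NULL;        /* unrelated error; don't retry blindly */
        /* else this standby has no lease right now; fall back to the next */
    }
    return NULL;
}
```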
--
Thomas Munro
http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 12 July 2017 at 23:45, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
I renamed it to "synchronous replay", because "causal reads" seemed a bit too
arcane.
Hi
I looked through the code of `synchronous-replay-v1.patch` a bit and ran a
few
tests. I didn't manage to break anything, except one mysterious error that
I've
got only once on one of my replicas, but I couldn't reproduce it yet.
Interesting thing is that this error did not affect another replica or
primary.
Just in case here is the log for this error (maybe you can see something
obvious, that I've not noticed):
```
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211":
Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211":
Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: directories for tablespace 47733 could not be removed
HINT: You can remove the directories manually if necessary.
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
FATAL: could not create directory "pg_tblspc/47734/PG_10_201707211/47732":
File exists
CONTEXT: WAL redo at 0/125F5768 for Storage/CREATE:
pg_tblspc/47734/PG_10_201707211/47732/47736
LOG: startup process (PID 8034) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
```
And speaking about the code, so far I have just a few notes (some of them
merely questions):
* In general the idea behind this patch sounds interesting for me, but it
relies heavily on time synchronization. As mentioned in the documentation:
"Current hardware clocks, NTP implementations and public time servers are
unlikely to allow the system clocks to differ more than tens or hundreds
of
milliseconds, and systems synchronized with dedicated local time servers
may
be considerably more accurate." But as far as I remember from my own
experience sometimes it maybe not that trivial on something like AWS
because
of virtualization. Maybe it's an unreasonable fear, but is it possible to
address this problem somehow?
* Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
time shift when we're dealing with leases). Does it make sense to move
them
out and make them configurable?
* Judging from the `SyncReplayPotentialStandby` function, it's possible to
have
`synchronous_replay_standby_names = "*, server_name"`, which is basically
an
equivalent for just `*`, but it looks confusing. Is it worth it to prevent
this behaviour?
* In the same function `SyncReplayPotentialStandby` there is this code:
```
if (!SplitIdentifierString(rawstring, ',', &elemlist))
{
/* syntax error in list */
pfree(rawstring);
list_free(elemlist);
/* GUC machinery will have already complained - no need to do again */
return false;
}
```
Am I right that ideally this (a situation when at this point in the code
`synchronous_replay_standby_names` has incorrect value) should not happen,
because GUC will prevent us from that? If yes, then it looks for me that
it
still makes sense to put here a log message, just to give more
information in
a potentially weird situation.
* In the function `SyncRepReleaseWaiters` there is a commentary:
```
/*
* If the number of sync standbys is less than requested or we aren't
* managing a sync standby or a standby in synchronous replay state that
* blocks then just leave.
* /
if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
```
Is this commentary correct? If I understand everything right
`!got_recptr` -
the number of sync standbys is less than requested (a), `!am_sync` - we
aren't
managing a sync standby (b), `walsender_sr_blocker` - a standby in
synchronous
replay state that blocks (c). Looks like condition is `(a or b) and not
c`.
* In the function `ProcessStandbyReplyMessage` there is a code that
implements
this:
```
* If the standby reports that it has fully replayed the WAL for at
least
* 10 seconds, then let's clear the lag times that were measured when it
* last wrote/flushed/applied a WAL record. This way we avoid displaying
* stale lag data until more WAL traffic arrives.
```
but I never found any mention of this 10 seconds in the documentation. Is
it
not that important? Also, question 2 is related to this one.
* In the function `WalSndGetSyncReplayStateString` all the states are in
lower
case except `UNKNOWN`, is there any particular reason for that?
There are also few more not that important notes (mostly about some typos
and
few confusing names), but I'm going to do another round of review and
testing
anyway so I'll just send them all next time.
On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I looked through the code of `synchronous-replay-v1.patch` a bit and ran a
few
tests. I didn't manage to break anything, except one mysterious error that
I've
got only once on one of my replicas, but I couldn't reproduce it yet.
Interesting thing is that this error did not affect another replica or
primary.
Just in case here is the log for this error (maybe you can see something
obvious, that I've not noticed):
```
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211":
Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211":
Directory not empty
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
LOG: directories for tablespace 47733 could not be removed
HINT: You can remove the directories manually if necessary.
CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
FATAL: could not create directory "pg_tblspc/47734/PG_10_201707211/47732":
File exists
CONTEXT: WAL redo at 0/125F5768 for Storage/CREATE:
pg_tblspc/47734/PG_10_201707211/47732/47736
LOG: startup process (PID 8034) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
```
Hmm. The first error ("could not remove directory") could perhaps be
explained by temporary files from concurrent backends, leaked files
from earlier crashes or copying a pgdata directory over the top of an
existing one as a way to set it up, leaving behind some files from an
earlier test? The second error ("could not create directory") is a
bit stranger though... I think this must come from
TablespaceCreateDbspace(): it must have stat()'d the file and got
ENOENT, decided to create the directory, acquired
TablespaceCreateLock, stat()'d the file again and found it still
absent, then ran mkdir() on the parents and got EEXIST, and finally on
the directory to be created, and surprisingly got EEXIST. That means
that someone must have concurrently created the directory. Perhaps in
your testing you accidentally copied a pgdata directory over the top
of it while it was running? In any case I'm struggling to see how
anything in this patch would affect anything at the REDO level.
And speaking about the code, so far I have just a few notes (some of them
merely questions):

* In general the idea behind this patch sounds interesting for me, but it
relies heavily on time synchronization. As mentioned in the documentation:
"Current hardware clocks, NTP implementations and public time servers are
unlikely to allow the system clocks to differ more than tens or hundreds
of
milliseconds, and systems synchronized with dedicated local time servers
may
be considerably more accurate." But as far as I remember from my own
experience sometimes it maybe not that trivial on something like AWS
because
of virtualization. Maybe it's an unreasonable fear, but is it possible to
address this problem somehow?
Oops, I had managed to lose an important hunk that deals with
detecting excessive clock drift (ie badly configured servers) while
rebasing a couple of versions back. Here is a version to put it back.
With that change, if you disable NTP and manually set your standby's
clock to be more than 1.25s (assuming synchronous_replay_lease_time is
set to the default of 5s) behind the primary, the synchronous_replay
should be unavailable and you should see this error in the standby's
log:
    ereport(LOG,
            (errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
             errhint("Check your servers' NTP configuration or equivalent.")));
One way to test this without messing with your NTP setting or
involving two different computers is to modify this code temporarily
in WalSndKeepalive:
now = GetCurrentTimestamp() + 1250100;
This is a best effort intended to detect a system not running ntpd at
all or talking to an insane time server. Fundamentally this proposal
is based on the assumption that you can get your system clocks into
sync within a tolerance that we feel confident estimating an upper
bound for.
It does appear that some Amazon OS images come with NTP disabled;
that's a problem if you want to use this feature, but if you're
running a virtual server without an ntpd you'll pretty soon drift
seconds to minutes off UTC time and get "unavailable for synchronous
replay" errors from this patch (and possibly the LOG message above,
depending on the direction of drift).
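Sketched in C, the standby-side check amounts to something like this (the
function name and placement are illustrative only, not the restored hunk
itself):
```
/*
 * Illustrative only: compare the primary's keepalive send time against the
 * standby's clock and warn if the primary appears further ahead than the
 * skew budget (25% of synchronous_replay_lease_time) allows.
 */
static void
CheckPrimaryClockSkew(TimestampTz primarySendTime, int lease_time_ms)
{
	TimestampTz now = GetCurrentTimestamp();
	int64		max_skew_us = (int64) lease_time_ms * 1000 / 4;

	if (primarySendTime > now + max_skew_us)
		ereport(LOG,
				(errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
				 errhint("Check your servers' NTP configuration or equivalent.")));
}
```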
* Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
time shift when we're dealing with leases). Does it make sense to move
them
out and make them configurable?
These numbers are interrelated, and I think they're best fixed in that
ratio. You could make it more adjustable, but I think it's better to
keep it simple with just a single knob. Let me restate that logic to
explain how I came up with those ratios. There are two goals:
1. The primary needs to be able to wait for a lease to expire if
connectivity is lost, even if the standby's clock is behind the
primary's clock by max_clock_skew. (Failure to do so could allow
stale query results from a zombie standby that is somehow still
handling queries after connectivity with the primary is lost.)
2. The primary needs to be able to replace leases often enough so
that there are no gaps between them, even if the standby's clock is
ahead of the primary's clock by max_clock_skew. (Failure to replace
leases fast enough could cause spurious "unavailable" errors, but not
incorrect query results. Technically it's max_clock_skew - network
latency since it takes time for new leases to arrive but that's a
minor detail).
A solution that maximises tolerable clock skew and as an added bonus
doesn't require the standby to have access to the primary's GUCs is to
tell the standby that the expiry time is 25% earlier than the time the
primary will really wait until, and replace leases when they still
have 50% of their time to go.
To illustrate using fixed-width ASCII-art, here's how the primary
perceives the stream of leases it sends. In this diagram, '+' marks
halfway, '!' marks the time the primary will send to the standby as
the expiry time, and the final '|' marks the time the primary will
really wait until if it has to, just in case the standby's clock is
'slow' (behind). The '!' is 25% earlier, and represents the maximum
tolerable clock skew.
|---+-!-|
    |---+-!-|
        |---+-!-|
Here's how a standby with a clock that is 'slow' (behind) by
max_clock_skew = 25% perceives this stream of leases:
|-------!
    |-------!
        |-------!
You can see that the primary server is able to wait just long enough
for the lease to expire and the error to begin to be raised on this
standby server, if it needs to.
Here's how a standby with a clock that is 'fast' (ahead) by
max_clock_skew = 25% perceives this stream of leases:
|---!
    |---!
        |---!
If it's ahead by more than that, we'll get gaps where the error may be
raised spuriously in between leases.
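Plugging in the default synchronous_replay_lease_time of 5s, the arithmetic
works out like this (a worked example restating the ratios above, not a
quotation from the patch):
```
/* Worked example with the default synchronous_replay_lease_time = 5000 ms. */
static void
illustrate_lease_arithmetic(void)
{
	int			lease_time_ms = 5000;
	int			max_clock_skew_ms = lease_time_ms / 4;		/* 1250 ms */
	TimestampTz now = GetCurrentTimestamp();

	/* How long the primary will really stall if contact is lost. */
	TimestampTz hidden_expiry =
		TimestampTzPlusMilliseconds(now, lease_time_ms);	/* now + 5000 ms */

	/* The expiry time actually sent to the standby, discounted for skew. */
	TimestampTz sent_expiry =
		TimestampTzPlusMilliseconds(now, lease_time_ms - max_clock_skew_ms);	/* now + 3750 ms */

	/* A replacement lease goes out once half of the lease time has elapsed. */
	TimestampTz next_lease_at =
		TimestampTzPlusMilliseconds(now, lease_time_ms / 2);	/* now + 2500 ms */

	(void) hidden_expiry;
	(void) sent_expiry;
	(void) next_lease_at;
}
```
A standby whose clock is up to 1250 ms slow still sees sent_expiry pass no
later than the moment the primary stops stalling, and a standby up to 1250 ms
fast still receives the next lease just as the previous one appears to run
out (ignoring network latency).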
* Judging from the `SyncReplayPotentialStandby` function, it's possible to
have
`synchronous_replay_standby_names = "*, server_name"`, which is basically
an
equivalent for just `*`, but it looks confusing. Is it worth it to prevent
this behaviour?
Hmm. Seems harmless to me!
* In the same function `SyncReplayPotentialStandby` there is this code:
```
if (!SplitIdentifierString(rawstring, ',', &elemlist))
{
/* syntax error in list */
pfree(rawstring);
list_free(elemlist);
/* GUC machinery will have already complained - no need to do again */
return false;
}
```

Am I right that ideally this (a situation when at this point in the code
`synchronous_replay_standby_names` has incorrect value) should not happen,
because GUC will prevent us from that? If yes, then it looks for me that
it
still makes sense to put here a log message, just to give more information
in
a potentially weird situation.
Yes. That's exactly the coding that was used for synchronous_commit,
before it was upgraded to support a new fancy syntax. I was trying to
do things the established way.
* In the function `SyncRepReleaseWaiters` there is a commentary:
```
/*
* If the number of sync standbys is less than requested or we aren't
* managing a sync standby or a standby in synchronous replay state that
* blocks then just leave.
* /
if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
```

Is this commentary correct? If I understand everything right `!got_recptr`
-
the number of sync standbys is less than requested (a), `!am_sync` - we
aren't
managing a sync standby (b), `walsender_sr_blocker` - a standby in
synchronous
replay state that blocks (c). Looks like condition is `(a or b) and not
c`.
This code is trying to decide whether to leave early, rather than
potentially blocking. The change in my patch is:
- if (!got_recptr || !am_sync)
+ if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
The old coding said "if there aren't enough sync commit standbys, or
I'm not a sync standby, then I can leave now". The coding with my
patch is the same, except that in any case it won't leave early if this
walsender is managing a standby that potentially blocks commit. That
said, it's a terribly named and documented function argument, so I
have fixed that in the attached version; I hope that's better!
To put it another way, with this patch there are two different reasons
a transaction might need to wait: because of synchronous_commit and
because of synchronous_replay. They're both forms of 'synchronous
replication' I suppose. That if statement is saying 'if I don't need
to wait for synchronous_commit, and I don't need to wait for
synchronous_replay, then we can return early'.
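Spelled out with invented names (just a restatement of that if test, not code
from the patch):
```
/* Restatement of: if ((!got_recptr || !am_sync) && !walsender_sr_blocker) */
bool	releases_sync_commit_waiters = got_recptr && am_sync;
bool	releases_sync_replay_waiters = walsender_sr_blocker;

if (!releases_sync_commit_waiters && !releases_sync_replay_waiters)
	return;			/* this walsender has nobody to release; leave early */
```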
* In the function `ProcessStandbyReplyMessage` there is a code that
implements
this:
```
* If the standby reports that it has fully replayed the WAL for at
least
* 10 seconds, then let's clear the lag times that were measured when it
* last wrote/flushed/applied a WAL record. This way we avoid displaying
* stale lag data until more WAL traffic arrives.
```
but I never found any mention of this 10 seconds in the documentation. Is
it
not that important? Also, question 2 is related to this one.
Hmm. Yeah that does seem a bit arbitrary. The documentation in
master does already say that it's cleared without being saying exactly
when:
[...] If the standby
server has entirely caught up with the sending server and there is no more
WAL activity, the most recently measured lag times will continue to be
displayed for a short time and then show NULL.
The v1 patch changed it from being based on
wal_receiver_status_interval (sort of implicitly) to being 10 seconds,
and yeah that is not a good change. In this v2 it's using
wal_receiver_status_interval (though it has to do it explicitly, since
this patch increases the amount of chit chat between the servers due
to lease replacement).
* In the function `WalSndGetSyncReplayStateString` all the states are in
lower
case except `UNKNOWN`, is there any particular reason for that?
This state should never exist, so no user will ever see it; I used
"UNKNOWN" following the convention established by the function
WalSndGetStateString() immediately above.
There are also few more not that important notes (mostly about some typos
and
few confusing names), but I'm going to do another round of review and
testing
anyway so I'll just send them all next time.
Thanks for the review!
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
synchronous-replay-v2.patchapplication/octet-stream; name=synchronous-replay-v2.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b45b7f7f69b..9039e7b29db 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2905,6 +2905,36 @@ include_dir 'conf.d'
across the cluster without problems if that is required.
</para>
+ <sect2 id="runtime-config-replication-all">
+ <title>All Servers</title>
+ <para>
+ These parameters can be set on the primary or any standby.
+ </para>
+ <variablelist>
+ <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+ <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables causal consistency between transactions run on different
+ servers. A transaction that is run on a standby
+ with <varname>synchronous_replay</> set to <literal>on</> is
+ guaranteed either to see the effects of all completed transactions
+ run on the primary with the setting on, or to receive an error
+ "standby is not available for synchronous replay". Note that both
+ transactions involved in a causal dependency (a write on the primary
+ followed by a read on any server which must see the write) must be
+ run with the setting on. See <xref linkend="synchronous-replay"> for
+ more details.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-replication-sender">
<title>Sending Server(s)</title>
@@ -3215,6 +3245,66 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><varname>synchronous_replay_max_lag</varname>
+ (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_max_lag</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the maximum replay lag the primary will tolerate from a
+ standby before dropping it from the synchronous replay set.
+ </para>
+ <para>
+ This must be set to a value which is at least 4 times the maximum
+ possible difference in system clocks between the primary and standby
+ servers, as described in <xref linkend="synchronous-replay">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_lease_time</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the duration of 'leases' sent by the primary server to
+ standbys granting them the right to run synchronous replay queries for
+ a limited time. This affects the rate at which replacement leases
+ must be sent and the wait time if contact is lost with a standby, as
+ described in <xref linkend="synchronous-replay">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous-replay-standby-names">
+ <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_standby_names</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a comma-separated list of standby names that can support
+ <firstterm>synchronous replay</>, as described in
+ <xref linkend="synchronous-replay">. Follows the same convention
+ as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</></>.
+ The default is <literal>*</>, matching all standbys.
+ </para>
+ <para>
+ This setting has no effect if <varname>synchronous_replay_max_lag</>
+ is not set.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 138bdf2a75d..54e292f7fbd 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1127,7 +1127,7 @@ primary_slot_name = 'node_a_slot'
cause each commit to wait until the current synchronous standbys report
that they have replayed the transaction, making it visible to user
queries. In simple cases, this allows for load balancing with causal
- consistency.
+ consistency. See also <xref linkend="synchronous-replay">.
</para>
<para>
@@ -1325,6 +1325,119 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="synchronous-replay">
+ <title>Synchronous replay</title>
+ <indexterm>
+ <primary>synchronous replay</primary>
+ <secondary>in standby</secondary>
+ </indexterm>
+
+ <para>
+ The synchronous replay feature allows read-only queries to run on hot
+ standby servers without exposing stale data to the client, providing a
+ form of causal consistency. Transactions can run on any standby with the
+ following guarantee about the visibility of preceding transactions: If you
+ set <varname>synchronous_replay</> to <literal>on</> in any pair of
+ consecutive transactions tx1, tx2 where tx2 begins after tx1 successfully
+ returns, then tx2 will either see tx1 or fail with a new error "standby is
+ not available for synchronous replay", no matter which server it runs on.
+ Although the guarantee is expressed in terms of two individual
+ transactions, the GUC can also be set at session, role or system level to
+ make the guarantee generally, allowing for load balancing of applications
+ that were not designed with load balancing in mind.
+ </para>
+
+ <para>
+ In order to enable the feature, <varname>synchronous_replay_max_lag</>
+ must be set to a non-zero value on the primary server. The
+ GUC <varname>synchronous_replay_standby_names</> can be used to limit the
+ set of standbys that can join the dynamic set of synchronous replay
+ standbys by providing a comma-separated list of application names. By
+ default, all standbys are candidates, if the feature is enabled.
+ </para>
+
+ <para>
+ The current set of servers that the primary considers to be available for
+ synchronous replay can be seen in
+ the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</></>
+ view. Administrators, applications and load balancing middleware can use
+ this view to discover standbys that can currently handle synchronous
+ replay transactions without raising the error. Since that information is
+ only an instantaneous snapshot, clients should still be prepared for
+ the error to be raised at any time, and consider redirecting transactions
+ to another standby.
+ </para>
+
+ <para>
+ The advantages of the synchronous replay feature over simply
+ setting <varname>synchronous_commit</> to <literal>remote_apply</> are:
+ <orderedlist>
+ <listitem>
+ <para>
+ It provides certainty about exactly which standbys can see a
+ transaction.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It places a configurable limit on how much replay lag (and therefore
+ delay at commit time) the primary tolerates from standbys before it
+ drops them from the dynamic set of standbys it waits for.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It upholds the synchronous replay guarantee during the transitions that
+ occur when new standbys are added or removed from the set of standbys,
+ including scenarios where contact has been lost between the primary
+ and standbys but the standby is still alive and running client
+ queries.
+ </para>
+ </listitem>
+ </orderedlist>
+ </para>
+
+ <para>
+ The protocol used to uphold the guarantee even in the case of network
+ failure depends on the system clocks of the primary and standby servers
+ being synchronized, with an allowance for a difference up to one quarter
+ of <varname>synchronous_replay_lease_time</>. For example,
+ if <varname>synchronous_replay_lease_time</> is set to <literal>5s</>,
+ then the clocks must not be more than 1.25 seconds apart for the guarantee
+ to be upheld reliably during transitions. The ubiquity of the Network
+ Time Protocol (NTP) on modern operating systems and availability of high
+ quality time servers makes it possible to choose a tolerance significantly
+ higher than the maximum expected clock difference. An effort is
+ nevertheless made to detect and report misconfigured and faulty systems
+ with clock differences greater than the configured tolerance.
+ </para>
+
+ <note>
+ <para>
+ Current hardware clocks, NTP implementations and public time servers are
+ unlikely to allow the system clocks to differ more than tens or hundreds
+ of milliseconds, and systems synchronized with dedicated local time
+ servers may be considerably more accurate, but you should only consider
+ setting <varname>synchronous_replay_lease_time</> below the default of 5
+ seconds (allowing up to 1.25 seconds of clock difference) after
+ researching your time synchronization infrastructure thoroughly.
+ </para>
+ </note>
+
+ <note>
+ <para>
+ While similar to synchronous commit in the sense that both involve the
+ primary server waiting for responses from standby servers, the
+ synchronous replay feature is not concerned with avoiding data loss. A
+ primary configured for synchronous replay will drop all standbys that
+ stop responding or replay too slowly from the dynamic set that it waits
+ for, so you should consider configuring both synchronous replication and
+ synchronous replay if you need data loss avoidance guarantees and causal
+ consistency guarantees for load balancing.
+ </para>
+ </note>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous archiving in standby</title>
@@ -1673,7 +1786,16 @@ if (!triggered)
so there will be a measurable delay between primary and standby. Running the
same query nearly simultaneously on both primary and standby might therefore
return differing results. We say that data on the standby is
- <firstterm>eventually consistent</firstterm> with the primary. Once the
+ <firstterm>eventually consistent</firstterm> with the primary by default.
+ The data visible to a transaction running on a standby can be
+ made <firstterm>causally consistent</> with respect to a transaction that
+ has completed on the primary by setting <varname>synchronous_replay</>
+ to <literal>on</> in both transactions. For more details,
+ see <xref linkend="synchronous-replay">.
+ </para>
+
+ <para>
+ Once the
commit record for a transaction is replayed on the standby, the changes
made by that transaction will be visible to any new snapshots taken on
the standby. Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index be3dc672bcc..c48243362c2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1790,6 +1790,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</itemizedlist>
</entry>
</row>
+ <row>
+ <entry><structfield>sync_replay</></entry>
+ <entry><type>text</></entry>
+ <entry>Synchronous replay state of this standby server. This field will be
+ non-null only if <varname>synchronous_replay_max_lag</> is set. If a standby is
+ in <literal>available</> state, then it can currently serve synchronous replay
+ queries. If it is not replaying fast enough or not responding to
+ keepalive messages, it will be in <literal>unavailable</> state, and if
+ it is currently transitioning to availability it will be
+ in <literal>joining</> state for a short time.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b0aa69fe4b4..deb14e346a5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5149,7 +5149,7 @@ XactLogCommitRecord(TimestampTz commit_time,
* Check if the caller would like to ask standbys for immediate feedback
* once this commit is applied.
*/
- if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+ if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0fdad0c1197..cc8b565386f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -732,7 +732,8 @@ CREATE VIEW pg_stat_replication AS
W.flush_lag,
W.replay_lag,
W.sync_priority,
- W.sync_state
+ W.sync_state,
+ W.sync_replay
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a0b0eecbd5e..b3074a6578e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3606,6 +3606,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_SYNC_REPLAY:
+ event_name = "SyncReplay";
+ break;
case WAIT_EVENT_LOGICAL_SYNC_DATA:
event_name = "LogicalSyncData";
break;
@@ -3640,6 +3643,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
case WAIT_EVENT_RECOVERY_APPLY_DELAY:
event_name = "RecoveryApplyDelay";
break;
+ case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+ event_name = "SyncReplayLeaseRevoke";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0d48dfa4947..431205db9b8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1306,6 +1306,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
pq_sendint64(reply_message, writepos); /* apply */
pq_sendint64(reply_message, now); /* sendTime */
pq_sendbyte(reply_message, requestReply); /* replyRequested */
+ pq_sendint64(reply_message, -1); /* replyTo */
elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 77e80f16123..f18321293f3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
-static int SyncRepWakeQueue(bool all, int mode);
+static int SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
XLogRecPtr *flushPtr,
@@ -129,6 +138,229 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
+ * Check if we can stop waiting for synchronous replay. We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1. All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2. All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for. The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+ int *waitingFor,
+ long *stallTimeMillis)
+{
+ TimestampTz now = GetCurrentTimestamp();
+ TimestampTz stallTime = 0;
+ int i;
+
+ /* Count how many joining/available nodes we are waiting for. */
+ *waitingFor = 0;
+
+ for (i = 0; i < max_wal_senders; ++i)
+ {
+ WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0)
+ {
+ /*
+ * We need to hold the spinlock to read LSNs, because we can't be
+ * sure they can be read atomically.
+ */
+ SpinLockAcquire(&walsnd->mutex);
+ if (walsnd->pid != 0)
+ {
+ switch (walsnd->syncReplayState)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ /* Nothing to wait for. */
+ break;
+ case SYNC_REPLAY_JOINING:
+ case SYNC_REPLAY_AVAILABLE:
+ /*
+ * We have to wait until this standby tells us that is has
+ * replayed the commit record.
+ */
+ if (walsnd->apply < XactCommitLSN)
+ ++*waitingFor;
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /*
+ * We have to hold up commits until this standby
+ * acknowledges that its lease was revoked, or we know the
+ * most recently sent lease has expired anyway, whichever
+ * comes first. One way or the other, we don't release
+ * until this standby has started raising an error for
+ * synchronous replay transactions.
+ */
+ if (walsnd->revokingUntil > now)
+ {
+ ++*waitingFor;
+ stallTime = Max(stallTime, walsnd->revokingUntil);
+ }
+ break;
+ }
+ }
+ SpinLockRelease(&walsnd->mutex);
+ }
+ }
+
+ /*
+ * If a walsender has exited uncleanly, then it writes its revoking wait
+ * time into a shared space before it gives up its WalSnd slot. So we
+ * have to wait for that too.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ if (WalSndCtl->revokingUntil > now)
+ {
+ long seconds;
+ int usecs;
+
+ /* Compute how long we have to wait, rounded up to nearest ms. */
+ TimestampDifference(now, WalSndCtl->revokingUntil,
+ &seconds, &usecs);
+ *stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+ }
+ else
+ *stallTimeMillis = 0;
+ LWLockRelease(SyncRepLock);
+
+ /* We are done if we are not waiting for any nodes or stalls. */
+ return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all "available" and "joining" standbys to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked. By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ long stallTimeMillis;
+ int waitingFor;
+ char *ps_display_buffer = NULL;
+
+ for (;;)
+ {
+ /* Reset latch before checking state. */
+ ResetLatch(MyLatch);
+
+ /*
+ * Join the queue to be woken up if any synchronous replay
+ * joining/available standby applies XactCommitLSN or the set of
+ * synchronous replay standbys changes (if we aren't already in the
+ * queue). We don't actually know if we need to wait for any peers to
+ * reach the target LSN yet, but we have to register just in case
+ * before checking the walsenders' state to avoid a race condition
+ * that could occur if we did it after calling
+ * SyncReplayCommitCanReturn. (SyncRepWaitForLSN doesn't have
+ * to do this because it can check the highest-seen LSN in
+ * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+ * lock as the queues. We can't do that here, because there is no
+ * single highest-seen LSN that is useful. We must check
+ * walsnd->apply for all relevant walsenders. Therefore we must
+ * register for notifications first, so that we can be notified via
+ * our latch of any standby applying the LSN we're interested in after
+ * we check but before we start waiting, or we could wait forever for
+ * something that has already happened.)
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ if (MyProc->syncRepState != SYNC_REP_WAITING)
+ {
+ MyProc->waitLSN = XactCommitLSN;
+ MyProc->syncRepState = SYNC_REP_WAITING;
+ SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+ Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+ }
+ LWLockRelease(SyncRepLock);
+
+ /* Check if we're done. */
+ if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+ &stallTimeMillis))
+ {
+ SyncRepCancelWait();
+ break;
+ }
+
+ Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+ /* If we aren't actually waiting for any standbys, leave the queue. */
+ if (waitingFor == 0)
+ SyncRepCancelWait();
+
+ /* Update the ps title. */
+ if (update_process_title)
+ {
+ char buffer[80];
+
+ /* Remember the old value if this is our first update. */
+ if (ps_display_buffer == NULL)
+ {
+ int len;
+ const char *ps_display = get_ps_display(&len);
+
+ ps_display_buffer = palloc(len + 1);
+ memcpy(ps_display_buffer, ps_display, len);
+ ps_display_buffer[len] = '\0';
+ }
+
+ snprintf(buffer, sizeof(buffer),
+ "waiting for %d peer(s) to apply %X/%X%s",
+ waitingFor,
+ (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+ stallTimeMillis > 0 ? " (revoking)" : "");
+ set_ps_display(buffer, false);
+ }
+
+ /* Check if we need to exit early due to postmaster death etc. */
+ if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+ break;
+
+ /*
+ * If we are still waiting for peers, then we wait for any joining or
+ * available peer to reach the LSN (or possibly stop being in one of
+ * those states or go away).
+ *
+ * If not, there must be a non-zero stall time, so we wait for that to
+ * elapse.
+ */
+ if (waitingFor > 0)
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+ WAIT_EVENT_SYNC_REPLAY);
+ else
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+ stallTimeMillis,
+ WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+ }
+
+ /* There is no way out of the loop that could leave us in the queue. */
+ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+ MyProc->waitLSN = 0;
+
+ /* Restore the ps display. */
+ if (ps_display_buffer != NULL)
+ {
+ set_ps_display(ps_display_buffer, false);
+ pfree(ps_display_buffer);
+ }
+}
+
+/*
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
@@ -149,11 +381,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
const char *old_status;
int mode;
- /* Cap the level for anything other than commit to remote flush only. */
- if (commit)
- mode = SyncRepWaitMode;
- else
- mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+ /* Wait for synchronous replay, if configured. */
+ if (synchronous_replay)
+ SyncReplayWaitForLSN(lsn);
/*
* Fast exit if user has not requested sync replication, or there are no
@@ -169,6 +399,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+ /* Cap the level for anything other than commit to remote flush only. */
+ if (commit)
+ mode = SyncRepWaitMode;
+ else
+ mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
/*
* We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
* set. See SyncRepUpdateSyncStandbysDefined.
@@ -229,57 +465,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
break;
- /*
- * If a wait for synchronous replication is pending, we can neither
- * acknowledge the commit nor raise ERROR or FATAL. The latter would
- * lead the client to believe that the transaction aborted, which is
- * not true: it's already committed locally. The former is no good
- * either: the client has requested synchronous replication, and is
- * entitled to assume that an acknowledged commit is also replicated,
- * which might not be true. So in this case we issue a WARNING (which
- * some clients may be able to interpret) and shut off further output.
- * We do NOT reset ProcDiePending, so that the process will die after
- * the commit is cleaned up.
- */
- if (ProcDiePending)
- {
- ereport(WARNING,
- (errcode(ERRCODE_ADMIN_SHUTDOWN),
- errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
+ /* Check if we need to break early due to cancel/shutdown/death. */
+ if (SyncRepCheckForEarlyExit())
break;
- }
-
- /*
- * It's unclear what to do if a query cancel interrupt arrives. We
- * can't actually abort at this point, but ignoring the interrupt
- * altogether is not helpful, so we just terminate the wait with a
- * suitable warning.
- */
- if (QueryCancelPending)
- {
- QueryCancelPending = false;
- ereport(WARNING,
- (errmsg("canceling wait for synchronous replication due to user request"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- SyncRepCancelWait();
- break;
- }
-
- /*
- * If the postmaster dies, we'll probably never get an
- * acknowledgement, because all the wal sender processes will exit. So
- * just bail out.
- */
- if (!PostmasterIsAlive())
- {
- ProcDiePending = true;
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
- break;
- }
/*
* Wait on latch. Any condition that should wake us up will set the
@@ -402,14 +590,65 @@ SyncRepInitConfig(void)
}
/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ bool found = false;
+
+ /* If the feature is disabled, then no. */
+ if (synchronous_replay_max_lag == 0)
+ return false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(synchronous_replay_standby_names);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ /* GUC machinery will have already complained - no need to do again */
+ return false;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return found;
+}
+
+/*
* Update the LSNs on each queue based upon our latest state. This
* implements a simple policy of first-valid-sync-standby-releases-waiter.
*
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
* Other policies are possible, which would change what we do here and
* perhaps also which information we store as well.
*/
void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
XLogRecPtr writePtr;
@@ -423,13 +662,15 @@ SyncRepReleaseWaiters(void)
/*
* If this WALSender is serving a standby that is not on the list of
- * potential sync standbys then we have nothing to do. If we are still
- * starting up, still running base backup or the current flush position is
- * still invalid, then leave quickly also.
+ * potential sync standbys and not in a state that synchronous_replay waits
+ * for, then we have nothing to do. If we are still starting up, still
+ * running base backup or the current flush position is still invalid,
+ * then leave quickly also.
*/
- if (MyWalSnd->sync_standby_priority == 0 ||
- MyWalSnd->state < WALSNDSTATE_STREAMING ||
- XLogRecPtrIsInvalid(MyWalSnd->flush))
+ if (!am_syncreplay_blocker &&
+ (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING ||
+ XLogRecPtrIsInvalid(MyWalSnd->flush)))
{
announce_next_takeover = true;
return;
@@ -467,9 +708,10 @@ SyncRepReleaseWaiters(void)
/*
* If the number of sync standbys is less than requested or we aren't
- * managing a sync standby then just leave.
+ * managing a sync standby or a standby in synchronous replay state that
+ * blocks then just leave.
*/
- if (!got_recptr || !am_sync)
+ if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = !am_sync;
@@ -478,24 +720,36 @@ SyncRepReleaseWaiters(void)
/*
* Set the lsn first so that when we wake backends they will release up to
- * this location.
+ * this location, for backends waiting for synchronous commit.
*/
- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ if (got_recptr && am_sync)
{
- walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
- numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
- numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
- numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+ if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+ numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+ numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+ numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+ }
}
+ /*
+ * Wake backends that are waiting for synchronous_replay, if this walsender
+ * manages a standby that is in synchronous replay 'available' or 'joining'
+ * state.
+ */
+ if (am_syncreplay_blocker)
+ SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+ MyWalSnd->apply);
+
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -993,9 +1247,8 @@ SyncRepGetStandbyPriority(void)
* Must hold SyncRepLock.
*/
static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
{
- volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
@@ -1012,7 +1265,7 @@ SyncRepWakeQueue(bool all, int mode)
/*
* Assume the queue is ordered by LSN
*/
- if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+ if (!all && lsn < proc->waitLSN)
return numprocs;
/*
@@ -1079,7 +1332,7 @@ SyncRepUpdateSyncStandbysDefined(void)
int i;
for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
- SyncRepWakeQueue(true, i);
+ SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
}
/*
@@ -1130,6 +1383,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
}
#endif
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+ /*
+ * If a wait for synchronous replication is pending, we can neither
+ * acknowledge the commit nor raise ERROR or FATAL. The latter would
+ * lead the client to believe that the transaction aborted, which is
+ * not true: it's already committed locally. The former is no good
+ * either: the client has requested synchronous replication, and is
+ * entitled to assume that an acknowledged commit is also replicated,
+ * which might not be true. So in this case we issue a WARNING (which
+ * some clients may be able to interpret) and shut off further output.
+ * We do NOT reset ProcDiePending, so that the process will die after
+ * the commit is cleaned up.
+ */
+ if (ProcDiePending)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * It's unclear what to do if a query cancel interrupt arrives. We
+ * can't actually abort at this point, but ignoring the interrupt
+ * altogether is not helpful, so we just terminate the wait with a
+ * suitable warning.
+ */
+ if (QueryCancelPending)
+ {
+ QueryCancelPending = false;
+ ereport(WARNING,
+ (errmsg("canceling wait for synchronous replication due to user request"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * If the postmaster dies, we'll probably never get an
+ * acknowledgement, because all the wal sender processes will exit. So
+ * just bail out.
+ */
+ if (!PostmasterIsAlive())
+ {
+ ProcDiePending = true;
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ return false;
+}
+
/*
* ===========================================================
* Synchronous Replication functions executed by any process
@@ -1199,6 +1510,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
SyncRepConfig = (SyncRepConfigData *) extra;
}
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+ char *rawstring;
+ List *elemlist;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(*newval);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ GUC_check_errdetail("List syntax is invalid.");
+ pfree(rawstring);
+ list_free(elemlist);
+ return false;
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return true;
+}
+
void
assign_synchronous_commit(int newval, void *extra)
{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ea9d21a46b3..14a971f9822 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/ipc.h"
@@ -139,9 +140,10 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *syncReplayLease);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -466,7 +468,7 @@ WalReceiverMain(void)
}
/* Let the master know that we received some data. */
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
/*
* If we've written some records, flush them to disk and
@@ -511,7 +513,7 @@ WalReceiverMain(void)
*/
walrcv->force_reply = false;
pg_memory_barrier();
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, -1);
}
}
if (rc & WL_POSTMASTER_DEATH)
@@ -569,7 +571,7 @@ WalReceiverMain(void)
}
}
- XLogWalRcvSendReply(requestReply, requestReply);
+ XLogWalRcvSendReply(requestReply, requestReply, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -874,6 +876,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
XLogRecPtr walEnd;
TimestampTz sendTime;
bool replyRequested;
+ TimestampTz syncReplayLease;
+ int64 messageNumber;
resetStringInfo(&incoming_message);
@@ -893,7 +897,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
dataStart = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, NULL);
buf += hdrlen;
len -= hdrlen;
@@ -903,7 +907,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
case 'k': /* Keepalive */
{
/* copy message to StringInfo */
- hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+ hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+ sizeof(char) + sizeof(int64);
if (len != hdrlen)
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -911,15 +916,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
appendBinaryStringInfo(&incoming_message, buf, hdrlen);
/* read the fields */
+ messageNumber = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
replyRequested = pq_getmsgbyte(&incoming_message);
+ syncReplayLease = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
/* If the primary requested a reply, send one immediately */
if (replyRequested)
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, messageNumber);
break;
}
default:
@@ -1082,7 +1089,7 @@ XLogWalRcvFlush(bool dying)
/* Also let the master know that we made some progress */
if (!dying)
{
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -1100,9 +1107,12 @@ XLogWalRcvFlush(bool dying)
* If 'requestReply' is true, requests the server to reply immediately upon
* receiving this message. This is used for heartbeats, when approaching
* wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should include the message number, otherwise -1.
*/
static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
{
static XLogRecPtr writePtr = 0;
static XLogRecPtr flushPtr = 0;
@@ -1149,6 +1159,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
pq_sendint64(&reply_message, applyPtr);
pq_sendint64(&reply_message, GetCurrentTimestamp());
pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+ pq_sendint64(&reply_message, replyTo);
/* Send it */
elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1281,15 +1292,56 @@ XLogWalRcvSendHSFeedback(bool immed)
* Update shared memory status upon receiving a message from primary.
*
* 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary. 'syncReplayLease' is a pointer to the time
+ * the primary promises that this standby can safely claim to be causally
+ * consistent, to 0 if it cannot, or a NULL pointer for no change.
*/
static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *syncReplayLease)
{
WalRcvData *walrcv = WalRcv;
TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+ /* Sanity check for the syncReplayLease time. */
+ if (syncReplayLease != NULL && *syncReplayLease != 0)
+ {
+ /*
+ * Deduce max_clock_skew from the syncReplayLease and sendTime since
+ * we don't have access to the primary's GUC. The primary already
+ * subtracted 25% from synchronous_replay_lease_time to represent
+ * max_clock_skew, so we have 75%. A third of that will give us 25%.
+ */
+ int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+ int64 max_clock_skew = diffMillis / 3;
+ if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+ max_clock_skew))
+ {
+ /*
+ * The primary's clock is more than max_clock_skew + network
+ * latency ahead of the standby's clock. (If the primary's clock
+ * is more than max_clock_skew ahead of the standby's clock, but
+ * by less than the network latency, then there isn't much we can
+ * do to detect that; but it still seems useful to have this basic
+ * sanity check for wildly misconfigured servers.)
+ */
+ ereport(LOG,
+ (errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+ errhint("Check your servers' NTP configuration or equivalent.")));
+
+ syncReplayLease = NULL;
+ }
+ /*
+ * We could also try to detect cases where sendTime is more than
+ * max_clock_skew in the past according to the standby's clock, but
+ * that is indistinguishable from network latency/buffering, so we
+ * could produce misleading error messages; if we do nothing, the
+ * consequence is 'standby is not available for synchronous replay'
+ * errors which should cause the user to investigate.
+ */
+ }
+
/* Update shared-memory status */
SpinLockAcquire(&walrcv->mutex);
if (walrcv->latestWalEnd < walEnd)
@@ -1297,6 +1349,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
walrcv->latestWalEnd = walEnd;
walrcv->lastMsgSendTime = sendTime;
walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+ if (syncReplayLease != NULL)
+ walrcv->syncReplayLease = *syncReplayLease;
SpinLockRelease(&walrcv->mutex);
if (log_min_messages <= DEBUG2)
@@ -1334,7 +1388,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
* This is called by the startup process whenever interesting xlog records
* are applied, so that walreceiver can check if it needs to send an apply
* notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_relay = on.
*/
void
WalRcvForceReply(void)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 8ed7254b5c6..dec98eb48c8 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
#include "replication/walreceiver.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/guc.h"
#include "utils/timestamp.h"
WalRcvData *WalRcv = NULL;
@@ -373,3 +374,21 @@ GetReplicationTransferLatency(void)
return ms;
}
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+ WalRcvData *walrcv = WalRcv;
+ TimestampTz now = GetCurrentTimestamp();
+ bool result;
+
+ SpinLockAcquire(&walrcv->mutex);
+ result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+ SpinLockRelease(&walrcv->mutex);
+
+ return result;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9a2babef1e6..96ae40f9e3e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -167,9 +167,23 @@ static StringInfoData tmpbuf;
*/
static TimestampTz last_reply_timestamp = 0;
+static TimestampTz last_keepalive_timestamp = 0;
+
/* Have we sent a heartbeat message asking for reply, since last reply? */
static bool waiting_for_ping_response = false;
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
/*
* While streaming WAL in Copy mode, streamingDoneSending is set to true
* after we have sent CopyDone. We should not send any more CopyData messages
@@ -239,7 +253,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
static void WalSndKeepaliveIfNecessary(TimestampTz now);
static void WalSndCheckTimeOut(TimestampTz now);
static long WalSndComputeSleeptime(TimestampTz now);
@@ -281,6 +295,61 @@ InitWalSender(void)
}
/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+ if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+ {
+ /*
+ * We've lost contact with the standby, but it may still be alive. We
+ * can't let any committing synchronous_replay transactions return
+ * control until we've stalled for long enough for a zombie standby to
+ * start raising errors because its lease has expired. Because our
+ * WalSnd slot is going away, we need to use the shared
+ * WalSndCtl->revokingUntil variable.
+ */
+ elog(LOG,
+ "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+ application_name);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+ synchronous_replay_last_lease);
+ LWLockRelease(SyncRepLock);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+ if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+ {
+ /*
+ * The standby is shutting down, so it won't be running any more
+ * transactions. It is therefore safe to stop waiting for it without
+ * any kind of lease revocation protocol.
+ */
+ elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
* Clean up after an error.
*
* WAL sender processes don't use transactions like regular backends do.
@@ -308,7 +377,10 @@ WalSndErrorCleanup(void)
replication_active = false;
if (got_STOPPING || got_SIGUSR2)
+ {
+ PrepareUncleanExit();
proc_exit(0);
+ }
/* Revert back to startup state */
WalSndSetState(WALSNDSTATE_STARTUP);
@@ -320,6 +392,8 @@ WalSndErrorCleanup(void)
static void
WalSndShutdown(void)
{
+ PrepareUncleanExit();
+
/*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -1583,6 +1657,7 @@ ProcessRepliesIfAny(void)
if (r < 0)
{
/* unexpected error or EOF */
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1599,6 +1674,7 @@ ProcessRepliesIfAny(void)
resetStringInfo(&reply_message);
if (pq_getmessage(&reply_message, 0))
{
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1648,6 +1724,7 @@ ProcessRepliesIfAny(void)
* 'X' means that the standby is closing down the socket.
*/
case 'X':
+ PrepareCleanExit();
proc_exit(0);
default:
@@ -1745,9 +1822,11 @@ ProcessStandbyReplyMessage(void)
flushLag,
applyLag;
bool clearLagTimes;
+ int64 replyTo;
TimestampTz now;
static bool fullyAppliedLastTime = false;
+ static TimestampTz fullyAppliedSince = 0;
/* the caller already consumed the msgtype byte */
writePtr = pq_getmsgint64(&reply_message);
@@ -1755,6 +1834,7 @@ ProcessStandbyReplyMessage(void)
applyPtr = pq_getmsgint64(&reply_message);
(void) pq_getmsgint64(&reply_message); /* sendTime; not used ATM */
replyRequested = pq_getmsgbyte(&reply_message);
+ replyTo = pq_getmsgint64(&reply_message);
elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
(uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1769,17 +1849,17 @@ ProcessStandbyReplyMessage(void)
applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
/*
- * If the standby reports that it has fully replayed the WAL in two
- * consecutive reply messages, then the second such message must result
- * from wal_receiver_status_interval expiring on the standby. This is a
- * convenient time to forget the lag times measured when it last
- * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
- * until more WAL traffic arrives.
+ * If the standby reports that it has fully replayed the WAL for at least
+ * wal_receiver_status_interval, then let's clear the lag times that were
+ * measured when it last wrote/flushed/applied a WAL record. This way we
+ * avoid displaying stale lag data until more WAL traffic arrives.
*/
clearLagTimes = false;
if (applyPtr == sentPtr)
{
- if (fullyAppliedLastTime)
+ if (!fullyAppliedLastTime)
+ fullyAppliedSince = now;
+ else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
clearLagTimes = true;
fullyAppliedLastTime = true;
}
@@ -1795,8 +1875,53 @@ ProcessStandbyReplyMessage(void)
* standby.
*/
{
+ int next_sr_state = -1;
WalSnd *walsnd = MyWalSnd;
+ /* Handle synchronous replay state machine. */
+ if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+ {
+ bool replay_lag_acceptable;
+
+ /* Check if the lag is acceptable (includes -1 for caught up). */
+ if (applyLag < synchronous_replay_max_lag * 1000)
+ replay_lag_acceptable = true;
+ else
+ replay_lag_acceptable = false;
+
+ /* Figure out next if the state needs to change. */
+ switch (walsnd->syncReplayState)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ /* Can we join? */
+ if (replay_lag_acceptable)
+ next_sr_state = SYNC_REPLAY_JOINING;
+ break;
+ case SYNC_REPLAY_JOINING:
+ /* Are we still applying fast enough? */
+ if (replay_lag_acceptable)
+ {
+ /* Have we reached the join point yet? */
+ if (applyPtr >= synchronous_replay_joining_until)
+ next_sr_state = SYNC_REPLAY_AVAILABLE;
+ }
+ else
+ next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+ break;
+ case SYNC_REPLAY_AVAILABLE:
+ /* Are we still applying fast enough? */
+ if (!replay_lag_acceptable)
+ next_sr_state = SYNC_REPLAY_REVOKING;
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /* Has the revocation been acknowledged or timed out? */
+ if (replyTo == synchronous_replay_revoke_msgno ||
+ now >= walsnd->revokingUntil)
+ next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+ break;
+ }
+ }
+
SpinLockAcquire(&walsnd->mutex);
walsnd->write = writePtr;
walsnd->flush = flushPtr;
@@ -1807,11 +1932,55 @@ ProcessStandbyReplyMessage(void)
walsnd->flushLag = flushLag;
if (applyLag != -1 || clearLagTimes)
walsnd->applyLag = applyLag;
+ if (next_sr_state != -1)
+ walsnd->syncReplayState = next_sr_state;
+ if (next_sr_state == SYNC_REPLAY_REVOKING)
+ walsnd->revokingUntil = synchronous_replay_last_lease;
SpinLockRelease(&walsnd->mutex);
+
+ /*
+ * Post shmem-update actions for synchronous replay state transitions.
+ */
+ switch (next_sr_state)
+ {
+ case SYNC_REPLAY_JOINING:
+ /*
+ * Now that we've started waiting for this standby, we need to
+ * make sure that everything flushed before now has been applied
+ * before we move to available and issue a lease.
+ */
+ synchronous_replay_joining_until = GetFlushRecPtr();
+ ereport(LOG,
+ (errmsg("standby \"%s\" joining synchronous replay set...",
+ application_name)));
+ break;
+ case SYNC_REPLAY_AVAILABLE:
+ /* Issue a new lease to the standby. */
+ WalSndKeepalive(false);
+ ereport(LOG,
+ (errmsg("standby \"%s\" is available for synchronous replay",
+ application_name)));
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /* Revoke the standby's lease, and note the message number. */
+ synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+ ereport(LOG,
+ (errmsg("revoking synchronous replay lease for standby \"%s\"...",
+ application_name)));
+ break;
+ case SYNC_REPLAY_UNAVAILABLE:
+ ereport(LOG,
+ (errmsg("standby \"%s\" is no longer available for synchronous replay",
+ application_name)));
+ break;
+ default:
+ /* No change. */
+ break;
+ }
}
if (!am_cascading_walsender)
- SyncRepReleaseWaiters();
+ SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
/*
* Advance our local xmin horizon when the client confirmed a flush.
@@ -2001,33 +2170,52 @@ ProcessStandbyHSFeedbackMessage(void)
* If wal_sender_timeout is enabled we want to wake up in time to send
* keepalives and to abort the connection if wal_sender_timeout has been
* reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
*/
static long
WalSndComputeSleeptime(TimestampTz now)
{
long sleeptime = 10000; /* 10 s */
- if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+ if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+ am_potential_synchronous_replay_standby)
{
TimestampTz wakeup_time;
long sec_to_timeout;
int microsec_to_timeout;
- /*
- * At the latest stop sleeping once wal_sender_timeout has been
- * reached.
- */
- wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
-
- /*
- * If no ping has been sent yet, wakeup when it's time to do so.
- * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
- * the timeout passed without a response.
- */
- if (!waiting_for_ping_response)
+ if (am_potential_synchronous_replay_standby)
+ {
+ /*
+ * We need to keep replacing leases before they expire. We'll do
+ * that halfway through the lease time according to our clock, to
+ * allow for the standby's clock to be ahead of the primary's by
+ * 25% of synchronous_replay_lease_time.
+ */
+ wakeup_time =
+ TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ synchronous_replay_lease_time / 2);
+ }
+ else
+ {
+ /*
+ * At the latest stop sleeping once wal_sender_timeout has been
+ * reached.
+ */
wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ wal_sender_timeout);
+
+ /*
+ * If no ping has been sent yet, wakeup when it's time to do so.
+ * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+ * half of the timeout passed without a response.
+ */
+ if (!waiting_for_ping_response)
+ wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
+ }
/* Compute relative time until wakeup. */
TimestampDifference(now, wakeup_time,
@@ -2043,20 +2231,33 @@ WalSndComputeSleeptime(TimestampTz now)
/*
* Check whether there have been responses by the client within
* wal_sender_timeout and shutdown if not.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
*/
static void
WalSndCheckTimeOut(TimestampTz now)
{
TimestampTz timeout;
+ int allowed_time;
/* don't bail out if we're doing something that doesn't require timeouts */
if (last_reply_timestamp <= 0)
return;
- timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
+ /*
+ * If synchronous replay support is configured, we use
+ * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+ * the time before an unresponsive synchronous replay standby is dropped.
+ */
+ if (am_potential_synchronous_replay_standby)
+ allowed_time = synchronous_replay_lease_time;
+ else
+ allowed_time = wal_sender_timeout;
- if (wal_sender_timeout > 0 && now >= timeout)
+ timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ allowed_time);
+ if (allowed_time > 0 && now >= timeout)
{
/*
* Since typically expiration of replication timeout means
@@ -2084,6 +2285,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
/* Report to pgstat that this process is running */
pgstat_report_activity(STATE_RUNNING, NULL);
+ /* Check if we are managing a potential synchronous replay standby. */
+ am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
@@ -2249,6 +2453,7 @@ InitWalSenderSlot(void)
walsnd->flushLag = -1;
walsnd->applyLag = -1;
walsnd->state = WALSNDSTATE_STARTUP;
+ walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
walsnd->latch = &MyProc->procLatch;
SpinLockRelease(&walsnd->mutex);
/* don't need the lock anymore */
@@ -3131,6 +3336,27 @@ WalSndGetStateString(WalSndState state)
return "UNKNOWN";
}
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+ switch (state)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ return "unavailable";
+ case SYNC_REPLAY_JOINING:
+ return "joining";
+ case SYNC_REPLAY_AVAILABLE:
+ return "available";
+ case SYNC_REPLAY_REVOKING:
+ return "revoking";
+ }
+ return "UNKNOWN";
+}
+
static Interval *
offset_to_interval(TimeOffset offset)
{
@@ -3150,7 +3376,7 @@ offset_to_interval(TimeOffset offset)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 11
+#define PG_STAT_GET_WAL_SENDERS_COLS 12
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -3204,6 +3430,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
int priority;
int pid;
WalSndState state;
+ SyncReplayState syncReplayState;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -3216,6 +3443,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
pid = walsnd->pid;
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ syncReplayState = walsnd->syncReplayState;
write = walsnd->write;
flush = walsnd->flush;
apply = walsnd->apply;
@@ -3298,6 +3526,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[10] = CStringGetTextDatum("potential");
+
+ values[11] =
+ CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3313,21 +3544,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* This function is used to send a keepalive message to standby.
* If requestReply is set, sets a flag in the message requesting the standby
* to send a message back to us, for heartbeat purposes.
+ * Return the serial number of the message that was sent.
*/
-static void
+static int64
WalSndKeepalive(bool requestReply)
{
+ TimestampTz synchronous_replay_lease;
+ TimestampTz now;
+
+ static int64 message_number = 0;
+
elog(DEBUG2, "sending replication keepalive");
+ /* Grant a synchronous replay lease if appropriate. */
+ now = GetCurrentTimestamp();
+ if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+ {
+ /* No lease granted, and any earlier lease is revoked. */
+ synchronous_replay_lease = 0;
+ }
+ else
+ {
+ /*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew. We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ *
+ * 1. If the standby's clock is slow (ie behind the primary's) by up
+ * to that much, then subtracting this amount makes sure the lease
+ * doesn't survive past that time according to the primary's
+ * clock.
+ *
+ * 2. If the standby's clock is fast (ie ahead of the primary's) by
+ * up to that much, then subtracting this amount ensures there are no
+ * gaps between leases, since leases are reissued every time 50% of
+ * the lease time elapses (see WalSndKeepaliveIfNecessary and
+ * WalSndComputeSleeptime).
+ */
+ int max_clock_skew = synchronous_replay_lease_time / 4;
+
+ /* Compute and remember the expiry time of the lease we're granting. */
+ synchronous_replay_last_lease =
+ TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+ /* Adjust the version we send for clock skew. */
+ synchronous_replay_lease =
+ TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+ -max_clock_skew);
+ }
+
/* construct the message... */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'k');
+ pq_sendint64(&output_message, ++message_number);
pq_sendint64(&output_message, sentPtr);
- pq_sendint64(&output_message, GetCurrentTimestamp());
+ pq_sendint64(&output_message, now);
pq_sendbyte(&output_message, requestReply ? 1 : 0);
+ pq_sendint64(&output_message, synchronous_replay_lease);
/* ... and send it wrapped in CopyData */
pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+ return message_number;
}
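As an aside (illustration only, not part of the patch), the lease arithmetic
above can be checked with a trivial standalone sketch using the default
synchronous_replay_lease_time of 5000 ms: the primary remembers an expiry of
now + 5000 ms, backdates the copy it sends to now + 3750 ms, and re-issues a
lease roughly every 2500 ms, so a standby clock up to 1250 ms fast or slow
neither outlives its lease nor sees a gap between leases.

#include <stdio.h>

int
main(void)
{
	int			lease_time_ms = 5000;	/* synchronous_replay_lease_time */
	int			max_clock_skew_ms = lease_time_ms / 4;	/* 1250 */
	int			expiry_remembered_ms = lease_time_ms;	/* now + 5000 on the primary's clock */
	int			expiry_sent_ms = lease_time_ms - max_clock_skew_ms; /* now + 3750 sent to standby */
	int			reissue_interval_ms = lease_time_ms / 2;	/* keepalive every 2500 */

	printf("remembered now+%d ms, sent now+%d ms, reissued every %d ms\n",
		   expiry_remembered_ms, expiry_sent_ms, reissue_interval_ms);
	return 0;
}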
/*
@@ -3342,23 +3621,35 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
*/
- if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
- return;
-
- if (waiting_for_ping_response)
- return;
+ if (!am_potential_synchronous_replay_standby)
+ {
+ if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+ return;
+ if (waiting_for_ping_response)
+ return;
+ }
/*
* If half of wal_sender_timeout has lapsed without receiving any reply
* from the standby, send a keep-alive message to the standby requesting
* an immediate reply.
+ *
+ * If synchronous replay has been configured, use
+ * synchronous_replay_lease_time to control keepalive intervals rather
+ * than wal_sender_timeout, so that we can keep replacing leases at the
+ * right frequency.
*/
- ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ if (am_potential_synchronous_replay_standby)
+ ping_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ synchronous_replay_lease_time / 2);
+ else
+ ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
if (now >= ping_time)
{
WalSndKeepalive(true);
waiting_for_ping_response = true;
+ last_keepalive_timestamp = now;
/* Try to flush pending output to the client */
if (pq_flush_if_writable() != 0)
@@ -3398,7 +3689,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
*/
new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
buffer_full = false;
- for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+ for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
{
if (new_write_head == LagTracker.read_heads[i])
buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 4f354717628..d1751f6e0c0 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -307,6 +307,7 @@ Section: Class 40 - Transaction Rollback
40001 E ERRCODE_T_R_SERIALIZATION_FAILURE serialization_failure
40003 E ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN statement_completion_unknown
40P01 E ERRCODE_T_R_DEADLOCK_DETECTED deadlock_detected
+40P02 E ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE synchronous_replay_not_available
Section: Class 42 - Syntax Error or Access Rule Violation
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 82e54c084b8..1832bdf4de0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1647,6 +1647,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+ gettext_noop("Enables synchronous replay."),
+ NULL
+ },
+ &synchronous_replay,
+ false,
+ NULL, NULL, NULL
+ },
+
+ {
{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
NULL
@@ -2885,6 +2895,28 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &synchronous_replay_max_lag,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &synchronous_replay_lease_time,
+ 5000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3567,6 +3599,17 @@ static struct config_string ConfigureNamesString[] =
},
{
+ {"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("List of names of potential synchronous replay standbys."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &synchronous_replay_standby_names,
+ "*",
+ check_synchronous_replay_standby_names, NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2b1ebb797ec..e6dbcb58bbd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -250,6 +250,17 @@
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
+#synchronous_replay_max_lag = 0s # maximum replication delay to tolerate from
+ # standbys before dropping them from the synchronous
+ # replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s # how long individual leases granted to
+ # synchronous replay standbys should last; should be at least
+ # 4 times the max possible clock skew
+
+#synchronous_replay_standby_names = '*' # standby servers that can join the
+ # synchronous replay set; '*' = all
+
# - Standby Servers -
# These settings are ignored on a master server.
@@ -279,6 +290,14 @@
#max_logical_replication_workers = 4 # taken from max_worker_processes
#max_sync_workers_per_subscription = 2 # taken from max_logical_replication_workers
+# - All Servers -
+
+#synchronous_replay = off # "on" in any pair of consecutive
+ # transactions guarantees that the second
+ # can see the first (even if the second
+ # is run on a standby), or will raise an
+ # error to report that the standby is
+ # unavailable for synchronous replay
#------------------------------------------------------------------------------
# QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 08a08c8e8fc..55aef58fcd2 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -332,6 +334,17 @@ GetTransactionSnapshot(void)
"cannot take query snapshot during a parallel operation");
/*
+ * In synchronous_replay mode on a standby, check if we have definitely
+ * applied WAL for every COMMIT that returned successfully on the
+ * primary.
+ */
+ if (synchronous_replay && RecoveryInProgress() &&
+ !WalRcvSyncReplayAvailable())
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+ errmsg("standby is not available for synchronous replay")));
+
+ /*
* In transaction-snapshot mode, the first snapshot must live until
* end of xact regardless of what the caller does with it, so we must
* make a copy of it rather than returning CurrentSnapshotData
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 6811a55e764..02eaf97247f 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -117,7 +117,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
/*
@@ -150,6 +150,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
startpos = output_written_lsn;
last_written_lsn = output_written_lsn;
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 15932c60b5a..501ecc849d1 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -325,7 +325,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
static bool
sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
{
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
replybuf[len] = 'r';
@@ -343,6 +343,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
{
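To make the widened reply message concrete, here is a standalone sketch
(illustration only, not part of the patch) of the standby reply ('r') layout
with the new trailing replyTo field; put_int64 below is a stand-in for
fe_sendint64/pq_sendint64 and all LSN/time values are dummies.

#include <stdint.h>
#include <stdio.h>

/* Stand-in for fe_sendint64(): write a 64-bit value in network byte order. */
static void
put_int64(char *buf, int64_t v)
{
	int			i;

	for (i = 0; i < 8; i++)
		buf[i] = (char) (v >> (56 - i * 8));
}

int
main(void)
{
	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
	int			len = 0;

	replybuf[len] = 'r';			len += 1;	/* message type */
	put_int64(&replybuf[len], 0);	len += 8;	/* write LSN */
	put_int64(&replybuf[len], 0);	len += 8;	/* flush LSN */
	put_int64(&replybuf[len], 0);	len += 8;	/* apply LSN */
	put_int64(&replybuf[len], 0);	len += 8;	/* send time */
	replybuf[len] = 0;				len += 1;	/* replyRequested */
	put_int64(&replybuf[len], -1);	len += 8;	/* replyTo: keepalive serial, or -1 */

	printf("reply message is %d bytes\n", len);
	return 0;
}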
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 8b33b4e0ea7..106f87989fb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2832,7 +2832,7 @@ DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f
DESCR("statistics: information about currently active backends");
DATA(insert OID = 3318 ( pg_stat_get_progress_info PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 3317 ( pg_stat_get_wal_receiver PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
DESCR("statistics: information about WAL receiver");
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6bffe63ad6b..69e8c5bbc1b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -811,6 +811,7 @@ typedef enum
WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
WAIT_EVENT_SAFE_SNAPSHOT,
WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_SYNC_REPLAY,
WAIT_EVENT_LOGICAL_SYNC_DATA,
WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
} WaitEventIPC;
@@ -825,7 +826,8 @@ typedef enum
{
WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
WAIT_EVENT_PG_SLEEP,
- WAIT_EVENT_RECOVERY_APPLY_DELAY
+ WAIT_EVENT_RECOVERY_APPLY_DELAY,
+ WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
} WaitEventTimeout;
/* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ceafe2cbea1..e2bc88f7c23 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "utils/guc.h"
+#include "utils/timestamp.h"
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
#define SYNC_REP_WAIT_WRITE 0
#define SYNC_REP_WAIT_FLUSH 1
#define SYNC_REP_WAIT_APPLY 2
+#define SYNC_REP_WAIT_SYNC_REPLAY 3
-#define NUM_SYNC_REP_WAIT_MODE 3
+#define NUM_SYNC_REP_WAIT_MODE 4
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
@@ -36,6 +38,12 @@
#define SYNC_REP_PRIORITY 0
#define SYNC_REP_QUORUM 1
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
/*
* Struct for the configuration of synchronous replication.
*
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
/* called by wal sender */
extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
/* called by wal sender and user backend */
extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
/* called by checkpointer */
extern void SyncRepUpdateSyncStandbysDefined(void);
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
/* GUC infrastructure */
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9a8b2e207ec..bbd7ffaa705 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
TimeLineID receivedTLI;
/*
+ * syncReplayLease is the time until which the primary has authorized this
+ * standby to consider itself available for synchronous_replay mode, or 0
+ * for not authorized.
+ */
+ TimestampTz syncReplayLease;
+
+ /*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
* receivedUpto before the last flush to disk. Startup process can use
@@ -298,4 +305,6 @@ extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
+extern bool WalRcvSyncReplayAvailable(void);
+
#endif /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 17c68cba235..35a7fab6733 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
WALSNDSTATE_STOPPING
} WalSndState;
+typedef enum SyncReplayState
+{
+ SYNC_REPLAY_UNAVAILABLE = 0,
+ SYNC_REPLAY_JOINING,
+ SYNC_REPLAY_AVAILABLE,
+ SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
/*
* Each walsender has a WalSnd struct in shared memory.
*
@@ -60,6 +68,10 @@ typedef struct WalSnd
TimeOffset flushLag;
TimeOffset applyLag;
+ /* Synchronous replay state for this walsender. */
+ SyncReplayState syncReplayState;
+ TimestampTz revokingUntil;
+
/* Protects shared variables shown above. */
slock_t mutex;
@@ -101,6 +113,14 @@ typedef struct
*/
bool sync_standbys_defined;
+ /*
+ * Until when must commits in synchronous replay stall? This is used to
+ * wait for synchronous replay leases to expire when a walsender exists
+ * uncleanly, and we must stall synchronous replay commits until we're
+ * sure that the remote server's lease has expired.
+ */
+ TimestampTz revokingUntil;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2e42b9ec05f..9df755a72ad 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1859,9 +1859,10 @@ pg_stat_replication| SELECT s.pid,
w.flush_lag,
w.replay_lag,
w.sync_priority,
- w.sync_state
+ w.sync_state,
+ w.sync_replay
FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
- JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+ JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_ssl| SELECT s.pid,
s.ssl,
On Mon, Jul 31, 2017 at 5:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Here is a version to put it back.
Rebased after conflicting commit 030273b7. Now using format-patch
with a commit message to keep track of review/discussion history.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
synchronous-replay-v3.patch (application/octet-stream)
From ec07337c067d6ba59c74b6205438fb7b1d1f4615 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Introduce synchronous replay mode to avoid stale reads on hot
standbys.
While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability. When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.
Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.
To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02. It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.
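For illustration only (not part of the patch): a minimal libpq sketch of the
client-side pattern described above, assuming hypothetical connection strings
and a hypothetical table t(x int). tx1 writes on the primary with
synchronous_replay = on; tx2 then reads on a standby with the same setting
and either sees tx1's row or fails with SQLSTATE 40P02, in which case the
client should route the read to another server.

#include <libpq-fe.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	PGconn	   *primary = PQconnectdb("host=primary dbname=postgres");
	PGconn	   *standby = PQconnectdb("host=standby1 dbname=postgres");
	PGresult   *res;

	/* tx1: write on the primary under synchronous replay */
	PQclear(PQexec(primary, "SET synchronous_replay = on"));
	PQclear(PQexec(primary, "INSERT INTO t VALUES (42)"));

	/* tx2: read on a standby under synchronous replay */
	PQclear(PQexec(standby, "SET synchronous_replay = on"));
	res = PQexec(standby, "SELECT x FROM t WHERE x = 42");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

		if (sqlstate != NULL && strcmp(sqlstate, "40P02") == 0)
			fprintf(stderr, "standby unavailable for synchronous replay; retry elsewhere\n");
		else
			fprintf(stderr, "error: %s", PQerrorMessage(standby));
	}
	PQclear(res);

	PQfinish(primary);
	PQfinish(standby);
	return 0;
}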
Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.
Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
Joel Jacobson, Heikki Linnakangas, Michael Paquier,
Robert Haas, Ants Aasma
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
doc/src/sgml/config.sgml | 90 +++++
doc/src/sgml/high-availability.sgml | 126 ++++++-
doc/src/sgml/monitoring.sgml | 11 +
src/backend/access/transam/xact.c | 2 +-
src/backend/catalog/system_views.sql | 3 +-
src/backend/postmaster/pgstat.c | 6 +
src/backend/replication/logical/worker.c | 1 +
src/backend/replication/syncrep.c | 502 +++++++++++++++++++++-----
src/backend/replication/walreceiver.c | 82 ++++-
src/backend/replication/walreceiverfuncs.c | 19 +
src/backend/replication/walsender.c | 367 +++++++++++++++++--
src/backend/utils/errcodes.txt | 1 +
src/backend/utils/misc/guc.c | 43 +++
src/backend/utils/misc/postgresql.conf.sample | 19 +
src/backend/utils/time/snapmgr.c | 13 +
src/bin/pg_basebackup/pg_recvlogical.c | 4 +-
src/bin/pg_basebackup/receivelog.c | 4 +-
src/include/catalog/pg_proc.h | 2 +-
src/include/pgstat.h | 6 +-
src/include/replication/syncrep.h | 16 +-
src/include/replication/walreceiver.h | 9 +
src/include/replication/walsender_private.h | 20 +
src/test/regress/expected/rules.out | 5 +-
23 files changed, 1203 insertions(+), 148 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c33d6a03492..3c9b2fba6c0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2929,6 +2929,36 @@ include_dir 'conf.d'
across the cluster without problems if that is required.
</para>
+ <sect2 id="runtime-config-replication-all">
+ <title>All Servers</title>
+ <para>
+ These parameters can be set on the primary or any standby.
+ </para>
+ <variablelist>
+ <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+ <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables causal consistency between transactions run on different
+ servers. A transaction that is run on a standby
+ with <varname>synchronous_replay</> set to <literal>on</> is
+ guaranteed either to see the effects of all completed transactions
+ run on the primary with the setting on, or to receive an error
+ "standby is not available for synchronous replay". Note that both
+ transactions involved in a causal dependency (a write on the primary
+ followed by a read on any server which must see the write) must be
+ run with the setting on. See <xref linkend="synchronous-replay"> for
+ more details.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-replication-sender">
<title>Sending Server(s)</title>
@@ -3239,6 +3269,66 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><varname>synchronous_replay_max_lag</varname>
+ (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_max_lag</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the maximum replay lag the primary will tolerate from a
+ standby before dropping it from the synchronous replay set.
+ </para>
+ <para>
+ This must be set to a value which is at least 4 times the maximum
+ possible difference in system clocks between the primary and standby
+ servers, as described in <xref linkend="synchronous-replay">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_lease_time</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the duration of 'leases' sent by the primary server to
+ standbys granting them the right to run synchronous replay queries for
+ a limited time. This affects the rate at which replacement leases
+ must be sent and the wait time if contact is lost with a standby, as
+ described in <xref linkend="synchronous-replay">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous-replay-standby-names">
+ <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_standby_names</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a comma-separated list of standby names that can support
+ <firstterm>synchronous replay</>, as described in
+ <xref linkend="synchronous-replay">. Follows the same convention
+ as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</></>.
+ The default is <literal>*</>, matching all standbys.
+ </para>
+ <para>
+ This setting has no effect if <varname>synchronous_replay_max_lag</>
+ is not set.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 138bdf2a75d..54e292f7fbd 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1127,7 +1127,7 @@ primary_slot_name = 'node_a_slot'
cause each commit to wait until the current synchronous standbys report
that they have replayed the transaction, making it visible to user
queries. In simple cases, this allows for load balancing with causal
- consistency.
+ consistency. See also <xref linkend="synchronous-replay">.
</para>
<para>
@@ -1325,6 +1325,119 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="synchronous-replay">
+ <title>Synchronous replay</title>
+ <indexterm>
+ <primary>synchronous replay</primary>
+ <secondary>in standby</secondary>
+ </indexterm>
+
+ <para>
+ The synchronous replay feature allows read-only queries to run on hot
+ standby servers without exposing stale data to the client, providing a
+ form of causal consistency. Transactions can run on any standby with the
+ following guarantee about the visibility of preceding transactions: If you
+ set <varname>synchronous_replay</> to <literal>on</> in any pair of
+ consecutive transactions tx1, tx2 where tx2 begins after tx1 successfully
+ returns, then tx2 will either see tx1 or fail with a new error "standby is
+ not available for synchronous replay", no matter which server it runs on.
+ Although the guarantee is expressed in terms of two individual
+ transactions, the GUC can also be set at session, role or system level to
+ make the guarantee hold generally, allowing for load balancing of applications
+ that were not designed with load balancing in mind.
+ </para>
+
+ <para>
+ In order to enable the feature, <varname>synchronous_replay_max_lag</>
+ must be set to a non-zero value on the primary server. The
+ GUC <varname>synchronous_replay_standby_names</> can be used to limit the
+ set of standbys that can join the dynamic set of synchronous replay
+ standbys by providing a comma-separated list of application names. By
+ default, all standbys are candidates, if the feature is enabled.
+ </para>
+
+ <para>
+ The current set of servers that the primary considers to be available for
+ synchronous replay can be seen in
+ the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</></>
+ view. Administrators, applications and load balancing middleware can use
+ this view to discover standbys that can currently handle synchronous
+ replay transactions without raising the error. Since that information is
+ only an instantaneous snapshot, clients should still be prepared for
+ the error to be raised at any time, and consider redirecting transactions
+ to another standby.
+ </para>
+
+ <para>
+ The advantages of the synchronous replay feature over simply
+ setting <varname>synchronous_commit</> to <literal>remote_apply</> are:
+ <orderedlist>
+ <listitem>
+ <para>
+ It provides certainty about exactly which standbys can see a
+ transaction.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It places a configurable limit on how much replay lag (and therefore
+ delay at commit time) the primary tolerates from standbys before it
+ drops them from the dynamic set of standbys it waits for.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It upholds the synchronous replay guarantee during the transitions that
+ occur when new standbys are added or removed from the set of standbys,
+ including scenarios where contact has been lost between the primary
+ and standbys but the standby is still alive and running client
+ queries.
+ </para>
+ </listitem>
+ </orderedlist>
+ </para>
+
+ <para>
+ The protocol used to uphold the guarantee even in the case of network
+ failure depends on the system clocks of the primary and standby servers
+ being synchronized, with an allowance for a difference up to one quarter
+ of <varname>synchronous_replay_lease_time</>. For example,
+ if <varname>synchronous_replay_lease_time</> is set to <literal>5s</>,
+ then the clocks must not be more than 1.25 seconds apart for the guarantee
+ to be upheld reliably during transitions. The ubiquity of the Network
+ Time Protocol (NTP) on modern operating systems and the availability of high
+ quality time servers make it possible to choose a tolerance significantly
+ higher than the maximum expected clock difference. An effort is
+ nevertheless made to detect and report misconfigured and faulty systems
+ with clock differences greater than the configured tolerance.
+ </para>
+
+ <note>
+ <para>
+ Current hardware clocks, NTP implementations and public time servers are
+ unlikely to allow the system clocks to differ more than tens or hundreds
+ of milliseconds, and systems synchronized with dedicated local time
+ servers may be considerably more accurate, but you should only consider
+ setting <varname>synchronous_replay_lease_time</> below the default of 5
+ seconds (allowing up to 1.25 seconds of clock difference) after
+ researching your time synchronization infrastructure thoroughly.
+ </para>
+ </note>
+
+ <note>
+ <para>
+ While similar to synchronous commit in the sense that both involve the
+ primary server waiting for responses from standby servers, the
+ synchronous replay feature is not concerned with avoiding data loss. A
+ primary configured for synchronous replay will drop all standbys that
+ stop responding or replay too slowly from the dynamic set that it waits
+ for, so you should consider configuring both synchronous replication and
+ synchronous replay if you need data loss avoidance guarantees and causal
+ consistency guarantees for load balancing.
+ </para>
+ </note>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous archiving in standby</title>
@@ -1673,7 +1786,16 @@ if (!triggered)
so there will be a measurable delay between primary and standby. Running the
same query nearly simultaneously on both primary and standby might therefore
return differing results. We say that data on the standby is
- <firstterm>eventually consistent</firstterm> with the primary. Once the
+ <firstterm>eventually consistent</firstterm> with the primary by default.
+ The data visible to a transaction running on a standby can be
+ made <firstterm>causally consistent</> with respect to a transaction that
+ has completed on the primary by setting <varname>synchronous_replay</>
+ to <literal>on</> in both transactions. For more details,
+ see <xref linkend="synchronous-replay">.
+ </para>
+
+ <para>
+ Once the
commit record for a transaction is replayed on the standby, the changes
made by that transaction will be visible to any new snapshots taken on
the standby. Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 12d56282669..076080bcdf1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1822,6 +1822,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</itemizedlist>
</entry>
</row>
+ <row>
+ <entry><structfield>sync_replay</></entry>
+ <entry><type>text</></entry>
+ <entry>Synchronous replay state of this standby server. This field will be
+ non-null only if <varname>synchronous_replay_max_lag</> is set. If a standby is
+ in <literal>available</> state, then it can currently serve synchronous replay
+ queries. If it is not replaying fast enough or not responding to
+ keepalive messages, it will be in <literal>unavailable</> state, and if
+ it is currently transitioning to availability it will be
+ in <literal>joining</> state for a short time.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 50c3c3b5e5e..4c4cbb5389b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5158,7 +5158,7 @@ XactLogCommitRecord(TimestampTz commit_time,
* Check if the caller would like to ask standbys for immediate feedback
* once this commit is applied.
*/
- if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+ if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index dc40cde4240..819e02fc880 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -732,7 +732,8 @@ CREATE VIEW pg_stat_replication AS
W.flush_lag,
W.replay_lag,
W.sync_priority,
- W.sync_state
+ W.sync_state,
+ W.sync_replay
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1f75e2e97d0..6c5e1ee97fa 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3621,6 +3621,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_SYNC_REPLAY:
+ event_name = "SyncReplay";
+ break;
/* no default case, so that compiler will warn */
}
@@ -3649,6 +3652,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
case WAIT_EVENT_RECOVERY_APPLY_DELAY:
event_name = "RecoveryApplyDelay";
break;
+ case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+ event_name = "SyncReplayLeaseRevoke";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7c2df576457..da6a4ff0463 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1307,6 +1307,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
pq_sendint64(reply_message, writepos); /* apply */
pq_sendint64(reply_message, now); /* sendTime */
pq_sendbyte(reply_message, requestReply); /* replyRequested */
+ pq_sendint64(reply_message, -1); /* replyTo */
elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 77e80f16123..f18321293f3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
-static int SyncRepWakeQueue(bool all, int mode);
+static int SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
XLogRecPtr *flushPtr,
@@ -129,6 +138,229 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
+ * Check if we can stop waiting for synchronous replay. We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1. All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2. All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for. The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we must wait because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+ int *waitingFor,
+ long *stallTimeMillis)
+{
+ TimestampTz now = GetCurrentTimestamp();
+ TimestampTz stallTime = 0;
+ int i;
+
+ /* Count how many joining/available nodes we are waiting for. */
+ *waitingFor = 0;
+
+ for (i = 0; i < max_wal_senders; ++i)
+ {
+ WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0)
+ {
+ /*
+ * We need to hold the spinlock to read LSNs, because we can't be
+ * sure they can be read atomically.
+ */
+ SpinLockAcquire(&walsnd->mutex);
+ if (walsnd->pid != 0)
+ {
+ switch (walsnd->syncReplayState)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ /* Nothing to wait for. */
+ break;
+ case SYNC_REPLAY_JOINING:
+ case SYNC_REPLAY_AVAILABLE:
+ /*
+ * We have to wait until this standby tells us that it has
+ * replayed the commit record.
+ */
+ if (walsnd->apply < XactCommitLSN)
+ ++*waitingFor;
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /*
+ * We have to hold up commits until this standby
+ * acknowledges that its lease was revoked, or we know the
+ * most recently sent lease has expired anyway, whichever
+ * comes first. One way or the other, we don't release
+ * until this standby has started raising an error for
+ * synchronous replay transactions.
+ */
+ if (walsnd->revokingUntil > now)
+ {
+ ++*waitingFor;
+ stallTime = Max(stallTime, walsnd->revokingUntil);
+ }
+ break;
+ }
+ }
+ SpinLockRelease(&walsnd->mutex);
+ }
+ }
+
+ /*
+ * If a walsender has exited uncleanly, then it writes its revoking wait
+ * time into a shared space before it gives up its WalSnd slot. So we
+ * have to wait for that too.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ if (WalSndCtl->revokingUntil > now)
+ {
+ long seconds;
+ int usecs;
+
+ /* Compute how long we have to wait, rounded up to nearest ms. */
+ TimestampDifference(now, WalSndCtl->revokingUntil,
+ &seconds, &usecs);
+ *stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+ }
+ else
+ *stallTimeMillis = 0;
+ LWLockRelease(SyncRepLock);
+
+ /* We are done if we are not waiting for any nodes or stalls. */
+ return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all standbys in "available" and "joining" states to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked. By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ long stallTimeMillis;
+ int waitingFor;
+ char *ps_display_buffer = NULL;
+
+ for (;;)
+ {
+ /* Reset latch before checking state. */
+ ResetLatch(MyLatch);
+
+ /*
+ * Join the queue to be woken up if any synchronous replay
+ * joining/available standby applies XactCommitLSN or the set of
+ * synchronous replay standbys changes (if we aren't already in the
+ * queue). We don't actually know if we need to wait for any peers to
+ * reach the target LSN yet, but we have to register just in case
+ * before checking the walsenders' state to avoid a race condition
+ * that could occur if we did it after calling
+ * SyncReplayCommitCanReturn. (SyncRepWaitForLSN doesn't have
+ * to do this because it can check the highest-seen LSN in
+ * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+ * lock as the queues. We can't do that here, because there is no
+ * single highest-seen LSN that is useful. We must check
+ * walsnd->apply for all relevant walsenders. Therefore we must
+ * register for notifications first, so that we can be notified via
+ * our latch of any standby applying the LSN we're interested in after
+ * we check but before we start waiting, or we could wait forever for
+ * something that has already happened.)
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ if (MyProc->syncRepState != SYNC_REP_WAITING)
+ {
+ MyProc->waitLSN = XactCommitLSN;
+ MyProc->syncRepState = SYNC_REP_WAITING;
+ SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+ Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+ }
+ LWLockRelease(SyncRepLock);
+
+ /* Check if we're done. */
+ if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+ &stallTimeMillis))
+ {
+ SyncRepCancelWait();
+ break;
+ }
+
+ Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+ /* If we aren't actually waiting for any standbys, leave the queue. */
+ if (waitingFor == 0)
+ SyncRepCancelWait();
+
+ /* Update the ps title. */
+ if (update_process_title)
+ {
+ char buffer[80];
+
+ /* Remember the old value if this is our first update. */
+ if (ps_display_buffer == NULL)
+ {
+ int len;
+ const char *ps_display = get_ps_display(&len);
+
+ ps_display_buffer = palloc(len + 1);
+ memcpy(ps_display_buffer, ps_display, len);
+ ps_display_buffer[len] = '\0';
+ }
+
+ snprintf(buffer, sizeof(buffer),
+ "waiting for %d peer(s) to apply %X/%X%s",
+ waitingFor,
+ (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+ stallTimeMillis > 0 ? " (revoking)" : "");
+ set_ps_display(buffer, false);
+ }
+
+ /* Check if we need to exit early due to postmaster death etc. */
+ if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+ break;
+
+ /*
+ * If we are still waiting for peers, then we wait for any joining or
+ * available peer to reach the LSN (or possibly stop being in one of
+ * those states or go away).
+ *
+ * If not, there must be a non-zero stall time, so we wait for that to
+ * elapse.
+ */
+ if (waitingFor > 0)
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+ WAIT_EVENT_SYNC_REPLAY);
+ else
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+ stallTimeMillis,
+ WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+ }
+
+ /* There is no way out of the loop that could leave us in the queue. */
+ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+ MyProc->waitLSN = 0;
+
+ /* Restore the ps display. */
+ if (ps_display_buffer != NULL)
+ {
+ set_ps_display(ps_display_buffer, false);
+ pfree(ps_display_buffer);
+ }
+}
+
+/*
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
@@ -149,11 +381,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
const char *old_status;
int mode;
- /* Cap the level for anything other than commit to remote flush only. */
- if (commit)
- mode = SyncRepWaitMode;
- else
- mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+ /* Wait for synchronous replay, if configured. */
+ if (synchronous_replay)
+ SyncReplayWaitForLSN(lsn);
/*
* Fast exit if user has not requested sync replication, or there are no
@@ -169,6 +399,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+ /* Cap the level for anything other than commit to remote flush only. */
+ if (commit)
+ mode = SyncRepWaitMode;
+ else
+ mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
/*
* We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
* set. See SyncRepUpdateSyncStandbysDefined.
@@ -229,57 +465,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
break;
- /*
- * If a wait for synchronous replication is pending, we can neither
- * acknowledge the commit nor raise ERROR or FATAL. The latter would
- * lead the client to believe that the transaction aborted, which is
- * not true: it's already committed locally. The former is no good
- * either: the client has requested synchronous replication, and is
- * entitled to assume that an acknowledged commit is also replicated,
- * which might not be true. So in this case we issue a WARNING (which
- * some clients may be able to interpret) and shut off further output.
- * We do NOT reset ProcDiePending, so that the process will die after
- * the commit is cleaned up.
- */
- if (ProcDiePending)
- {
- ereport(WARNING,
- (errcode(ERRCODE_ADMIN_SHUTDOWN),
- errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
+ /* Check if we need to break early due to cancel/shutdown/death. */
+ if (SyncRepCheckForEarlyExit())
break;
- }
-
- /*
- * It's unclear what to do if a query cancel interrupt arrives. We
- * can't actually abort at this point, but ignoring the interrupt
- * altogether is not helpful, so we just terminate the wait with a
- * suitable warning.
- */
- if (QueryCancelPending)
- {
- QueryCancelPending = false;
- ereport(WARNING,
- (errmsg("canceling wait for synchronous replication due to user request"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- SyncRepCancelWait();
- break;
- }
-
- /*
- * If the postmaster dies, we'll probably never get an
- * acknowledgement, because all the wal sender processes will exit. So
- * just bail out.
- */
- if (!PostmasterIsAlive())
- {
- ProcDiePending = true;
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
- break;
- }
/*
* Wait on latch. Any condition that should wake us up will set the
@@ -402,14 +590,65 @@ SyncRepInitConfig(void)
}
/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ bool found = false;
+
+ /* If the feature is disabled, then no. */
+ if (synchronous_replay_max_lag == 0)
+ return false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(synchronous_replay_standby_names);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ /* GUC machinery will have already complained - no need to do again */
+ return false;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return found;
+}
+
+/*
* Update the LSNs on each queue based upon our latest state. This
* implements a simple policy of first-valid-sync-standby-releases-waiter.
*
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
* Other policies are possible, which would change what we do here and
* perhaps also which information we store as well.
*/
void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
XLogRecPtr writePtr;
@@ -423,13 +662,15 @@ SyncRepReleaseWaiters(void)
/*
* If this WALSender is serving a standby that is not on the list of
- * potential sync standbys then we have nothing to do. If we are still
- * starting up, still running base backup or the current flush position is
- * still invalid, then leave quickly also.
+ * potential sync standbys and not in a state that synchronous_replay waits
+ * for, then we have nothing to do. If we are still starting up, still
+ * running base backup or the current flush position is still invalid,
+ * then leave quickly also.
*/
- if (MyWalSnd->sync_standby_priority == 0 ||
- MyWalSnd->state < WALSNDSTATE_STREAMING ||
- XLogRecPtrIsInvalid(MyWalSnd->flush))
+ if (!am_syncreplay_blocker &&
+ (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING ||
+ XLogRecPtrIsInvalid(MyWalSnd->flush)))
{
announce_next_takeover = true;
return;
@@ -467,9 +708,10 @@ SyncRepReleaseWaiters(void)
/*
* If the number of sync standbys is less than requested or we aren't
- * managing a sync standby then just leave.
+ * managing a sync standby or a standby in a synchronous replay state that
+ * blocks commits, then just leave.
*/
- if (!got_recptr || !am_sync)
+ if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = !am_sync;
@@ -478,24 +720,36 @@ SyncRepReleaseWaiters(void)
/*
* Set the lsn first so that when we wake backends they will release up to
- * this location.
+ * this location, for backends waiting for synchronous commit.
*/
- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ if (got_recptr && am_sync)
{
- walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
- numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
- numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
- numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+ if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+ numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+ numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+ numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+ }
}
+ /*
+ * Wake backends that are waiting for synchronous_replay, if this walsender
+ * manages a standby that is in synchronous replay 'available' or 'joining'
+ * state.
+ */
+ if (am_syncreplay_blocker)
+ SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+ MyWalSnd->apply);
+
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -993,9 +1247,8 @@ SyncRepGetStandbyPriority(void)
* Must hold SyncRepLock.
*/
static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
{
- volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
@@ -1012,7 +1265,7 @@ SyncRepWakeQueue(bool all, int mode)
/*
* Assume the queue is ordered by LSN
*/
- if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+ if (!all && lsn < proc->waitLSN)
return numprocs;
/*
@@ -1079,7 +1332,7 @@ SyncRepUpdateSyncStandbysDefined(void)
int i;
for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
- SyncRepWakeQueue(true, i);
+ SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
}
/*
@@ -1130,6 +1383,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
}
#endif
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+ /*
+ * If a wait for synchronous replication is pending, we can neither
+ * acknowledge the commit nor raise ERROR or FATAL. The latter would
+ * lead the client to believe that the transaction aborted, which is
+ * not true: it's already committed locally. The former is no good
+ * either: the client has requested synchronous replication, and is
+ * entitled to assume that an acknowledged commit is also replicated,
+ * which might not be true. So in this case we issue a WARNING (which
+ * some clients may be able to interpret) and shut off further output.
+ * We do NOT reset ProcDiePending, so that the process will die after
+ * the commit is cleaned up.
+ */
+ if (ProcDiePending)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * It's unclear what to do if a query cancel interrupt arrives. We
+ * can't actually abort at this point, but ignoring the interrupt
+ * altogether is not helpful, so we just terminate the wait with a
+ * suitable warning.
+ */
+ if (QueryCancelPending)
+ {
+ QueryCancelPending = false;
+ ereport(WARNING,
+ (errmsg("canceling wait for synchronous replication due to user request"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * If the postmaster dies, we'll probably never get an
+ * acknowledgement, because all the wal sender processes will exit. So
+ * just bail out.
+ */
+ if (!PostmasterIsAlive())
+ {
+ ProcDiePending = true;
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ return false;
+}
+
/*
* ===========================================================
* Synchronous Replication functions executed by any process
@@ -1199,6 +1510,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
SyncRepConfig = (SyncRepConfigData *) extra;
}
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+ char *rawstring;
+ List *elemlist;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(*newval);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ GUC_check_errdetail("List syntax is invalid.");
+ pfree(rawstring);
+ list_free(elemlist);
+ return false;
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return true;
+}
+
void
assign_synchronous_commit(int newval, void *extra)
{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ea9d21a46b3..14a971f9822 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/ipc.h"
@@ -139,9 +140,10 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *syncReplayLease);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -466,7 +468,7 @@ WalReceiverMain(void)
}
/* Let the master know that we received some data. */
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
/*
* If we've written some records, flush them to disk and
@@ -511,7 +513,7 @@ WalReceiverMain(void)
*/
walrcv->force_reply = false;
pg_memory_barrier();
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, -1);
}
}
if (rc & WL_POSTMASTER_DEATH)
@@ -569,7 +571,7 @@ WalReceiverMain(void)
}
}
- XLogWalRcvSendReply(requestReply, requestReply);
+ XLogWalRcvSendReply(requestReply, requestReply, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -874,6 +876,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
XLogRecPtr walEnd;
TimestampTz sendTime;
bool replyRequested;
+ TimestampTz syncReplayLease;
+ int64 messageNumber;
resetStringInfo(&incoming_message);
@@ -893,7 +897,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
dataStart = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, NULL);
buf += hdrlen;
len -= hdrlen;
@@ -903,7 +907,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
case 'k': /* Keepalive */
{
/* copy message to StringInfo */
- hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+ hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+ sizeof(char) + sizeof(int64);
if (len != hdrlen)
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -911,15 +916,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
appendBinaryStringInfo(&incoming_message, buf, hdrlen);
/* read the fields */
+ messageNumber = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
replyRequested = pq_getmsgbyte(&incoming_message);
+ syncReplayLease = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
/* If the primary requested a reply, send one immediately */
if (replyRequested)
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, messageNumber);
break;
}
default:
@@ -1082,7 +1089,7 @@ XLogWalRcvFlush(bool dying)
/* Also let the master know that we made some progress */
if (!dying)
{
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -1100,9 +1107,12 @@ XLogWalRcvFlush(bool dying)
* If 'requestReply' is true, requests the server to reply immediately upon
* receiving this message. This is used for heartbeats, when approaching
* wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should contain that message's number; otherwise it should be -1.
*/
static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
{
static XLogRecPtr writePtr = 0;
static XLogRecPtr flushPtr = 0;
@@ -1149,6 +1159,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
pq_sendint64(&reply_message, applyPtr);
pq_sendint64(&reply_message, GetCurrentTimestamp());
pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+ pq_sendint64(&reply_message, replyTo);
/* Send it */
elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1281,15 +1292,56 @@ XLogWalRcvSendHSFeedback(bool immed)
* Update shared memory status upon receiving a message from primary.
*
* 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary. 'syncReplayLease' is a pointer to the time
+ * until which the primary promises that this standby can safely claim to be
+ * causally consistent (0 if it cannot), or a NULL pointer for no change.
*/
static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *syncReplayLease)
{
WalRcvData *walrcv = WalRcv;
TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+ /* Sanity check for the syncReplayLease time. */
+ if (syncReplayLease != NULL && *syncReplayLease != 0)
+ {
+ /*
+ * Deduce max_clock_skew from the syncReplayLease and sendTime since
+ * we don't have access to the primary's GUC. The primary already
+ * subtracted 25% from synchronous_replay_lease_time to represent
+ * max_clock_skew, so we have 75%. A third of that will give us 25%.
+ */
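+ /*
+ * For example, with the default synchronous_replay_lease_time of 5000ms
+ * the lease arrives as sendTime + 3750ms, so diffMillis is 3750 and
+ * max_clock_skew below works out to 1250ms, ie 25% of the lease time.
+ */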
+ int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+ int64 max_clock_skew = diffMillis / 3;
+ if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+ max_clock_skew))
+ {
+ /*
+ * The primary's clock is more than max_clock_skew + network
+ * latency ahead of the standby's clock. (If the primary's clock
+ * is ahead of the standby's by more than max_clock_skew but by less
+ * than max_clock_skew plus the network latency, then there isn't
+ * much we can do to detect that; but it still seems useful to have
+ * this basic sanity check for wildly misconfigured servers.)
+ */
+ ereport(LOG,
+ (errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+ errhint("Check your servers' NTP configuration or equivalent.")));
+
+ syncReplayLease = NULL;
+ }
+ /*
+ * We could also try to detect cases where sendTime is more than
+ * max_clock_skew in the past according to the standby's clock, but
+ * that is indistinguishable from network latency/buffering, so we
+ * could produce misleading error messages; if we do nothing, the
+ * consequence is 'standby is not available for synchronous replay'
+ * errors which should cause the user to investigate.
+ */
+ }
+
/* Update shared-memory status */
SpinLockAcquire(&walrcv->mutex);
if (walrcv->latestWalEnd < walEnd)
@@ -1297,6 +1349,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
walrcv->latestWalEnd = walEnd;
walrcv->lastMsgSendTime = sendTime;
walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+ if (syncReplayLease != NULL)
+ walrcv->syncReplayLease = *syncReplayLease;
SpinLockRelease(&walrcv->mutex);
if (log_min_messages <= DEBUG2)
@@ -1334,7 +1388,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
* This is called by the startup process whenever interesting xlog records
* are applied, so that walreceiver can check if it needs to send an apply
* notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
*/
void
WalRcvForceReply(void)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 8ed7254b5c6..dec98eb48c8 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
#include "replication/walreceiver.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/guc.h"
#include "utils/timestamp.h"
WalRcvData *WalRcv = NULL;
@@ -373,3 +374,21 @@ GetReplicationTransferLatency(void)
return ms;
}
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+ WalRcvData *walrcv = WalRcv;
+ TimestampTz now = GetCurrentTimestamp();
+ bool result;
+
+ SpinLockAcquire(&walrcv->mutex);
+ result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+ SpinLockRelease(&walrcv->mutex);
+
+ return result;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9a2babef1e6..96ae40f9e3e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -167,9 +167,23 @@ static StringInfoData tmpbuf;
*/
static TimestampTz last_reply_timestamp = 0;
+static TimestampTz last_keepalive_timestamp = 0;
+
/* Have we sent a heartbeat message asking for reply, since last reply? */
static bool waiting_for_ping_response = false;
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
/*
* While streaming WAL in Copy mode, streamingDoneSending is set to true
* after we have sent CopyDone. We should not send any more CopyData messages
@@ -239,7 +253,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
static void WalSndKeepaliveIfNecessary(TimestampTz now);
static void WalSndCheckTimeOut(TimestampTz now);
static long WalSndComputeSleeptime(TimestampTz now);
@@ -281,6 +295,61 @@ InitWalSender(void)
}
/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+ if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+ {
+ /*
+ * We've lost contact with the standby, but it may still be alive. We
+ * can't let any committing synchronous_replay transactions return
+ * control until we've stalled for long enough for a zombie standby to
+ * start raising errors because its lease has expired. Because our
+ * WalSnd slot is going away, we need to use the shared
+ * WalSndCtl->revokingUntil variable.
+ */
+ elog(LOG,
+ "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+ application_name);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+ synchronous_replay_last_lease);
+ LWLockRelease(SyncRepLock);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+ if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+ {
+ /*
+ * The standby is shutting down, so it won't be running any more
+ * transactions. It is therefore safe to stop waiting for it without
+ * any kind of lease revocation protocol.
+ */
+ elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
* Clean up after an error.
*
* WAL sender processes don't use transactions like regular backends do.
@@ -308,7 +377,10 @@ WalSndErrorCleanup(void)
replication_active = false;
if (got_STOPPING || got_SIGUSR2)
+ {
+ PrepareUncleanExit();
proc_exit(0);
+ }
/* Revert back to startup state */
WalSndSetState(WALSNDSTATE_STARTUP);
@@ -320,6 +392,8 @@ WalSndErrorCleanup(void)
static void
WalSndShutdown(void)
{
+ PrepareUncleanExit();
+
/*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -1583,6 +1657,7 @@ ProcessRepliesIfAny(void)
if (r < 0)
{
/* unexpected error or EOF */
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1599,6 +1674,7 @@ ProcessRepliesIfAny(void)
resetStringInfo(&reply_message);
if (pq_getmessage(&reply_message, 0))
{
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1648,6 +1724,7 @@ ProcessRepliesIfAny(void)
* 'X' means that the standby is closing down the socket.
*/
case 'X':
+ PrepareCleanExit();
proc_exit(0);
default:
@@ -1745,9 +1822,11 @@ ProcessStandbyReplyMessage(void)
flushLag,
applyLag;
bool clearLagTimes;
+ int64 replyTo;
TimestampTz now;
static bool fullyAppliedLastTime = false;
+ static TimestampTz fullyAppliedSince = 0;
/* the caller already consumed the msgtype byte */
writePtr = pq_getmsgint64(&reply_message);
@@ -1755,6 +1834,7 @@ ProcessStandbyReplyMessage(void)
applyPtr = pq_getmsgint64(&reply_message);
(void) pq_getmsgint64(&reply_message); /* sendTime; not used ATM */
replyRequested = pq_getmsgbyte(&reply_message);
+ replyTo = pq_getmsgint64(&reply_message);
elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
(uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1769,17 +1849,17 @@ ProcessStandbyReplyMessage(void)
applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
/*
- * If the standby reports that it has fully replayed the WAL in two
- * consecutive reply messages, then the second such message must result
- * from wal_receiver_status_interval expiring on the standby. This is a
- * convenient time to forget the lag times measured when it last
- * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
- * until more WAL traffic arrives.
+ * If the standby reports that it has fully replayed the WAL for at least
+ * wal_receiver_status_interval, then let's clear the lag times that were
+ * measured when it last wrote/flushed/applied a WAL record. This way we
+ * avoid displaying stale lag data until more WAL traffic arrives.
*/
clearLagTimes = false;
if (applyPtr == sentPtr)
{
- if (fullyAppliedLastTime)
+ if (!fullyAppliedLastTime)
+ fullyAppliedSince = now;
+ else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
clearLagTimes = true;
fullyAppliedLastTime = true;
}
@@ -1795,8 +1875,53 @@ ProcessStandbyReplyMessage(void)
* standby.
*/
{
+ int next_sr_state = -1;
WalSnd *walsnd = MyWalSnd;
+ /* Handle synchronous replay state machine. */
+ if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+ {
+ bool replay_lag_acceptable;
+
+ /* Check if the lag is acceptable (includes -1 for caught up). */
+ if (applyLag < synchronous_replay_max_lag * 1000)
+ replay_lag_acceptable = true;
+ else
+ replay_lag_acceptable = false;
+
+ /* Figure out whether the state needs to change. */
+ switch (walsnd->syncReplayState)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ /* Can we join? */
+ if (replay_lag_acceptable)
+ next_sr_state = SYNC_REPLAY_JOINING;
+ break;
+ case SYNC_REPLAY_JOINING:
+ /* Are we still applying fast enough? */
+ if (replay_lag_acceptable)
+ {
+ /* Have we reached the join point yet? */
+ if (applyPtr >= synchronous_replay_joining_until)
+ next_sr_state = SYNC_REPLAY_AVAILABLE;
+ }
+ else
+ next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+ break;
+ case SYNC_REPLAY_AVAILABLE:
+ /* Are we still applying fast enough? */
+ if (!replay_lag_acceptable)
+ next_sr_state = SYNC_REPLAY_REVOKING;
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /* Has the revocation been acknowledged or timed out? */
+ if (replyTo == synchronous_replay_revoke_msgno ||
+ now >= walsnd->revokingUntil)
+ next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+ break;
+ }
+ }
+
SpinLockAcquire(&walsnd->mutex);
walsnd->write = writePtr;
walsnd->flush = flushPtr;
@@ -1807,11 +1932,55 @@ ProcessStandbyReplyMessage(void)
walsnd->flushLag = flushLag;
if (applyLag != -1 || clearLagTimes)
walsnd->applyLag = applyLag;
+ if (next_sr_state != -1)
+ walsnd->syncReplayState = next_sr_state;
+ if (next_sr_state == SYNC_REPLAY_REVOKING)
+ walsnd->revokingUntil = synchronous_replay_last_lease;
SpinLockRelease(&walsnd->mutex);
+
+ /*
+ * Post shmem-update actions for synchronous replay state transitions.
+ */
+ switch (next_sr_state)
+ {
+ case SYNC_REPLAY_JOINING:
+ /*
+ * Now that we've started waiting for this standby, we need to
+ * make sure that everything flushed before now has been applied
+ * before we move to available and issue a lease.
+ */
+ synchronous_replay_joining_until = GetFlushRecPtr();
+ ereport(LOG,
+ (errmsg("standby \"%s\" joining synchronous replay set...",
+ application_name)));
+ break;
+ case SYNC_REPLAY_AVAILABLE:
+ /* Issue a new lease to the standby. */
+ WalSndKeepalive(false);
+ ereport(LOG,
+ (errmsg("standby \"%s\" is available for synchronous replay",
+ application_name)));
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /* Revoke the standby's lease, and note the message number. */
+ synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+ ereport(LOG,
+ (errmsg("revoking synchronous replay lease for standby \"%s\"...",
+ application_name)));
+ break;
+ case SYNC_REPLAY_UNAVAILABLE:
+ ereport(LOG,
+ (errmsg("standby \"%s\" is no longer available for synchronous replay",
+ application_name)));
+ break;
+ default:
+ /* No change. */
+ break;
+ }
}
if (!am_cascading_walsender)
- SyncRepReleaseWaiters();
+ SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
/*
* Advance our local xmin horizon when the client confirmed a flush.
@@ -2001,33 +2170,52 @@ ProcessStandbyHSFeedbackMessage(void)
* If wal_sender_timeout is enabled we want to wake up in time to send
* keepalives and to abort the connection if wal_sender_timeout has been
* reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
*/
static long
WalSndComputeSleeptime(TimestampTz now)
{
long sleeptime = 10000; /* 10 s */
- if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+ if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+ am_potential_synchronous_replay_standby)
{
TimestampTz wakeup_time;
long sec_to_timeout;
int microsec_to_timeout;
- /*
- * At the latest stop sleeping once wal_sender_timeout has been
- * reached.
- */
- wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
-
- /*
- * If no ping has been sent yet, wakeup when it's time to do so.
- * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
- * the timeout passed without a response.
- */
- if (!waiting_for_ping_response)
+ if (am_potential_synchronous_replay_standby)
+ {
+ /*
+ * We need to keep replacing leases before they expire. We'll do
+ * that halfway through the lease time according to our clock, to
+ * allow for the standby's clock to be ahead of the primary's by
+ * 25% of synchronous_replay_lease_time.
+ */
+ wakeup_time =
+ TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ synchronous_replay_lease_time / 2);
+ }
+ else
+ {
+ /*
+ * At the latest stop sleeping once wal_sender_timeout has been
+ * reached.
+ */
wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ wal_sender_timeout);
+
+ /*
+ * If no ping has been sent yet, wakeup when it's time to do so.
+ * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+ * half of the timeout passed without a response.
+ */
+ if (!waiting_for_ping_response)
+ wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
+ }
/* Compute relative time until wakeup. */
TimestampDifference(now, wakeup_time,
@@ -2043,20 +2231,33 @@ WalSndComputeSleeptime(TimestampTz now)
/*
* Check whether there have been responses by the client within
* wal_sender_timeout and shutdown if not.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
*/
static void
WalSndCheckTimeOut(TimestampTz now)
{
TimestampTz timeout;
+ int allowed_time;
/* don't bail out if we're doing something that doesn't require timeouts */
if (last_reply_timestamp <= 0)
return;
- timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
+ /*
+ * If synchronous replay is configured, we use
+ * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+ * the time before an unresponsive synchronous replay standby is dropped.
+ */
+ if (am_potential_synchronous_replay_standby)
+ allowed_time = synchronous_replay_lease_time;
+ else
+ allowed_time = wal_sender_timeout;
- if (wal_sender_timeout > 0 && now >= timeout)
+ timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ allowed_time);
+ if (allowed_time > 0 && now >= timeout)
{
/*
* Since typically expiration of replication timeout means
@@ -2084,6 +2285,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
/* Report to pgstat that this process is running */
pgstat_report_activity(STATE_RUNNING, NULL);
+ /* Check if we are managing a potential synchronous replay standby. */
+ am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
@@ -2249,6 +2453,7 @@ InitWalSenderSlot(void)
walsnd->flushLag = -1;
walsnd->applyLag = -1;
walsnd->state = WALSNDSTATE_STARTUP;
+ walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
walsnd->latch = &MyProc->procLatch;
SpinLockRelease(&walsnd->mutex);
/* don't need the lock anymore */
@@ -3131,6 +3336,27 @@ WalSndGetStateString(WalSndState state)
return "UNKNOWN";
}
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+ switch (state)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ return "unavailable";
+ case SYNC_REPLAY_JOINING:
+ return "joining";
+ case SYNC_REPLAY_AVAILABLE:
+ return "available";
+ case SYNC_REPLAY_REVOKING:
+ return "revoking";
+ }
+ return "UNKNOWN";
+}
+
static Interval *
offset_to_interval(TimeOffset offset)
{
@@ -3150,7 +3376,7 @@ offset_to_interval(TimeOffset offset)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 11
+#define PG_STAT_GET_WAL_SENDERS_COLS 12
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -3204,6 +3430,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
int priority;
int pid;
WalSndState state;
+ SyncReplayState syncReplayState;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -3216,6 +3443,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
pid = walsnd->pid;
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ syncReplayState = walsnd->syncReplayState;
write = walsnd->write;
flush = walsnd->flush;
apply = walsnd->apply;
@@ -3298,6 +3526,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[10] = CStringGetTextDatum("potential");
+
+ values[11] =
+ CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3313,21 +3544,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* This function is used to send a keepalive message to standby.
* If requestReply is set, sets a flag in the message requesting the standby
* to send a message back to us, for heartbeat purposes.
+ * Return the serial number of the message that was sent.
*/
-static void
+static int64
WalSndKeepalive(bool requestReply)
{
+ TimestampTz synchronous_replay_lease;
+ TimestampTz now;
+
+ static int64 message_number = 0;
+
elog(DEBUG2, "sending replication keepalive");
+ /* Grant a synchronous replay lease if appropriate. */
+ now = GetCurrentTimestamp();
+ if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+ {
+ /* No lease granted, and any earlier lease is revoked. */
+ synchronous_replay_lease = 0;
+ }
+ else
+ {
+ /*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew. We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ *
+ * 1. If the standby's clock is slow (ie behind the primary's) by up
+ * to that much, then subtracting this amount makes sure the lease
+ * doesn't survive past the unadjusted expiry time according to the
+ * primary's clock.
+ *
+ * 2. If the standby's clock is fast (ie ahead of the primary's) by up
+ * to that much, then even after subtracting this amount there won't
+ * be any gaps between leases, since leases are reissued every time
+ * 50% of the lease time elapses (see WalSndKeepaliveIfNecessary and
+ * WalSndComputeSleeptime).
+ */
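+ /*
+ * For example, with the default synchronous_replay_lease_time of 5000ms,
+ * max_clock_skew is 1250ms: synchronous_replay_last_lease is set to
+ * now + 5000ms, while the expiry time actually sent to the standby is
+ * now + 3750ms.
+ */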
+ int max_clock_skew = synchronous_replay_lease_time / 4;
+
+ /* Compute and remember the expiry time of the lease we're granting. */
+ synchronous_replay_last_lease =
+ TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+ /* Adjust the version we send for clock skew. */
+ synchronous_replay_lease =
+ TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+ -max_clock_skew);
+ }
+
/* construct the message... */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'k');
+ pq_sendint64(&output_message, ++message_number);
pq_sendint64(&output_message, sentPtr);
- pq_sendint64(&output_message, GetCurrentTimestamp());
+ pq_sendint64(&output_message, now);
pq_sendbyte(&output_message, requestReply ? 1 : 0);
+ pq_sendint64(&output_message, synchronous_replay_lease);
/* ... and send it wrapped in CopyData */
pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+ return message_number;
}
/*
@@ -3342,23 +3621,35 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
*/
- if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
- return;
-
- if (waiting_for_ping_response)
- return;
+ if (!am_potential_synchronous_replay_standby)
+ {
+ if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+ return;
+ if (waiting_for_ping_response)
+ return;
+ }
/*
* If half of wal_sender_timeout has lapsed without receiving any reply
* from the standby, send a keep-alive message to the standby requesting
* an immediate reply.
+ *
+ * If synchronous replay has been configured, use
+ * synchronous_replay_lease_time to control keepalive intervals rather
+ * than wal_sender_timeout, so that we can keep replacing leases at the
+ * right frequency.
*/
- ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ if (am_potential_synchronous_replay_standby)
+ ping_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ synchronous_replay_lease_time / 2);
+ else
+ ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
if (now >= ping_time)
{
WalSndKeepalive(true);
waiting_for_ping_response = true;
+ last_keepalive_timestamp = now;
/* Try to flush pending output to the client */
if (pq_flush_if_writable() != 0)
@@ -3398,7 +3689,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
*/
new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
buffer_full = false;
- for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+ for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
{
if (new_write_head == LagTracker.read_heads[i])
buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 4f354717628..d1751f6e0c0 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -307,6 +307,7 @@ Section: Class 40 - Transaction Rollback
40001 E ERRCODE_T_R_SERIALIZATION_FAILURE serialization_failure
40003 E ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN statement_completion_unknown
40P01 E ERRCODE_T_R_DEADLOCK_DETECTED deadlock_detected
+40P02 E ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE synchronous_replay_not_available
Section: Class 42 - Syntax Error or Access Rule Violation
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 246fea8693b..63e0619bf23 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1647,6 +1647,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+ gettext_noop("Enables synchronous replay."),
+ NULL
+ },
+ &synchronous_replay,
+ false,
+ NULL, NULL, NULL
+ },
+
+ {
{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
NULL
@@ -2885,6 +2895,28 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &synchronous_replay_max_lag,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &synchronous_replay_lease_time,
+ 5000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3567,6 +3599,17 @@ static struct config_string ConfigureNamesString[] =
},
{
+ {"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("List of names of potential synchronous replay standbys."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &synchronous_replay_standby_names,
+ "*",
+ check_synchronous_replay_standby_names, NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index df5d2f3f22f..8b5276a44cd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -252,6 +252,17 @@
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
+#synchronous_replay_max_lag = 0s # maximum replication delay to tolerate from
+ # standbys before dropping them from the synchronous
+ # replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s # how long individual leases granted to
+ # synchronous replay standbys should last; should be at least
+ # 4 times the maximum possible clock skew
+
+#synchronous_replay_standby_names = '*' # standby servers that can join the
+ # synchronous replay set; '*' = all
+
# - Standby Servers -
# These settings are ignored on a master server.
@@ -282,6 +293,14 @@
# (change requires restart)
#max_sync_workers_per_subscription = 2 # taken from max_logical_replication_workers
+# - All Servers -
+
+#synchronous_replay = off # "on" in any pair of consecutive
+ # transactions guarantees that the second
+ # can see the first (even if the second
+ # is run on a standby), or will raise an
+ # error to report that the standby is
+ # unavailable for synchronous replay
#------------------------------------------------------------------------------
# QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 08a08c8e8fc..55aef58fcd2 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -332,6 +334,17 @@ GetTransactionSnapshot(void)
"cannot take query snapshot during a parallel operation");
/*
+ * In synchronous_replay mode on a standby, check if we have definitely
+ * applied WAL for any COMMIT that returned successfully on the
+ * primary.
+ */
+ if (synchronous_replay && RecoveryInProgress() &&
+ !WalRcvSyncReplayAvailable())
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+ errmsg("standby is not available for synchronous replay")));
+
+ /*
* In transaction-snapshot mode, the first snapshot must live until
* end of xact regardless of what the caller does with it, so we must
* make a copy of it rather than returning CurrentSnapshotData
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 6811a55e764..02eaf97247f 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -117,7 +117,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
/*
@@ -150,6 +150,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
startpos = output_written_lsn;
last_written_lsn = output_written_lsn;
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 888458f4a90..904bd605935 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -327,7 +327,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
static bool
sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
{
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
replybuf[len] = 'r';
@@ -345,6 +345,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
{
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 8b33b4e0ea7..106f87989fb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2832,7 +2832,7 @@ DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f
DESCR("statistics: information about currently active backends");
DATA(insert OID = 3318 ( pg_stat_get_progress_info PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 3317 ( pg_stat_get_wal_receiver PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
DESCR("statistics: information about WAL receiver");
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index cb05d9b81e5..312766a495e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -815,7 +815,8 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_SYNC_REPLAY
} WaitEventIPC;
/* ----------
@@ -828,7 +829,8 @@ typedef enum
{
WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
WAIT_EVENT_PG_SLEEP,
- WAIT_EVENT_RECOVERY_APPLY_DELAY
+ WAIT_EVENT_RECOVERY_APPLY_DELAY,
+ WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
} WaitEventTimeout;
/* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ceafe2cbea1..e2bc88f7c23 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "utils/guc.h"
+#include "utils/timestamp.h"
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
#define SYNC_REP_WAIT_WRITE 0
#define SYNC_REP_WAIT_FLUSH 1
#define SYNC_REP_WAIT_APPLY 2
+#define SYNC_REP_WAIT_SYNC_REPLAY 3
-#define NUM_SYNC_REP_WAIT_MODE 3
+#define NUM_SYNC_REP_WAIT_MODE 4
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
@@ -36,6 +38,12 @@
#define SYNC_REP_PRIORITY 0
#define SYNC_REP_QUORUM 1
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
/*
* Struct for the configuration of synchronous replication.
*
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
/* called by wal sender */
extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
/* called by wal sender and user backend */
extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
/* called by checkpointer */
extern void SyncRepUpdateSyncStandbysDefined(void);
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
/* GUC infrastructure */
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9a8b2e207ec..bbd7ffaa705 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
TimeLineID receivedTLI;
/*
+ * syncReplayLease is the time until which the primary has authorized this
+ * standby to consider itself available for synchronous_replay mode, or 0
+ * for not authorized.
+ */
+ TimestampTz syncReplayLease;
+
+ /*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
* receivedUpto before the last flush to disk. Startup process can use
@@ -298,4 +305,6 @@ extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
+extern bool WalRcvSyncReplayAvailable(void);
+
#endif /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 17c68cba235..35a7fab6733 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
WALSNDSTATE_STOPPING
} WalSndState;
+typedef enum SyncReplayState
+{
+ SYNC_REPLAY_UNAVAILABLE = 0,
+ SYNC_REPLAY_JOINING,
+ SYNC_REPLAY_AVAILABLE,
+ SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
/*
* Each walsender has a WalSnd struct in shared memory.
*
@@ -60,6 +68,10 @@ typedef struct WalSnd
TimeOffset flushLag;
TimeOffset applyLag;
+ /* Synchronous replay state for this walsender. */
+ SyncReplayState syncReplayState;
+ TimestampTz revokingUntil;
+
/* Protects shared variables shown above. */
slock_t mutex;
@@ -101,6 +113,14 @@ typedef struct
*/
bool sync_standbys_defined;
+ /*
+ * Until when must synchronous replay commits stall? This is used when a
+ * walsender exits uncleanly: commits must be held back until we are sure
+ * that the departed standby's lease has expired.
+ */
+ TimestampTz revokingUntil;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index d582bc9ee44..85be4978731 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1859,9 +1859,10 @@ pg_stat_replication| SELECT s.pid,
w.flush_lag,
w.replay_lag,
w.sync_priority,
- w.sync_state
+ w.sync_state,
+ w.sync_replay
FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
- JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+ JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_ssl| SELECT s.pid,
s.ssl,
--
2.13.2
On Thu, Aug 10, 2017 at 2:02 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Rebased after conflicting commit 030273b7. Now using format-patch
> with a commit message to keep track of review/discussion history.
TAP test 006_logical_decoding.pl failed with that version. I had
missed some places that know how to decode wire protocol messages I
modified. Fixed in the attached version.
It might be a good idea to consolidate the message encoding/decoding
logic into reusable routines, independently of this work.
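For what it's worth, here's a rough sketch of the kind of thing I have in
mind for the standby reply ('r') message -- hypothetical names, not part of
the attached patch, and the frontend tools in pg_basebackup would still need
fe_sendint64-based equivalents:

#include "postgres.h"

#include "access/xlogdefs.h"
#include "lib/stringinfo.h"
#include "libpq/pqformat.h"
#include "utils/timestamp.h"

/* Common in-memory form of a standby reply message. */
typedef struct StandbyReplyMessage
{
	XLogRecPtr	write;
	XLogRecPtr	flush;
	XLogRecPtr	apply;
	TimestampTz	sendTime;
	bool		replyRequested;
	int64		replyTo;	/* -1 if not a reply to a numbered message */
} StandbyReplyMessage;

/* Append the body of a standby reply ('r' byte already sent by caller). */
static void
EncodeStandbyReply(StringInfo buf, const StandbyReplyMessage *msg)
{
	pq_sendint64(buf, msg->write);
	pq_sendint64(buf, msg->flush);
	pq_sendint64(buf, msg->apply);
	pq_sendint64(buf, msg->sendTime);
	pq_sendbyte(buf, msg->replyRequested ? 1 : 0);
	pq_sendint64(buf, msg->replyTo);
}

/* Read the body of a standby reply ('r' byte already consumed by caller). */
static void
DecodeStandbyReply(StringInfo buf, StandbyReplyMessage *msg)
{
	msg->write = pq_getmsgint64(buf);
	msg->flush = pq_getmsgint64(buf);
	msg->apply = pq_getmsgint64(buf);
	msg->sendTime = pq_getmsgint64(buf);
	msg->replyRequested = pq_getmsgbyte(buf) != 0;
	msg->replyTo = pq_getmsgint64(buf);
}

Then walreceiver.c, walsender.c and logical/worker.c could share one
definition of the field order instead of each open-coding it.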
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
synchronous-replay-v4.patch (application/octet-stream)
From 64848ce287f5c7ffa78d37c5bc9fbcceb1b2a25d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Introduce synchronous replay mode to avoid stale reads on hot
standbys.
While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability. When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.
Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.
To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02. It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.
Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.
Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
Joel Jacobson, Heikki Linnakangas, Michael Paquier, Robert Haas,
Ants Aasma
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
doc/src/sgml/config.sgml | 90 +++++
doc/src/sgml/high-availability.sgml | 126 ++++++-
doc/src/sgml/monitoring.sgml | 11 +
src/backend/access/transam/xact.c | 2 +-
src/backend/catalog/system_views.sql | 3 +-
src/backend/postmaster/pgstat.c | 6 +
src/backend/replication/logical/worker.c | 2 +
src/backend/replication/syncrep.c | 502 +++++++++++++++++++++-----
src/backend/replication/walreceiver.c | 82 ++++-
src/backend/replication/walreceiverfuncs.c | 19 +
src/backend/replication/walsender.c | 367 +++++++++++++++++--
src/backend/utils/errcodes.txt | 1 +
src/backend/utils/misc/guc.c | 43 +++
src/backend/utils/misc/postgresql.conf.sample | 19 +
src/backend/utils/time/snapmgr.c | 13 +
src/bin/pg_basebackup/pg_recvlogical.c | 6 +-
src/bin/pg_basebackup/receivelog.c | 5 +-
src/include/catalog/pg_proc.h | 2 +-
src/include/pgstat.h | 6 +-
src/include/replication/syncrep.h | 16 +-
src/include/replication/walreceiver.h | 9 +
src/include/replication/walsender_private.h | 20 +
src/test/regress/expected/rules.out | 5 +-
23 files changed, 1207 insertions(+), 148 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5f59a382f18..77d565175ef 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2921,6 +2921,36 @@ include_dir 'conf.d'
across the cluster without problems if that is required.
</para>
+ <sect2 id="runtime-config-replication-all">
+ <title>All Servers</title>
+ <para>
+ These parameters can be set on the primary or any standby.
+ </para>
+ <variablelist>
+ <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+ <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables causal consistency between transactions run on different
+ servers. A transaction that is run on a standby
+ with <varname>synchronous_replay</> set to <literal>on</> is
+ guaranteed either to see the effects of all completed transactions
+ run on the primary with the setting on, or to receive an error
+ "standby is not available for synchronous replay". Note that both
+ transactions involved in a causal dependency (a write on the primary
+ followed by a read on any server which must see the write) must be
+ run with the setting on. See <xref linkend="synchronous-replay"> for
+ more details.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-replication-sender">
<title>Sending Server(s)</title>
@@ -3231,6 +3261,66 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><varname>synchronous_replay_max_lag</varname>
+ (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_max_lag</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the maximum replay lag the primary will tolerate from a
+ standby before dropping it from the synchronous replay set.
+ </para>
+ <para>
+ This must be set to a value which is at least 4 times the maximum
+ possible difference in system clocks between the primary and standby
+ servers, as described in <xref linkend="synchronous-replay">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_lease_time</> configuration
+ parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the duration of 'leases' sent by the primary server to
+ standbys granting them the right to run synchronous replay queries for
+ a limited time. This affects the rate at which replacement leases
+ must be sent and the wait time if contact is lost with a standby, as
+ described in <xref linkend="synchronous-replay">.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous-replay-standby-names">
+ <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+ <indexterm>
+ <primary><varname>synchronous_replay_standby_names</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a comma-separated list of standby names that can support
+ <firstterm>synchronous replay</>, as described in
+ <xref linkend="synchronous-replay">. Follows the same convention
+ as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</></>.
+ The default is <literal>*</>, matching all standbys.
+ </para>
+ <para>
+ This setting has no effect if <varname>synchronous_replay_max_lag</>
+ is not set.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6c54fbd40d8..41e55894ad7 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1156,7 +1156,7 @@ primary_slot_name = 'node_a_slot'
cause each commit to wait until the current synchronous standbys report
that they have replayed the transaction, making it visible to user
queries. In simple cases, this allows for load balancing with causal
- consistency.
+ consistency. See also <xref linkend="synchronous-replay">.
</para>
<para>
@@ -1354,6 +1354,119 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</sect3>
</sect2>
+ <sect2 id="synchronous-replay">
+ <title>Synchronous replay</title>
+ <indexterm>
+ <primary>synchronous replay</primary>
+ <secondary>in standby</secondary>
+ </indexterm>
+
+ <para>
+ The synchronous replay feature allows read-only queries to run on hot
+ standby servers without exposing stale data to the client, providing a
+ form of causal consistency. Transactions can run on any standby with the
+ following guarantee about the visibility of preceding transactions: If you
+ set <varname>synchronous_replay</> to <literal>on</> in any pair of
+ consecutive transactions tx1, tx2 where tx2 begins after tx1 successfully
+ returns, then tx2 will either see tx1 or fail with a new error "standby is
+ not available for synchronous replay", no matter which server it runs on.
+ Although the guarantee is expressed in terms of two individual
+ transactions, the GUC can also be set at session, role or system level to
+ make the guarantee generally, allowing for load balancing of applications
+ that were not designed with load balancing in mind.
+ </para>
+
+ <para>
+ In order to enable the feature, <varname>synchronous_replay_max_lag</>
+ must be set to a non-zero value on the primary server. The
+ GUC <varname>synchronous_replay_standby_names</> can be used to limit the
+ set of standbys that can join the dynamic set of synchronous replay
+ standbys by providing a comma-separated list of application names. By
+ default, all standbys are candidates, if the feature is enabled.
+ </para>
+
+ <para>
+ The current set of servers that the primary considers to be available for
+ synchronous replay can be seen in
+ the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</></>
+ view. Administrators, applications and load balancing middleware can use
+ this view to discover standbys that can currently handle synchronous
+ replay transactions without raising the error. Since that information is
+ only an instantaneous snapshot, clients should still be prepared for
+ the error to be raised at any time, and consider redirecting transactions
+ to another standby.
+ </para>
+
+ <para>
+ The advantages of the synchronous replay feature over simply
+ setting <varname>synchronous_commit</> to <literal>remote_apply</> are:
+ <orderedlist>
+ <listitem>
+ <para>
+ It provides certainty about exactly which standbys can see a
+ transaction.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It places a configurable limit on how much replay lag (and therefore
+ delay at commit time) the primary tolerates from standbys before it
+ drops them from the dynamic set of standbys it waits for.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ It upholds the synchronous replay guarantee during the transitions that
+ occur when new standbys are added or removed from the set of standbys,
+ including scenarios where contact has been lost between the primary
+ and standbys but the standby is still alive and running client
+ queries.
+ </para>
+ </listitem>
+ </orderedlist>
+ </para>
+
+ <para>
+ The protocol used to uphold the guarantee even in the case of network
+ failure depends on the system clocks of the primary and standby servers
+ being synchronized, with an allowance for a difference up to one quarter
+ of <varname>synchronous_replay_lease_time</>. For example,
+ if <varname>synchronous_replay_lease_time</> is set to <literal>5s</>,
+ then the clocks must not be more than 1.25 seconds apart for the guarantee
+ to be upheld reliably during transitions. The ubiquity of the Network
+ Time Protocol (NTP) on modern operating systems and availability of high
+ quality time servers makes it possible to choose a tolerance significantly
+ higher than the maximum expected clock difference. An effort is
+ nevertheless made to detect and report misconfigured and faulty systems
+ with clock differences greater than the configured tolerance.
+ </para>
+
+ <note>
+ <para>
+ Current hardware clocks, NTP implementations and public time servers are
+ unlikely to allow the system clocks to differ more than tens or hundreds
+ of milliseconds, and systems synchronized with dedicated local time
+ servers may be considerably more accurate, but you should only consider
+ setting <varname>synchronous_replay_lease_time</> below the default of 5
+ seconds (allowing up to 1.25 seconds of clock difference) after
+ researching your time synchronization infrastructure thoroughly.
+ </para>
+ </note>
+
+ <note>
+ <para>
+ While similar to synchronous commit in the sense that both involve the
+ primary server waiting for responses from standby servers, the
+ synchronous replay feature is not concerned with avoiding data loss. A
+ primary configured for synchronous replay will drop all standbys that
+ stop responding or replay too slowly from the dynamic set that it waits
+ for, so you should consider configuring both synchronous replication and
+ synchronous replay if you need data loss avoidance guarantees and causal
+ consistency guarantees for load balancing.
+ </para>
+ </note>
+ </sect2>
+
<sect2 id="continuous-archiving-in-standby">
<title>Continuous archiving in standby</title>
@@ -1702,7 +1815,16 @@ if (!triggered)
so there will be a measurable delay between primary and standby. Running the
same query nearly simultaneously on both primary and standby might therefore
return differing results. We say that data on the standby is
- <firstterm>eventually consistent</firstterm> with the primary. Once the
+ <firstterm>eventually consistent</firstterm> with the primary by default.
+ The data visible to a transaction running on a standby can be
+ made <firstterm>causally consistent</> with respect to a transaction that
+ has completed on the primary by setting <varname>synchronous_replay</>
+ to <literal>on</> in both transactions. For more details,
+ see <xref linkend="synchronous-replay">.
+ </para>
+
+ <para>
+ Once the
commit record for a transaction is replayed on the standby, the changes
made by that transaction will be visible to any new snapshots taken on
the standby. Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 38bf63658ae..4cd4feea50f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1826,6 +1826,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
</itemizedlist>
</entry>
</row>
+ <row>
+ <entry><structfield>sync_replay</></entry>
+ <entry><type>text</></entry>
+ <entry>Synchronous replay state of this standby server. This field will be
+ non-null only if <varname>synchronous_replay_max_lag</> is set. If a standby is
+ in <literal>available</> state, then it can currently serve synchronous replay
+ queries. If it is not replaying fast enough or not responding to
+ keepalive messages, it will be in <literal>unavailable</> state, and if
+ it is currently transitioning to availability it will be
+ in <literal>joining</> state for a short time.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5e7e8122003..454da068b79 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5172,7 +5172,7 @@ XactLogCommitRecord(TimestampTz commit_time,
* Check if the caller would like to ask standbys for immediate feedback
* once this commit is applied.
*/
- if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+ if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index dc40cde4240..819e02fc880 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -732,7 +732,8 @@ CREATE VIEW pg_stat_replication AS
W.flush_lag,
W.replay_lag,
W.sync_priority,
- W.sync_state
+ W.sync_state,
+ W.sync_replay
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index accf302cf73..5a03b617fdb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3624,6 +3624,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_SYNC_REP:
event_name = "SyncRep";
break;
+ case WAIT_EVENT_SYNC_REPLAY:
+ event_name = "SyncReplay";
+ break;
/* no default case, so that compiler will warn */
}
@@ -3652,6 +3655,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
case WAIT_EVENT_RECOVERY_APPLY_DELAY:
event_name = "RecoveryApplyDelay";
break;
+ case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+ event_name = "SyncReplayLeaseRevoke";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 041f3873b93..214288c9925 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1100,6 +1100,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
TimestampTz timestamp;
bool reply_requested;
+ (void) pq_getmsgint64(&s); /* skip messageNumber */
end_lsn = pq_getmsgint64(&s);
timestamp = pq_getmsgint64(&s);
reply_requested = pq_getmsgbyte(&s);
@@ -1307,6 +1308,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
pq_sendint64(reply_message, writepos); /* apply */
pq_sendint64(reply_message, now); /* sendTime */
pq_sendbyte(reply_message, requestReply); /* replyRequested */
+ pq_sendint64(reply_message, -1); /* replyTo */
elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 8677235411c..cd596b0afb8 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
-static int SyncRepWakeQueue(bool all, int mode);
+static int SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
XLogRecPtr *flushPtr,
@@ -129,6 +138,229 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
+ * Check if we can stop waiting for synchronous replay. We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1. All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2. All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for.  The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we must wait because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+ int *waitingFor,
+ long *stallTimeMillis)
+{
+ TimestampTz now = GetCurrentTimestamp();
+ TimestampTz stallTime = 0;
+ int i;
+
+ /* Count how many joining/available nodes we are waiting for. */
+ *waitingFor = 0;
+
+ for (i = 0; i < max_wal_senders; ++i)
+ {
+ WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0)
+ {
+ /*
+ * We need to hold the spinlock to read LSNs, because we can't be
+ * sure they can be read atomically.
+ */
+ SpinLockAcquire(&walsnd->mutex);
+ if (walsnd->pid != 0)
+ {
+ switch (walsnd->syncReplayState)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ /* Nothing to wait for. */
+ break;
+ case SYNC_REPLAY_JOINING:
+ case SYNC_REPLAY_AVAILABLE:
+ /*
+ * We have to wait until this standby tells us that it has
+ * replayed the commit record.
+ */
+ if (walsnd->apply < XactCommitLSN)
+ ++*waitingFor;
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /*
+ * We have to hold up commits until this standby
+ * acknowledges that its lease was revoked, or we know the
+ * most recently sent lease has expired anyway, whichever
+ * comes first. One way or the other, we don't release
+ * until this standby has started raising an error for
+ * synchronous replay transactions.
+ */
+ if (walsnd->revokingUntil > now)
+ {
+ ++*waitingFor;
+ stallTime = Max(stallTime, walsnd->revokingUntil);
+ }
+ break;
+ }
+ }
+ SpinLockRelease(&walsnd->mutex);
+ }
+ }
+
+ /*
+ * If a walsender has exited uncleanly, then it writes its revoking wait
+ * time into a shared space before it gives up its WalSnd slot. So we
+ * have to wait for that too.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ if (WalSndCtl->revokingUntil > now)
+ {
+ long seconds;
+ int usecs;
+
+ /* Compute how long we have to wait, rounded up to nearest ms. */
+ TimestampDifference(now, WalSndCtl->revokingUntil,
+ &seconds, &usecs);
+ *stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+ }
+ else
+ *stallTimeMillis = 0;
+ LWLockRelease(SyncRepLock);
+
+ /* We are done if we are not waiting for any nodes or stalls. */
+ return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all standbys in "available" and "joining" states to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked. By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ long stallTimeMillis;
+ int waitingFor;
+ char *ps_display_buffer = NULL;
+
+ for (;;)
+ {
+ /* Reset latch before checking state. */
+ ResetLatch(MyLatch);
+
+ /*
+ * Join the queue to be woken up if any synchronous replay
+ * joining/available standby applies XactCommitLSN or the set of
+ * synchronous replay standbys changes (if we aren't already in the
+ * queue). We don't actually know if we need to wait for any peers to
+ * reach the target LSN yet, but we have to register just in case
+ * before checking the walsenders' state to avoid a race condition
+ * that could occur if we did it after calling
+ * SyncReplayCommitCanReturn.  (SyncRepWaitForLSN doesn't have
+ * to do this because it can check the highest-seen LSN in
+ * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+ * lock as the queues. We can't do that here, because there is no
+ * single highest-seen LSN that is useful. We must check
+ * walsnd->apply for all relevant walsenders. Therefore we must
+ * register for notifications first, so that we can be notified via
+ * our latch of any standby applying the LSN we're interested in after
+ * we check but before we start waiting, or we could wait forever for
+ * something that has already happened.)
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ if (MyProc->syncRepState != SYNC_REP_WAITING)
+ {
+ MyProc->waitLSN = XactCommitLSN;
+ MyProc->syncRepState = SYNC_REP_WAITING;
+ SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+ Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+ }
+ LWLockRelease(SyncRepLock);
+
+ /* Check if we're done. */
+ if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+ &stallTimeMillis))
+ {
+ SyncRepCancelWait();
+ break;
+ }
+
+ Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+ /* If we aren't actually waiting for any standbys, leave the queue. */
+ if (waitingFor == 0)
+ SyncRepCancelWait();
+
+ /* Update the ps title. */
+ if (update_process_title)
+ {
+ char buffer[80];
+
+ /* Remember the old value if this is our first update. */
+ if (ps_display_buffer == NULL)
+ {
+ int len;
+ const char *ps_display = get_ps_display(&len);
+
+ ps_display_buffer = palloc(len + 1);
+ memcpy(ps_display_buffer, ps_display, len);
+ ps_display_buffer[len] = '\0';
+ }
+
+ snprintf(buffer, sizeof(buffer),
+ "waiting for %d peer(s) to apply %X/%X%s",
+ waitingFor,
+ (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+ stallTimeMillis > 0 ? " (revoking)" : "");
+ set_ps_display(buffer, false);
+ }
+
+ /* Check if we need to exit early due to postmaster death etc. */
+ if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+ break;
+
+ /*
+ * If we are still waiting for peers, then we wait for any joining or
+ * available peer to reach the LSN (or possibly stop being in one of
+ * those states or go away).
+ *
+ * If not, there must be a non-zero stall time, so we wait for that to
+ * elapse.
+ */
+ if (waitingFor > 0)
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+ WAIT_EVENT_SYNC_REPLAY);
+ else
+ WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+ stallTimeMillis,
+ WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+ }
+
+ /* There is no way out of the loop that could leave us in the queue. */
+ Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+ MyProc->waitLSN = 0;
+
+ /* Restore the ps display. */
+ if (ps_display_buffer != NULL)
+ {
+ set_ps_display(ps_display_buffer, false);
+ pfree(ps_display_buffer);
+ }
+}
+
+/*
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
@@ -149,11 +381,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
const char *old_status;
int mode;
- /* Cap the level for anything other than commit to remote flush only. */
- if (commit)
- mode = SyncRepWaitMode;
- else
- mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+ /* Wait for synchronous replay, if configured. */
+ if (synchronous_replay)
+ SyncReplayWaitForLSN(lsn);
/*
* Fast exit if user has not requested sync replication, or there are no
@@ -169,6 +399,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+ /* Cap the level for anything other than commit to remote flush only. */
+ if (commit)
+ mode = SyncRepWaitMode;
+ else
+ mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
/*
* We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
* set. See SyncRepUpdateSyncStandbysDefined.
@@ -229,57 +465,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
break;
- /*
- * If a wait for synchronous replication is pending, we can neither
- * acknowledge the commit nor raise ERROR or FATAL. The latter would
- * lead the client to believe that the transaction aborted, which is
- * not true: it's already committed locally. The former is no good
- * either: the client has requested synchronous replication, and is
- * entitled to assume that an acknowledged commit is also replicated,
- * which might not be true. So in this case we issue a WARNING (which
- * some clients may be able to interpret) and shut off further output.
- * We do NOT reset ProcDiePending, so that the process will die after
- * the commit is cleaned up.
- */
- if (ProcDiePending)
- {
- ereport(WARNING,
- (errcode(ERRCODE_ADMIN_SHUTDOWN),
- errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
+ /* Check if we need to break early due to cancel/shutdown/death. */
+ if (SyncRepCheckForEarlyExit())
break;
- }
-
- /*
- * It's unclear what to do if a query cancel interrupt arrives. We
- * can't actually abort at this point, but ignoring the interrupt
- * altogether is not helpful, so we just terminate the wait with a
- * suitable warning.
- */
- if (QueryCancelPending)
- {
- QueryCancelPending = false;
- ereport(WARNING,
- (errmsg("canceling wait for synchronous replication due to user request"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
- SyncRepCancelWait();
- break;
- }
-
- /*
- * If the postmaster dies, we'll probably never get an
- * acknowledgement, because all the wal sender processes will exit. So
- * just bail out.
- */
- if (!PostmasterIsAlive())
- {
- ProcDiePending = true;
- whereToSendOutput = DestNone;
- SyncRepCancelWait();
- break;
- }
/*
* Wait on latch. Any condition that should wake us up will set the
@@ -402,14 +590,65 @@ SyncRepInitConfig(void)
}
/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ bool found = false;
+
+ /* If the feature is disabled, then no. */
+ if (synchronous_replay_max_lag == 0)
+ return false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(synchronous_replay_standby_names);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ /* GUC machinery will have already complained - no need to do again */
+ return false;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return found;
+}
+
+/*
* Update the LSNs on each queue based upon our latest state. This
* implements a simple policy of first-valid-sync-standby-releases-waiter.
*
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
* Other policies are possible, which would change what we do here and
* perhaps also which information we store as well.
*/
void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
XLogRecPtr writePtr;
@@ -423,13 +662,15 @@ SyncRepReleaseWaiters(void)
/*
* If this WALSender is serving a standby that is not on the list of
- * potential sync standbys then we have nothing to do. If we are still
- * starting up, still running base backup or the current flush position is
- * still invalid, then leave quickly also.
+ * potential sync standbys and not in a state that synchronous_replay waits
+ * for, then we have nothing to do. If we are still starting up, still
+ * running base backup or the current flush position is still invalid,
+ * then leave quickly also.
*/
- if (MyWalSnd->sync_standby_priority == 0 ||
- MyWalSnd->state < WALSNDSTATE_STREAMING ||
- XLogRecPtrIsInvalid(MyWalSnd->flush))
+ if (!am_syncreplay_blocker &&
+ (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING ||
+ XLogRecPtrIsInvalid(MyWalSnd->flush)))
{
announce_next_takeover = true;
return;
@@ -467,9 +708,10 @@ SyncRepReleaseWaiters(void)
/*
* If the number of sync standbys is less than requested or we aren't
- * managing a sync standby then just leave.
+ * managing a sync standby or a standby in a synchronous replay state that
+ * blocks, then just leave.
*/
- if (!got_recptr || !am_sync)
+ if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = !am_sync;
@@ -478,24 +720,36 @@ SyncRepReleaseWaiters(void)
/*
* Set the lsn first so that when we wake backends they will release up to
- * this location.
+ * this location, for backends waiting for synchronous commit.
*/
- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ if (got_recptr && am_sync)
{
- walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
- numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
- numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
- }
- if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
- {
- walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
- numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+ if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+ numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+ numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+ numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+ }
}
+ /*
+ * Wake backends that are waiting for synchronous_replay, if this walsender
+ * manages a standby that is in synchronous replay 'available' or 'joining'
+ * state.
+ */
+ if (am_syncreplay_blocker)
+ SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+ MyWalSnd->apply);
+
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -993,9 +1247,8 @@ SyncRepGetStandbyPriority(void)
* Must hold SyncRepLock.
*/
static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
{
- volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
@@ -1012,7 +1265,7 @@ SyncRepWakeQueue(bool all, int mode)
/*
* Assume the queue is ordered by LSN
*/
- if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+ if (!all && lsn < proc->waitLSN)
return numprocs;
/*
@@ -1079,7 +1332,7 @@ SyncRepUpdateSyncStandbysDefined(void)
int i;
for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
- SyncRepWakeQueue(true, i);
+ SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
}
/*
@@ -1130,6 +1383,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
}
#endif
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+ /*
+ * If a wait for synchronous replication is pending, we can neither
+ * acknowledge the commit nor raise ERROR or FATAL. The latter would
+ * lead the client to believe that the transaction aborted, which is
+ * not true: it's already committed locally. The former is no good
+ * either: the client has requested synchronous replication, and is
+ * entitled to assume that an acknowledged commit is also replicated,
+ * which might not be true. So in this case we issue a WARNING (which
+ * some clients may be able to interpret) and shut off further output.
+ * We do NOT reset ProcDiePending, so that the process will die after
+ * the commit is cleaned up.
+ */
+ if (ProcDiePending)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * It's unclear what to do if a query cancel interrupt arrives. We
+ * can't actually abort at this point, but ignoring the interrupt
+ * altogether is not helpful, so we just terminate the wait with a
+ * suitable warning.
+ */
+ if (QueryCancelPending)
+ {
+ QueryCancelPending = false;
+ ereport(WARNING,
+ (errmsg("canceling wait for synchronous replication due to user request"),
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ SyncRepCancelWait();
+ return true;
+ }
+
+ /*
+ * If the postmaster dies, we'll probably never get an
+ * acknowledgement, because all the wal sender processes will exit. So
+ * just bail out.
+ */
+ if (!PostmasterIsAlive())
+ {
+ ProcDiePending = true;
+ whereToSendOutput = DestNone;
+ SyncRepCancelWait();
+ return true;
+ }
+
+ return false;
+}
+
/*
* ===========================================================
* Synchronous Replication functions executed by any process
@@ -1199,6 +1510,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
SyncRepConfig = (SyncRepConfigData *) extra;
}
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+ char *rawstring;
+ List *elemlist;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(*newval);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ GUC_check_errdetail("List syntax is invalid.");
+ pfree(rawstring);
+ list_free(elemlist);
+ return false;
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return true;
+}
+
void
assign_synchronous_commit(int newval, void *extra)
{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ea9d21a46b3..14a971f9822 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/ipc.h"
@@ -139,9 +140,10 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *syncReplayLease);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -466,7 +468,7 @@ WalReceiverMain(void)
}
/* Let the master know that we received some data. */
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
/*
* If we've written some records, flush them to disk and
@@ -511,7 +513,7 @@ WalReceiverMain(void)
*/
walrcv->force_reply = false;
pg_memory_barrier();
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, -1);
}
}
if (rc & WL_POSTMASTER_DEATH)
@@ -569,7 +571,7 @@ WalReceiverMain(void)
}
}
- XLogWalRcvSendReply(requestReply, requestReply);
+ XLogWalRcvSendReply(requestReply, requestReply, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -874,6 +876,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
XLogRecPtr walEnd;
TimestampTz sendTime;
bool replyRequested;
+ TimestampTz syncReplayLease;
+ int64 messageNumber;
resetStringInfo(&incoming_message);
@@ -893,7 +897,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
dataStart = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, NULL);
buf += hdrlen;
len -= hdrlen;
@@ -903,7 +907,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
case 'k': /* Keepalive */
{
/* copy message to StringInfo */
- hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+ hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+ sizeof(char) + sizeof(int64);
if (len != hdrlen)
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -911,15 +916,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
appendBinaryStringInfo(&incoming_message, buf, hdrlen);
/* read the fields */
+ messageNumber = pq_getmsgint64(&incoming_message);
walEnd = pq_getmsgint64(&incoming_message);
sendTime = pq_getmsgint64(&incoming_message);
replyRequested = pq_getmsgbyte(&incoming_message);
+ syncReplayLease = pq_getmsgint64(&incoming_message);
- ProcessWalSndrMessage(walEnd, sendTime);
+ ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
/* If the primary requested a reply, send one immediately */
if (replyRequested)
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, messageNumber);
break;
}
default:
@@ -1082,7 +1089,7 @@ XLogWalRcvFlush(bool dying)
/* Also let the master know that we made some progress */
if (!dying)
{
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, -1);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -1100,9 +1107,12 @@ XLogWalRcvFlush(bool dying)
* If 'requestReply' is true, requests the server to reply immediately upon
* receiving this message. This is used for heartbearts, when approaching
* wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should be that message's number, otherwise -1.
*/
static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
{
static XLogRecPtr writePtr = 0;
static XLogRecPtr flushPtr = 0;
@@ -1149,6 +1159,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
pq_sendint64(&reply_message, applyPtr);
pq_sendint64(&reply_message, GetCurrentTimestamp());
pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+ pq_sendint64(&reply_message, replyTo);
/* Send it */
elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1281,15 +1292,56 @@ XLogWalRcvSendHSFeedback(bool immed)
* Update shared memory status upon receiving a message from primary.
*
* 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary.  'syncReplayLease' points to the time until
+ * which the primary promises that this standby can safely claim to be
+ * causally consistent, to 0 if it cannot, or is a NULL pointer for no change.
*/
static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+ TimestampTz *syncReplayLease)
{
WalRcvData *walrcv = WalRcv;
TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+ /* Sanity check for the syncReplayLease time. */
+ if (syncReplayLease != NULL && *syncReplayLease != 0)
+ {
+ /*
+ * Deduce max_clock_skew from the syncReplayLease and sendTime since
+ * we don't have access to the primary's GUC. The primary already
+ * subtracted 25% from synchronous_replay_lease_time to represent
+ * max_clock_skew, so we have 75%. A third of that will give us 25%.
+ */
+ int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+ int64 max_clock_skew = diffMillis / 3;
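+ /*
+  * (Worked example, assuming the primary uses the default
+  * synchronous_replay_lease_time of 5s: the lease it sends is roughly
+  * sendTime + 3750ms, so diffMillis is about 3750 and max_clock_skew
+  * comes out as 1250ms, i.e. 25% of the lease time.)
+  */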
+ if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+ max_clock_skew))
+ {
+ /*
+ * The primary's clock is more than max_clock_skew + network
+ * latency ahead of the standby's clock. (If the primary's clock
+ * is more than max_clock_skew ahead of the standby's clock, but
+ * by less than the network latency, then there isn't much we can
+ * do to detect that; but it still seems useful to have this basic
+ * sanity check for wildly misconfigured servers.)
+ */
+ ereport(LOG,
+ (errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+ errhint("Check your servers' NTP configuration or equivalent.")));
+
+ syncReplayLease = NULL;
+ }
+ /*
+ * We could also try to detect cases where sendTime is more than
+ * max_clock_skew in the past according to the standby's clock, but
+ * that is indistinguishable from network latency/buffering, so we
+ * could produce misleading error messages; if we do nothing, the
+ * consequence is 'standby is not available for synchronous replay'
+ * errors which should cause the user to investigate.
+ */
+ }
+
/* Update shared-memory status */
SpinLockAcquire(&walrcv->mutex);
if (walrcv->latestWalEnd < walEnd)
@@ -1297,6 +1349,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
walrcv->latestWalEnd = walEnd;
walrcv->lastMsgSendTime = sendTime;
walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+ if (syncReplayLease != NULL)
+ walrcv->syncReplayLease = *syncReplayLease;
SpinLockRelease(&walrcv->mutex);
if (log_min_messages <= DEBUG2)
@@ -1334,7 +1388,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
* This is called by the startup process whenever interesting xlog records
* are applied, so that walreceiver can check if it needs to send an apply
* notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
*/
void
WalRcvForceReply(void)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 8ed7254b5c6..dec98eb48c8 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
#include "replication/walreceiver.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/guc.h"
#include "utils/timestamp.h"
WalRcvData *WalRcv = NULL;
@@ -373,3 +374,21 @@ GetReplicationTransferLatency(void)
return ms;
}
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+ WalRcvData *walrcv = WalRcv;
+ TimestampTz now = GetCurrentTimestamp();
+ bool result;
+
+ SpinLockAcquire(&walrcv->mutex);
+ result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+ SpinLockRelease(&walrcv->mutex);
+
+ return result;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index db346e6edbd..441e6ddc25d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -167,9 +167,23 @@ static StringInfoData tmpbuf;
*/
static TimestampTz last_reply_timestamp = 0;
+static TimestampTz last_keepalive_timestamp = 0;
+
/* Have we sent a heartbeat message asking for reply, since last reply? */
static bool waiting_for_ping_response = false;
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
/*
* While streaming WAL in Copy mode, streamingDoneSending is set to true
* after we have sent CopyDone. We should not send any more CopyData messages
@@ -239,7 +253,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
static void WalSndKeepaliveIfNecessary(TimestampTz now);
static void WalSndCheckTimeOut(TimestampTz now);
static long WalSndComputeSleeptime(TimestampTz now);
@@ -281,6 +295,61 @@ InitWalSender(void)
}
/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+ if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+ {
+ /*
+ * We've lost contact with the standby, but it may still be alive. We
+ * can't let any committing synchronous_replay transactions return
+ * control until we've stalled for long enough for a zombie standby to
+ * start raising errors because its lease has expired. Because our
+ * WalSnd slot is going away, we need to use the shared
+ * WalSndCtl->revokingUntil variable.
+ */
+ elog(LOG,
+ "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+ application_name);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+ synchronous_replay_last_lease);
+ LWLockRelease(SyncRepLock);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+ if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+ {
+ /*
+ * The standby is shutting down, so it won't be running any more
+ * transactions. It is therefore safe to stop waiting for it without
+ * any kind of lease revocation protocol.
+ */
+ elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+ SpinLockRelease(&MyWalSnd->mutex);
+ }
+}
+
+/*
* Clean up after an error.
*
* WAL sender processes don't use transactions like regular backends do.
@@ -308,7 +377,10 @@ WalSndErrorCleanup(void)
replication_active = false;
if (got_STOPPING || got_SIGUSR2)
+ {
+ PrepareUncleanExit();
proc_exit(0);
+ }
/* Revert back to startup state */
WalSndSetState(WALSNDSTATE_STARTUP);
@@ -320,6 +392,8 @@ WalSndErrorCleanup(void)
static void
WalSndShutdown(void)
{
+ PrepareUncleanExit();
+
/*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -1583,6 +1657,7 @@ ProcessRepliesIfAny(void)
if (r < 0)
{
/* unexpected error or EOF */
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1599,6 +1674,7 @@ ProcessRepliesIfAny(void)
resetStringInfo(&reply_message);
if (pq_getmessage(&reply_message, 0))
{
+ PrepareUncleanExit();
ereport(COMMERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg("unexpected EOF on standby connection")));
@@ -1648,6 +1724,7 @@ ProcessRepliesIfAny(void)
* 'X' means that the standby is closing down the socket.
*/
case 'X':
+ PrepareCleanExit();
proc_exit(0);
default:
@@ -1745,9 +1822,11 @@ ProcessStandbyReplyMessage(void)
flushLag,
applyLag;
bool clearLagTimes;
+ int64 replyTo;
TimestampTz now;
static bool fullyAppliedLastTime = false;
+ static TimestampTz fullyAppliedSince = 0;
/* the caller already consumed the msgtype byte */
writePtr = pq_getmsgint64(&reply_message);
@@ -1755,6 +1834,7 @@ ProcessStandbyReplyMessage(void)
applyPtr = pq_getmsgint64(&reply_message);
(void) pq_getmsgint64(&reply_message); /* sendTime; not used ATM */
replyRequested = pq_getmsgbyte(&reply_message);
+ replyTo = pq_getmsgint64(&reply_message);
elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
(uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1769,17 +1849,17 @@ ProcessStandbyReplyMessage(void)
applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
/*
- * If the standby reports that it has fully replayed the WAL in two
- * consecutive reply messages, then the second such message must result
- * from wal_receiver_status_interval expiring on the standby. This is a
- * convenient time to forget the lag times measured when it last
- * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
- * until more WAL traffic arrives.
+ * If the standby reports that it has fully replayed the WAL for at least
+ * wal_receiver_status_interval, then let's clear the lag times that were
+ * measured when it last wrote/flushed/applied a WAL record. This way we
+ * avoid displaying stale lag data until more WAL traffic arrives.
*/
clearLagTimes = false;
if (applyPtr == sentPtr)
{
- if (fullyAppliedLastTime)
+ if (!fullyAppliedLastTime)
+ fullyAppliedSince = now;
+ else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
clearLagTimes = true;
fullyAppliedLastTime = true;
}
@@ -1795,8 +1875,53 @@ ProcessStandbyReplyMessage(void)
* standby.
*/
{
+ int next_sr_state = -1;
WalSnd *walsnd = MyWalSnd;
+ /* Handle synchronous replay state machine. */
+ if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+ {
+ bool replay_lag_acceptable;
+
+ /* Check if the lag is acceptable (includes -1 for caught up). */
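+ /*
+  * Note: synchronous_replay_max_lag is in milliseconds while applyLag is
+  * in microseconds, hence the factor of 1000 below.
+  */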
+ if (applyLag < synchronous_replay_max_lag * 1000)
+ replay_lag_acceptable = true;
+ else
+ replay_lag_acceptable = false;
+
+ /* Figure out next if the state needs to change. */
+ switch (walsnd->syncReplayState)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ /* Can we join? */
+ if (replay_lag_acceptable)
+ next_sr_state = SYNC_REPLAY_JOINING;
+ break;
+ case SYNC_REPLAY_JOINING:
+ /* Are we still applying fast enough? */
+ if (replay_lag_acceptable)
+ {
+ /* Have we reached the join point yet? */
+ if (applyPtr >= synchronous_replay_joining_until)
+ next_sr_state = SYNC_REPLAY_AVAILABLE;
+ }
+ else
+ next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+ break;
+ case SYNC_REPLAY_AVAILABLE:
+ /* Are we still applying fast enough? */
+ if (!replay_lag_acceptable)
+ next_sr_state = SYNC_REPLAY_REVOKING;
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /* Has the revocation been acknowledged or timed out? */
+ if (replyTo == synchronous_replay_revoke_msgno ||
+ now >= walsnd->revokingUntil)
+ next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+ break;
+ }
+ }
+
SpinLockAcquire(&walsnd->mutex);
walsnd->write = writePtr;
walsnd->flush = flushPtr;
@@ -1807,11 +1932,55 @@ ProcessStandbyReplyMessage(void)
walsnd->flushLag = flushLag;
if (applyLag != -1 || clearLagTimes)
walsnd->applyLag = applyLag;
+ if (next_sr_state != -1)
+ walsnd->syncReplayState = next_sr_state;
+ if (next_sr_state == SYNC_REPLAY_REVOKING)
+ walsnd->revokingUntil = synchronous_replay_last_lease;
SpinLockRelease(&walsnd->mutex);
+
+ /*
+ * Post shmem-update actions for synchronous replay state transitions.
+ */
+ switch (next_sr_state)
+ {
+ case SYNC_REPLAY_JOINING:
+ /*
+ * Now that we've started waiting for this standby, we need to
+ * make sure that everything flushed before now has been applied
+ * before we move to available and issue a lease.
+ */
+ synchronous_replay_joining_until = GetFlushRecPtr();
+ ereport(LOG,
+ (errmsg("standby \"%s\" joining synchronous replay set...",
+ application_name)));
+ break;
+ case SYNC_REPLAY_AVAILABLE:
+ /* Issue a new lease to the standby. */
+ WalSndKeepalive(false);
+ ereport(LOG,
+ (errmsg("standby \"%s\" is available for synchronous replay",
+ application_name)));
+ break;
+ case SYNC_REPLAY_REVOKING:
+ /* Revoke the standby's lease, and note the message number. */
+ synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+ ereport(LOG,
+ (errmsg("revoking synchronous replay lease for standby \"%s\"...",
+ application_name)));
+ break;
+ case SYNC_REPLAY_UNAVAILABLE:
+ ereport(LOG,
+ (errmsg("standby \"%s\" is no longer available for synchronous replay",
+ application_name)));
+ break;
+ default:
+ /* No change. */
+ break;
+ }
}
if (!am_cascading_walsender)
- SyncRepReleaseWaiters();
+ SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
/*
* Advance our local xmin horizon when the client confirmed a flush.
@@ -2001,33 +2170,52 @@ ProcessStandbyHSFeedbackMessage(void)
* If wal_sender_timeout is enabled we want to wake up in time to send
* keepalives and to abort the connection if wal_sender_timeout has been
* reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
*/
static long
WalSndComputeSleeptime(TimestampTz now)
{
long sleeptime = 10000; /* 10 s */
- if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+ if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+ am_potential_synchronous_replay_standby)
{
TimestampTz wakeup_time;
long sec_to_timeout;
int microsec_to_timeout;
- /*
- * At the latest stop sleeping once wal_sender_timeout has been
- * reached.
- */
- wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
-
- /*
- * If no ping has been sent yet, wakeup when it's time to do so.
- * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
- * the timeout passed without a response.
- */
- if (!waiting_for_ping_response)
+ if (am_potential_synchronous_replay_standby)
+ {
+ /*
+ * We need to keep replacing leases before they expire. We'll do
+ * that halfway through the lease time according to our clock, to
+ * allow for the standby's clock to be ahead of the primary's by
+ * 25% of synchronous_replay_lease_time.
+ */
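+ /*
+  * (With the default synchronous_replay_lease_time of 5s this means a
+  * keepalive roughly every 2.5s, comfortably before the ~3.75s lease
+  * most recently sent to the standby runs out.)
+  */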
+ wakeup_time =
+ TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ synchronous_replay_lease_time / 2);
+ }
+ else
+ {
+ /*
+ * At the latest stop sleeping once wal_sender_timeout has been
+ * reached.
+ */
wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ wal_sender_timeout);
+
+ /*
+ * If no ping has been sent yet, wakeup when it's time to do so.
+ * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+ * half of the timeout passed without a response.
+ */
+ if (!waiting_for_ping_response)
+ wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
+ }
/* Compute relative time until wakeup. */
TimestampDifference(now, wakeup_time,
@@ -2043,20 +2231,33 @@ WalSndComputeSleeptime(TimestampTz now)
/*
* Check whether there have been responses by the client within
* wal_sender_timeout and shutdown if not.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
*/
static void
WalSndCheckTimeOut(TimestampTz now)
{
TimestampTz timeout;
+ int allowed_time;
/* don't bail out if we're doing something that doesn't require timeouts */
if (last_reply_timestamp <= 0)
return;
- timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout);
+ /*
+ * If synchronous replay support is configured, we use
+ * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+ * the time before an unresponsive synchronous replay standby is dropped.
+ */
+ if (am_potential_synchronous_replay_standby)
+ allowed_time = synchronous_replay_lease_time;
+ else
+ allowed_time = wal_sender_timeout;
- if (wal_sender_timeout > 0 && now >= timeout)
+ timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ allowed_time);
+ if (allowed_time > 0 && now >= timeout)
{
/*
* Since typically expiration of replication timeout means
@@ -2084,6 +2285,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
/* Report to pgstat that this process is running */
pgstat_report_activity(STATE_RUNNING, NULL);
+ /* Check if we are managing a potential synchronous replay standby. */
+ am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
/*
* Loop until we reach the end of this timeline or the client requests to
* stop streaming.
@@ -2249,6 +2453,7 @@ InitWalSenderSlot(void)
walsnd->flushLag = -1;
walsnd->applyLag = -1;
walsnd->state = WALSNDSTATE_STARTUP;
+ walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
walsnd->latch = &MyProc->procLatch;
SpinLockRelease(&walsnd->mutex);
/* don't need the lock anymore */
@@ -3131,6 +3336,27 @@ WalSndGetStateString(WalSndState state)
return "UNKNOWN";
}
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+ switch (state)
+ {
+ case SYNC_REPLAY_UNAVAILABLE:
+ return "unavailable";
+ case SYNC_REPLAY_JOINING:
+ return "joining";
+ case SYNC_REPLAY_AVAILABLE:
+ return "available";
+ case SYNC_REPLAY_REVOKING:
+ return "revoking";
+ }
+ return "UNKNOWN";
+}
+
static Interval *
offset_to_interval(TimeOffset offset)
{
@@ -3150,7 +3376,7 @@ offset_to_interval(TimeOffset offset)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 11
+#define PG_STAT_GET_WAL_SENDERS_COLS 12
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -3204,6 +3430,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
int priority;
int pid;
WalSndState state;
+ SyncReplayState syncReplayState;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -3216,6 +3443,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
pid = walsnd->pid;
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ syncReplayState = walsnd->syncReplayState;
write = walsnd->write;
flush = walsnd->flush;
apply = walsnd->apply;
@@ -3298,6 +3526,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[10] = CStringGetTextDatum("potential");
+
+ values[11] =
+ CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3313,21 +3544,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* This function is used to send a keepalive message to standby.
* If requestReply is set, sets a flag in the message requesting the standby
* to send a message back to us, for heartbeat purposes.
+ * Return the serial number of the message that was sent.
*/
-static void
+static int64
WalSndKeepalive(bool requestReply)
{
+ TimestampTz synchronous_replay_lease;
+ TimestampTz now;
+
+ static int64 message_number = 0;
+
elog(DEBUG2, "sending replication keepalive");
+ /* Grant a synchronous replay lease if appropriate. */
+ now = GetCurrentTimestamp();
+ if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+ {
+ /* No lease granted, and any earlier lease is revoked. */
+ synchronous_replay_lease = 0;
+ }
+ else
+ {
+ /*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew. We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ *
+ * 1. If the standby's clock is slow (ie behind the primary's) by up
+ * to that much, then subtracting this amount makes sure the
+ * lease doesn't survive past that time according to the primary's
+ * clock.
+ *
+ * 2. If the standby's clock is fast (ie ahead of the primary's) by
+ * up to that much, then subtracting this amount ensures there are no
+ * gaps between leases, since leases are reissued every time 50% of
+ * the lease time elapses (see WalSndKeepaliveIfNecessary and
+ * WalSndComputeSleeptime).
+ */
+ int max_clock_skew = synchronous_replay_lease_time / 4;
+
+ /* Compute and remember the expiry time of the lease we're granting. */
+ synchronous_replay_last_lease =
+ TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+ /* Adjust the version we send for clock skew. */
+ synchronous_replay_lease =
+ TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+ -max_clock_skew);
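+
+ /*
+  * (For example, with the default synchronous_replay_lease_time of 5s:
+  * max_clock_skew is 1250ms, we remember now + 5s as the lease expiry for
+  * our own accounting, and the standby is told that its lease lasts until
+  * now + 3750ms.)
+  */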
+ }
+
/* construct the message... */
resetStringInfo(&output_message);
pq_sendbyte(&output_message, 'k');
+ pq_sendint64(&output_message, ++message_number);
pq_sendint64(&output_message, sentPtr);
- pq_sendint64(&output_message, GetCurrentTimestamp());
+ pq_sendint64(&output_message, now);
pq_sendbyte(&output_message, requestReply ? 1 : 0);
+ pq_sendint64(&output_message, synchronous_replay_lease);
/* ... and send it wrapped in CopyData */
pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+ return message_number;
}
/*
@@ -3342,23 +3621,35 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
* Don't send keepalive messages if timeouts are globally disabled or
* we're doing something not partaking in timeouts.
*/
- if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
- return;
-
- if (waiting_for_ping_response)
- return;
+ if (!am_potential_synchronous_replay_standby)
+ {
+ if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+ return;
+ if (waiting_for_ping_response)
+ return;
+ }
/*
* If half of wal_sender_timeout has lapsed without receiving any reply
* from the standby, send a keep-alive message to the standby requesting
* an immediate reply.
+ *
+ * If synchronous replay has been configured, use
+ * synchronous_replay_lease_time to control keepalive intervals rather
+ * than wal_sender_timeout, so that we can keep replacing leases at the
+ * right frequency.
*/
- ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
- wal_sender_timeout / 2);
+ if (am_potential_synchronous_replay_standby)
+ ping_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+ synchronous_replay_lease_time / 2);
+ else
+ ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ wal_sender_timeout / 2);
if (now >= ping_time)
{
WalSndKeepalive(true);
waiting_for_ping_response = true;
+ last_keepalive_timestamp = now;
/* Try to flush pending output to the client */
if (pq_flush_if_writable() != 0)
@@ -3398,7 +3689,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
*/
new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
buffer_full = false;
- for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+ for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
{
if (new_write_head == LagTracker.read_heads[i])
buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 4f354717628..d1751f6e0c0 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -307,6 +307,7 @@ Section: Class 40 - Transaction Rollback
40001 E ERRCODE_T_R_SERIALIZATION_FAILURE serialization_failure
40003 E ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN statement_completion_unknown
40P01 E ERRCODE_T_R_DEADLOCK_DETECTED deadlock_detected
+40P02 E ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE synchronous_replay_not_available
Section: Class 42 - Syntax Error or Access Rule Violation
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 246fea8693b..63e0619bf23 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1647,6 +1647,16 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+ gettext_noop("Enables synchronous replay."),
+ NULL
+ },
+ &synchronous_replay,
+ false,
+ NULL, NULL, NULL
+ },
+
+ {
{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
NULL
@@ -2885,6 +2895,28 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &synchronous_replay_max_lag,
+ 0, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &synchronous_replay_lease_time,
+ 5000, 0, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3567,6 +3599,17 @@ static struct config_string ConfigureNamesString[] =
},
{
+ {"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("List of names of potential synchronous replay standbys."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &synchronous_replay_standby_names,
+ "*",
+ check_synchronous_replay_standby_names, NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index df5d2f3f22f..8b5276a44cd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -252,6 +252,17 @@
# from standby(s); '*' = all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
+#synchronous_replay_max_lag = 0s # maximum replication delay to tolerate from
+ # standbys before dropping them from the synchronous
+ # replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s # how long individual leases granted to
+ # synchronous replay standbys should last; should be 4 times
+ # the max possible clock skew
+
+#synchronous_replay_standby_names = '*' # standby servers that can join the
+ # synchronous replay set; '*' = all
+
# - Standby Servers -
# These settings are ignored on a master server.
@@ -282,6 +293,14 @@
# (change requires restart)
#max_sync_workers_per_subscription = 2 # taken from max_logical_replication_workers
+# - All Servers -
+
+#synchronous_replay = off # "on" in any pair of consecutive
+ # transactions guarantees that the second
+ # can see the first (even if the second
+ # is run on a standby), or will raise an
+ # error to report that the standby is
+ # unavailable for synchronous replay
#------------------------------------------------------------------------------
# QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 08a08c8e8fc..55aef58fcd2 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
#include "catalog/catalog.h"
#include "lib/pairingheap.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -332,6 +334,17 @@ GetTransactionSnapshot(void)
"cannot take query snapshot during a parallel operation");
/*
+ * In synchronous_replay mode on a standby, check if we have definitely
+ * applied WAL for any COMMIT that returned successfully on the
+ * primary.
+ */
+ if (synchronous_replay && RecoveryInProgress() &&
+ !WalRcvSyncReplayAvailable())
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+ errmsg("standby is not available for synchronous replay")));
+
+ /*
* In transaction-snapshot mode, the first snapshot must live until
* end of xact regardless of what the caller does with it, so we must
* make a copy of it rather than returning CurrentSnapshotData
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 6811a55e764..c4a3583b672 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -117,7 +117,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
/*
@@ -150,6 +150,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
startpos = output_written_lsn;
last_written_lsn = output_written_lsn;
@@ -469,6 +471,8 @@ StreamLogicalLog(void)
* rest.
*/
pos = 1; /* skip msgtype 'k' */
+ pos += 8; /* skip messageNumber */
+
walEnd = fe_recvint64(©buf[pos]);
output_written_lsn = Max(walEnd, output_written_lsn);
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 888458f4a90..29c67c5e63a 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -327,7 +327,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
static bool
sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
{
- char replybuf[1 + 8 + 8 + 8 + 8 + 1];
+ char replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
int len = 0;
replybuf[len] = 'r';
@@ -345,6 +345,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
len += 8;
replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
len += 1;
+ fe_sendint64(-1, &replybuf[len]); /* replyTo */
+ len += 8;
if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
{
@@ -1032,6 +1034,7 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
* check if the server requested a reply, and ignore the rest.
*/
pos = 1; /* skip msgtype 'k' */
+ pos += 8; /* skip messageNumber */
pos += 8; /* skip walEnd */
pos += 8; /* skip sendTime */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d820b56aa1b..d909ae0f2b7 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2874,7 +2874,7 @@ DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 0 f f f
DESCR("statistics: information about currently active backends");
DATA(insert OID = 3318 ( pg_stat_get_progress_info PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 3317 ( pg_stat_get_wal_receiver PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
DESCR("statistics: information about WAL receiver");
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57ac5d41e46..e23d4269bb6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
WAIT_EVENT_REPLICATION_ORIGIN_DROP,
WAIT_EVENT_REPLICATION_SLOT_DROP,
WAIT_EVENT_SAFE_SNAPSHOT,
- WAIT_EVENT_SYNC_REP
+ WAIT_EVENT_SYNC_REP,
+ WAIT_EVENT_SYNC_REPLAY
} WaitEventIPC;
/* ----------
@@ -829,7 +830,8 @@ typedef enum
{
WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
WAIT_EVENT_PG_SLEEP,
- WAIT_EVENT_RECOVERY_APPLY_DELAY
+ WAIT_EVENT_RECOVERY_APPLY_DELAY,
+ WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
} WaitEventTimeout;
/* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ceafe2cbea1..e2bc88f7c23 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "utils/guc.h"
+#include "utils/timestamp.h"
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
#define SYNC_REP_WAIT_WRITE 0
#define SYNC_REP_WAIT_FLUSH 1
#define SYNC_REP_WAIT_APPLY 2
+#define SYNC_REP_WAIT_SYNC_REPLAY 3
-#define NUM_SYNC_REP_WAIT_MODE 3
+#define NUM_SYNC_REP_WAIT_MODE 4
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
@@ -36,6 +38,12 @@
#define SYNC_REP_PRIORITY 0
#define SYNC_REP_QUORUM 1
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
/*
* Struct for the configuration of synchronous replication.
*
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
/* called by wal sender */
extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
/* called by wal sender and user backend */
extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
/* called by checkpointer */
extern void SyncRepUpdateSyncStandbysDefined(void);
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
/* GUC infrastructure */
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_standby_names(const char *newval, void *extra);
extern void assign_synchronous_commit(int newval, void *extra);
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9a8b2e207ec..bbd7ffaa705 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
TimeLineID receivedTLI;
/*
+ * syncReplayLease is the time until which the primary has authorized this
+ * standby to consider itself available for synchronous_replay mode, or 0
+ * for not authorized.
+ */
+ TimestampTz syncReplayLease;
+
+ /*
* latestChunkStart is the starting byte position of the current "batch"
* of received WAL. It's actually the same as the previous value of
* receivedUpto before the last flush to disk. Startup process can use
@@ -298,4 +305,6 @@ extern int GetReplicationApplyDelay(void);
extern int GetReplicationTransferLatency(void);
extern void WalRcvForceReply(void);
+extern bool WalRcvSyncReplayAvailable(void);
+
#endif /* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 17c68cba235..35a7fab6733 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
WALSNDSTATE_STOPPING
} WalSndState;
+typedef enum SyncReplayState
+{
+ SYNC_REPLAY_UNAVAILABLE = 0,
+ SYNC_REPLAY_JOINING,
+ SYNC_REPLAY_AVAILABLE,
+ SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
/*
* Each walsender has a WalSnd struct in shared memory.
*
@@ -60,6 +68,10 @@ typedef struct WalSnd
TimeOffset flushLag;
TimeOffset applyLag;
+ /* Synchronous replay state for this walsender. */
+ SyncReplayState syncReplayState;
+ TimestampTz revokingUntil;
+
/* Protects shared variables shown above. */
slock_t mutex;
@@ -101,6 +113,14 @@ typedef struct
*/
bool sync_standbys_defined;
+ /*
+ * Until when must commits in synchronous replay stall? This is used to
+ * wait for synchronous replay leases to expire when a walsender exits
+ * uncleanly, and we must stall synchronous replay commits until we're
+ * sure that the remote server's lease has expired.
+ */
+ TimestampTz revokingUntil;
+
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index d582bc9ee44..85be4978731 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1859,9 +1859,10 @@ pg_stat_replication| SELECT s.pid,
w.flush_lag,
w.replay_lag,
w.sync_priority,
- w.sync_state
+ w.sync_state,
+ w.sync_replay
FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
- JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+ JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_ssl| SELECT s.pid,
s.ssl,
--
2.13.2
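For illustration only, here is a minimal sketch of how a client might rely on
the new setting, based on the synchronous_replay GUC and the standby-side
error added in snapmgr.c above; the connection strings, table and column are
hypothetical and error handling is kept to a minimum:

```
/* Hypothetical libpq client: write on the primary, then read on a standby. */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *primary = PQconnectdb("host=primary dbname=test");
	PGconn	   *standby = PQconnectdb("host=standby dbname=test");
	PGresult   *res;

	if (PQstatus(primary) != CONNECTION_OK || PQstatus(standby) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed\n");
		return EXIT_FAILURE;
	}

	/* tx1: commit on the primary with synchronous_replay enabled */
	PQclear(PQexec(primary, "SET synchronous_replay = on"));
	res = PQexec(primary, "INSERT INTO t (x) VALUES (42)");
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		fprintf(stderr, "insert failed: %s", PQerrorMessage(primary));
	PQclear(res);

	/*
	 * tx2: read on the standby.  With synchronous_replay = on it either sees
	 * the row from tx1 or fails with "standby is not available for
	 * synchronous replay".
	 */
	PQclear(PQexec(standby, "SET synchronous_replay = on"));
	res = PQexec(standby, "SELECT count(*) FROM t WHERE x = 42");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "standby read failed: %s", PQerrorMessage(standby));
	PQclear(res);

	PQfinish(primary);
	PQfinish(standby);
	return EXIT_SUCCESS;
}
```

A client or pooler could treat that error as a cue to retry the read on
another standby or on the primary, rather than risking a stale result.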
On 31 July 2017 at 07:49, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6@gmail.com>
wrote:
I looked through the code of `synchronous-replay-v1.patch` a bit and ran
a few
tests. I didn't manage to break anything, except one mysterious error
that I've
got only once on one of my replicas, but I couldn't reproduce it yet.
Interesting thing is that this error did not affect another replica or
primary.
Just in case here is the log for this error (maybe you can see something
obvious, that I've not noticed):

LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
Directory not empty
...

Hmm. The first error ("could not remove directory") could perhaps be
explained by temporary files from concurrent backends.
...
Perhaps in your testing you accidentally copied a pgdata directory over the
top of it while it was running? In any case I'm struggling to see how
anything in this patch would affect anything at the REDO level.

Hmm...no, I don't think so. Basically what I was doing is just running
`installcheck` against a primary instance (I assume there is nothing wrong
with this approach, am I right?). This particular error was caused by the
`tablespace` test, which failed in this case:
```
INSERT INTO testschema.foo VALUES(1);
ERROR: could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390":
No such file or directory
```
I tried a few more times, and I got it two times out of four attempts on a
fresh installation (when all instances were on the same machine). But anyway
I'll try to investigate; maybe it has something to do with my environment.
* Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
time shift when we're dealing with leases). Does it make sense to move
them out and make them configurable?

These numbers are interrelated, and I think they're best fixed in that
ratio. You could make it more adjustable, but I think it's better to
keep it simple with just a single knob.

Ok, but what do you think about converting them to constants to make them
more self-explanatory? Like:
```
/*
 * Since this timestamp is being sent to the standby where it will be
 * compared against a time generated by the standby's system clock, we
 * must consider clock skew. We use 25% of the lease time as max
 * clock skew, and we subtract that from the time we send with the
 * following reasoning:
 */
int max_clock_skew = synchronous_replay_lease_time * MAX_CLOCK_SKEW_PORTION;
```
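For what it's worth, one way of spelling out those ratios as named constants;
this is only a sketch using the 25%/50% figures quoted above, and the helper
functions (and their names) are hypothetical rather than taken from the patch:

```
#include "postgres.h"
#include "utils/timestamp.h"

#define MAX_CLOCK_SKEW_PORTION	0.25	/* 25% of the lease time allowed as clock skew */
#define LEASE_RENEWAL_PORTION	0.50	/* renew leases after 50% of the lease time */

/* Expiry time to send to the standby, discounted for possible clock skew. */
static TimestampTz
compute_lease_expiry(TimestampTz now, int lease_time_ms)
{
	int			max_clock_skew_ms = (int) (lease_time_ms * MAX_CLOCK_SKEW_PORTION);

	/*
	 * Subtracting the allowed skew means a standby whose clock lags the
	 * primary's by up to max_clock_skew_ms still considers the lease expired
	 * no later than the primary does.
	 */
	return TimestampTzPlusMilliseconds(now, lease_time_ms - max_clock_skew_ms);
}

/* Time at which the primary would send a replacement lease. */
static TimestampTz
compute_next_lease_renewal(TimestampTz now, int lease_time_ms)
{
	return TimestampTzPlusMilliseconds(now, (int) (lease_time_ms * LEASE_RENEWAL_PORTION));
}
```

Keeping both numbers as fractions of the single lease-time knob preserves the
fixed ratio described above while making it obvious where the 25% and 50%
come from.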
Also I have another question. I tried to test this patch a little bit more,
and I've got some strange behaviour after pgbench (here is the full output
[1]):
```
# primary
$ ./bin/pgbench -s 100 -i test
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 10000000 tuples (1%) done (elapsed 0.11 s, remaining 10.50 s)
200000 of 10000000 tuples (2%) done (elapsed 1.06 s, remaining 52.00 s)
300000 of 10000000 tuples (3%) done (elapsed 1.88 s, remaining 60.87 s)
2017-09-30 15:47:26.884 CEST [6035] LOG: revoking synchronous replay lease for standby "walreceiver"...
2017-09-30 15:47:26.900 CEST [6035] LOG: standby "walreceiver" is no longer available for synchronous replay
2017-09-30 15:47:26.903 CEST [6197] LOG: revoking synchronous replay lease for standby "walreceiver"...
400000 of 10000000 tuples (4%) done (elapsed 2.44 s, remaining 58.62 s)
2017-09-30 15:47:27.979 CEST [6197] LOG: standby "walreceiver" is no longer available for synchronous replay
```
```
# replica
2017-09-30 15:47:51.802 CEST [6034] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number 0000 in log segment 000000010000000000000020, offset 10092544
2017-09-30 15:47:55.257 CEST [10508] LOG: started streaming WAL from primary at 0/20000000 on timeline 1
2017-09-30 15:48:09.622 CEST [10508] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
Is this something well known, or is it unrelated to the patch itself?
[1]: https://gist.github.com/erthalion/cdc9357f7437171192348239eb4db764
On Sun, Oct 1, 2017 at 9:05 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
Directory not empty
...
Hmm. The first error ("could not remove directory") could perhaps be
explained by temporary files from concurrent backends.
...
Perhaps in your testing you accidentally copied a pgdata directory over the
top of it while it was running? In any case I'm struggling to see how
anything in this patch would affect anything at the REDO level.

Hmm...no, I don't think so. Basically what I was doing is just running
`installcheck` against a primary instance (I assume there is nothing wrong
with this approach, am I right?). This particular error was caused by the
`tablespace` test, which failed in this case:

```
INSERT INTO testschema.foo VALUES(1);
ERROR: could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390":
No such file or directory
```

I tried a few more times, and I got it two times out of four attempts on a
fresh installation (when all instances were on the same machine). But anyway
I'll try to investigate; maybe it has something to do with my environment.
...
2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number 0000 in log
segment 000000010000000000000020, offset 10092544
Hi Dmitry,
Thanks for testing. Yeah, it looks like the patch may be corrupting
the WAL stream in some case that I didn't hit in my own testing
procedure. I will try to reproduce these failures.
--
Thomas Munro
http://www.enterprisedb.com
On Sun, Oct 1, 2017 at 10:03 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I tried a few more times, and I got it two times out of four attempts on a
fresh installation (when all instances were on the same machine). But anyway
I'll try to investigate; maybe it has something to do with my environment.
...
2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number 0000 in log
segment 000000010000000000000020, offset 10092544

Hi Dmitry,

Thanks for testing. Yeah, it looks like the patch may be corrupting
the WAL stream in some case that I didn't hit in my own testing
procedure. I will try to reproduce these failures.
Hi Dmitry,
I managed to reproduce something like this on one of my home lab
machines running a different OS. Not sure why yet and it doesn't
happen on my primary development box which is how I hadn't noticed it.
I will investigate and aim to get a fix posted in time for the
Commitfest. I'm also hoping to corner Simon at PGDay Australia in a
couple of weeks to discuss this proposal...
--
Thomas Munro
http://www.enterprisedb.com
On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I managed to reproduce something like this on one of my home lab
machines running a different OS. Not sure why yet and it doesn't
happen on my primary development box which is how I hadn't noticed it.
I will investigate and aim to get a fix posted in time for the
Commitfest. I'm also hoping to corner Simon at PGDay Australia in a
couple of weeks to discuss this proposal...
This leads me to think that marking this patch as "returned with feedback" is
appropriate for now. So I have done it this way. Feel free to correct things
if you think this is not appropriate, of course.
--
Michael
On Wed, Nov 29, 2017 at 2:58 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I managed to reproduce something like this on one of my home lab
machines running a different OS. Not sure why yet and it doesn't
happen on my primary development box which is how I hadn't noticed it.
I will investigate and aim to get a fix posted in time for the
Commitfest. I'm also hoping to corner Simon at PGDay Australia in a
couple of weeks to discuss this proposal...

This leads me to think that marking this patch as "returned with feedback" is
appropriate for now. So I have done it this way. Feel free to correct things
if you think this is not appropriate, of course.
Thanks. I'll be back.
--
Thomas Munro
http://www.enterprisedb.com
On Wed, Nov 29, 2017 at 11:04 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Wed, Nov 29, 2017 at 2:58 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Sat, Oct 28, 2017 at 6:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I managed to reproduce something like this on one of my home lab
machines running a different OS. Not sure why yet and it doesn't
happen on my primary development box which is how I hadn't noticed it.
I will investigate and aim to get a fix posted in time for the
Commitfest. I'm also hoping to corner Simon at PGDay Australia in a
couple of weeks to discuss this proposal...

This leads me to think that marking this patch as "returned with feedback" is
appropriate for now. So I have done it this way. Feel free to correct things
if you think this is not appropriate, of course.

Thanks. I'll be back.
Please do not target Sarah Connor then.
(Sorry, too many patches.)
--
Michael