Synchronous replay take III

Started by Thomas Munro, almost 8 years ago (19 messages)
#1 Thomas Munro
thomas.munro@enterprisedb.com
1 attachment(s)

Hi hackers,

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

In previous threads[1][2][3] I called this feature proposal "causal
reads". That was a terrible name, borrowed from MySQL. While it is
probably a useful term of art, for one thing people kept reading it as
"casual", which it ain't, and more importantly this patch is only one
way to achieve read-follows-write causal consistency. Several others
are proposed or exist in forks (user managed wait-for-LSN, global
transaction manager, ...).

OVERVIEW

For writers, it works a bit like RAID mirroring: when you commit a
write transaction, it waits until the data has become visible on all
elements of the array, and if an array element is not responding fast
enough it is kicked out of the array. For readers, it's a little
different because you're connected directly to the array elements
(rather than going through a central controller), so it uses a system
of leases allowing read transactions to know instantly whether
they are running on an element that is currently in the array and are
therefore able to service synchronous_replay transactions, or should
raise an error telling you to go and ask some other element.
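
As an illustration only (this is not code from the patch), the reader
side of the lease idea can be sketched in C as below. The names
toy_lease and toy_lease_is_valid are hypothetical, and the sketch
deliberately ignores the clock-skew allowance that a real
implementation has to build in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical standby-side lease state; not the patch's data structure. */
typedef struct toy_lease
{
    bool    granted;        /* has the primary granted a lease at all? */
    int64_t expires_at_us;  /* local clock time at which the lease expires */
} toy_lease;

/*
 * Return true if a synchronous_replay transaction may take a snapshot now.
 * Only local state is consulted, so no round trip to the primary is needed;
 * a caller would raise the "not available" error when this returns false.
 */
bool
toy_lease_is_valid(const toy_lease *lease, int64_t now_us)
{
    assert(lease != NULL);
    return lease->granted && now_us < lease->expires_at_us;
}
```

The point of the design is that the check is purely local: a standby
that has lost contact with the primary simply stops passing it once
its last lease runs out.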

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison d'être is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

The guarantee is: When two transactions tx1, tx2 are run with
synchronous_replay set to on and tx1 reports successful commit before
tx2 begins, then tx2 is guaranteed either to see tx1 or to raise a new
error 40P02 if it is run on a hot standby. I have joked that that
error means "snapshot too young". You could handle it the same way
you handle deadlocks and serialization failures: by retrying, except
in this case you might want to avoid that node for a while.
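
A minimal sketch of that client-side handling, assuming the client
keeps a per-node "avoid" flag, might look like the following C. The
helper names are hypothetical, the expiry of the avoid flag ("for a
while") is left out, and a real client would get the SQLSTATE from its
driver rather than from a string literal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* SQLSTATE proposed in this thread for "no lease on this standby". */
#define TOY_SQLSTATE_NO_LEASE "40P02"

/* Pick the first node that is not currently avoided, or -1 if none is left. */
int
toy_pick_node(const bool *avoid, int nnodes)
{
    assert(avoid != NULL && nnodes >= 0);
    for (int node = 0; node < nnodes; node++)
    {
        if (!avoid[node])
            return node;
    }
    return -1;
}

/*
 * Record the outcome of a transaction run on 'node'.  Returns true if the
 * caller should retry on another node, which this sketch does only for
 * 40P02; in that case the node is marked as avoided.
 */
bool
toy_should_retry(bool *avoid, int node, const char *sqlstate)
{
    assert(avoid != NULL && sqlstate != NULL && node >= 0);
    if (strcmp(sqlstate, TOY_SQLSTATE_NO_LEASE) == 0)
    {
        avoid[node] = true;
        return true;
    }
    return false;
}
```

A caller would loop: pick a node, run the transaction, and call
toy_should_retry with the resulting SQLSTATE until it returns false or
no node is left.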

Note that this feature is concerned with transaction visibility. It
is not concerned with transaction durability. It will happily kick
all of your misbehaving or slow standbys out of the array so that you
fall back to single-node commit durability. You can express your
durability requirement (i.e. I must have N copies of the data on
disk before I tell any external party about a transaction) separately,
by configuring regular synchronous replication alongside this feature.
I suspect that this feature would be most popular with people who are
already using regular synchronous replication though, because they
already tolerate higher commit latency.

STATUS

Here's a quick summary of the status of this proposal as I see it:

* Simon Riggs, as the committer most concerned with the areas this
proposal touches -- namely streaming replication and specifically
syncrep -- has not so far appeared to be convinced by the value of
this approach, and has expressed a preference for pursuing client-side
or middleware tracked LSN tokens exclusively. I am perceptive enough
to see that failing to sell the idea to Simon is probably fatal to the
proposal. The main task therefore is to show convincingly that there
is a real use case for this high-level design and its set of
trade-offs, and that it justifies its maintenance burden.

* I have tried to show that there are already many users who route
their read-only queries to hot standby databases (not just "reporting
queries"), and libraries and tools to help people do that using
heuristics like "logged in users need fresh data, so primary only" or
"this session has written in the past N minutes, so primary only".
This proposal would provide a way for those users to do something
based on a guarantee instead of such flimsy heuristics. I have tried
to show that the libraries used by Python, Ruby, Java etc to achieve
that sort of load balancing should easily be able to handle finding
read-only nodes, routing read-only queries and dealing with the new
error. I do also acknowledge that such libraries could also be used
to provide transparent read-my-writes support by tracking LSNs and
injecting wait-for-LSN directives with alternative proposals, but that
is weaker than a global reads-follow-writes guarantee and the
difference can matter.

* I have argued that token-based systems are in fact rather
complicated[4] and by no means a panacea. As usual, there are a whole
bunch of trade-offs. I suspect that this proposal AND fully
user-managed causality tokens (no middleware) are both valuable sweet
spots for a non-GTM system.

* Ants Aasma pointed out that this proposal doesn't provide a
read-follows-read guarantee. He is right, and I'm not sure to what
extent that is a problem, but I also think token-based systems can
probably only solve it with fairly high costs.

* Dmitry Dolgov reported a bug causing the replication protocol to get
corrupted on some OSs but not others[5]; could be uninitialised data
or size/padding/layout thinko or other stupid problem. (Gee, it would
be nice if the wire protocol writing and reading code were in reusable
functions instead of open-coded in multiple places... the bug could
be due to that). Unfortunately I haven't managed to track it down yet
and haven't had time to get back to this in time for the Commitfest
due to other work. Given the interest expressed by a reviewer in testing
this, which might result in that problem being figured out, I figured
I might as well post the rebased patch anyway, and I will also have
another look soon.

* As Andres Freund pointed out, this currently lacks tests. It should
be fairly easy to add TAP tests to exercise this code, in the style of
the existing tests for replication.

TESTING NOTES

Set up some hot standbys, put synchronous_replay_max_lag = 2s in the
primary's postgresql.conf, then set synchronous_replay = on in every
postgresql.conf or at least in every session that you want to test
with. Then generate various write workloads and observe the primary
server's log as the leases are granted and revoked, or check the status
in pg_stat_replication's replay_lag and sync_replay columns. Verify
that you can't successfully run synchronous_replay = on transactions
on standbys that don't currently have a lease, and that you can't
trick it by cutting your network cables with scissors or killing
random processes etc. You might want to verify my claims about clock
drift and the synchronous_replay_lease_time, either mathematically or
experimentally.
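
For the mathematical route, the rule stated in the attached patch's
documentation is that synchronous_replay_lease_time must be at least 4
times the maximum possible clock difference between primary and
standby. A small C sketch of that arithmetic, with hypothetical helper
names that do not appear in the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest clock difference (ms) tolerated for a given lease time (ms). */
int64_t
toy_max_clock_skew_ms(int64_t lease_time_ms)
{
    assert(lease_time_ms >= 0);
    return lease_time_ms / 4;
}

/* Is the lease time at least 4 times the worst-case clock difference? */
bool
toy_lease_time_ok(int64_t lease_time_ms, int64_t max_skew_ms)
{
    assert(lease_time_ms >= 0 && max_skew_ms >= 0);
    return lease_time_ms >= 4 * max_skew_ms;
}
```

With a lease time of 5s (5000 ms), toy_max_clock_skew_ms gives 1250 ms,
matching the 1.25 seconds quoted in the documentation.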

Thanks for reading!

[1]: /messages/by-id/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
[2]: /messages/by-id/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
[3]: /messages/by-id/CA+CSw_tz0q+FQsqh7Zx7xxF99Jm98VaAWGdEP592e7a+zkD_Mw@mail.gmail.com
[4]: /messages/by-id/CAEepm=0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ@mail.gmail.com
[5]: /messages/by-id/CAEepm=352uctNiFoN84UN4gtunbeTK-PBLouVe8i_b8ZPcJQFQ@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Synchronous-replay-mode-for-avoiding-stale-reads--v5.patch (application/octet-stream)
From be75f23de5ceaddbae4c806ff61e68987d9ac7b3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Synchronous replay mode for avoiding stale reads on hot
 standbys.

While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability.  When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.

Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.

To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02.  It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.

Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.

Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
             Joel Jacobson, Heikki Linnakangas, Michael Paquier,
             Robert Haas, Ants Aasma
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  87 +++++
 doc/src/sgml/high-availability.sgml           | 139 ++++++-
 doc/src/sgml/monitoring.sgml                  |  12 +
 src/backend/access/transam/xact.c             |   2 +-
 src/backend/catalog/system_views.sql          |   3 +-
 src/backend/postmaster/pgstat.c               |   6 +
 src/backend/replication/logical/worker.c      |   2 +
 src/backend/replication/syncrep.c             | 502 +++++++++++++++++++++-----
 src/backend/replication/walreceiver.c         |  82 ++++-
 src/backend/replication/walreceiverfuncs.c    |  19 +
 src/backend/replication/walsender.c           | 367 +++++++++++++++++--
 src/backend/utils/errcodes.txt                |   1 +
 src/backend/utils/misc/guc.c                  |  43 +++
 src/backend/utils/misc/postgresql.conf.sample |  19 +
 src/backend/utils/time/snapmgr.c              |  13 +
 src/bin/pg_basebackup/pg_recvlogical.c        |   6 +-
 src/bin/pg_basebackup/receivelog.c            |   5 +-
 src/include/catalog/pg_proc.h                 |   2 +-
 src/include/pgstat.h                          |   6 +-
 src/include/replication/syncrep.h             |  16 +-
 src/include/replication/walreceiver.h         |   9 +
 src/include/replication/walsender_private.h   |  20 +
 src/test/regress/expected/rules.out           |   5 +-
 23 files changed, 1214 insertions(+), 152 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 00fc364c0ab..cf2a4c024d8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2956,6 +2956,36 @@ include_dir 'conf.d'
      across the cluster without problems if that is required.
     </para>
 
+    <sect2 id="runtime-config-replication-all">
+     <title>All Servers</title>
+     <para>
+      These parameters can be set on the primary or any standby.
+     </para>
+     <variablelist>
+      <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+       <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables causal consistency between transactions run on different
+         servers.  A transaction that is run on a standby
+         with <varname>synchronous_replay</varname> set to <literal>on</literal> is
+         guaranteed either to see the effects of all completed transactions
+         run on the primary with the setting on, or to receive an error
+         "standby is not available for synchronous replay".  Note that both
+         transactions involved in a causal dependency (a write on the primary
+         followed by a read on any server which must see the write) must be
+         run with the setting on.  See <xref linkend="synchronous-replay"/> for
+         more details.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </sect2>
+
     <sect2 id="runtime-config-replication-sender">
      <title>Sending Server(s)</title>
 
@@ -3266,6 +3296,63 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><varname>synchronous_replay_max_lag</varname>
+      (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_max_lag</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum replay lag the primary will tolerate from a
+        standby before dropping it from the synchronous replay set.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_lease_time</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the duration of 'leases' sent by the primary server to
+        standbys granting them the right to run synchronous replay queries for
+        a limited time.  This affects the rate at which replacement leases
+        must be sent and the wait time if contact is lost with a standby.
+        This must be set to a value which is at least 4 times the maximum
+        possible difference in system clocks between the primary and standby
+        servers, as described in <xref linkend="synchronous-replay"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous-replay-standby-names">
+      <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>synchronous_replay_standby_names</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies a comma-separated list of standby names that can support
+        <firstterm>synchronous replay</firstterm>, as described in
+        <xref linkend="synchronous-replay"/>.  Follows the same convention
+        as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</literal></link>.
+        The default is <literal>*</literal>, matching all standbys.
+       </para>
+       <para>
+        This setting has no effect if <varname>synchronous_replay_max_lag</varname>
+        is not set.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 46bf198a2ac..55bc948f945 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1158,11 +1158,12 @@ primary_slot_name = 'node_a_slot'
    </para>
 
    <para>
-    Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
-    cause each commit to wait until the current synchronous standbys report
-    that they have replayed the transaction, making it visible to user
-    queries.  In simple cases, this allows for load balancing with causal
-    consistency.
+    Setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> will cause each commit to wait until
+    the current synchronous standbys report that they have replayed the
+    transaction, making it visible to user queries.  In simple cases, this
+    allows for load balancing with causal consistency.  See also
+    <xref linkend="synchronous-replay"/>.
    </para>
 
    <para>
@@ -1360,6 +1361,122 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="synchronous-replay">
+   <title>Synchronous replay</title>
+   <indexterm>
+    <primary>synchronous replay</primary>
+    <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+    The synchronous replay feature allows read-only queries to run on hot
+    standby servers without exposing stale data to the client, providing a
+    form of causal consistency.  Transactions can run on any standby with the
+    following guarantee about the visibility of preceding transactions: If you
+    set <varname>synchronous_replay</varname> to <literal>on</literal> in any
+    pair of consecutive transactions tx1, tx2 where tx2 begins after tx1
+    successfully returns, then tx2 will either see tx1 or fail with a new
+    error "standby is not available for synchronous replay", no matter which
+    server it runs on.  Although the guarantee is expressed in terms of two
+    individual transactions, the GUC can also be set at session, role or
+    system level to make the guarantee generally, allowing for load balancing
+    of applications that were not designed with load balancing in mind.
+   </para>
+
+   <para>
+    In order to enable the
+    feature, <varname>synchronous_replay_max_lag</varname> must be set to a
+    non-zero value on the primary server.  The
+    GUC <varname>synchronous_replay_standby_names</varname> can be used to
+    limit the set of standbys that can join the dynamic set of synchronous
+    replay standbys by providing a comma-separated list of application names.
+    By default, all standbys are candidates, if the feature is enabled.
+   </para>
+
+   <para>
+    The current set of servers that the primary considers to be available for
+    synchronous replay can be seen in
+    the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</literal></link>
+    view.  Administrators, applications and load balancing middleware can use
+    this view to discover standbys that can currently handle synchronous
+    replay transactions without raising the error.  Since that information is
+    only an instantaneous snapshot, clients should still be prepared for
+    the error to be raised at any time, and consider redirecting transactions
+    to another standby.
+   </para>
+
+   <para>
+    The advantages of the synchronous replay feature over simply
+    setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> are:
+    <orderedlist>
+      <listitem>
+       <para>
+        It provides certainty about exactly which standbys can see a
+        transaction.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It places a configurable limit on how much replay lag (and therefore
+        delay at commit time) the primary tolerates from standbys before it
+        drops them from the dynamic set of standbys it waits for.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It upholds the synchronous replay guarantee during the transitions that
+        occur when standbys are added to or removed from the set of standbys,
+        including scenarios where contact has been lost between the primary
+        and standbys but the standby is still alive and running client
+        queries.
+       </para>
+      </listitem>
+    </orderedlist>
+   </para>
+
+   <para>
+    The protocol used to uphold the guarantee even in the case of network
+    failure depends on the system clocks of the primary and standby servers
+    being synchronized, with an allowance for a difference up to one quarter
+    of <varname>synchronous_replay_lease_time</varname>.  For example,
+    if <varname>synchronous_replay_lease_time</varname> is set
+    to <literal>5s</literal>, then the clocks must not be more than 1.25
+    seconds apart for the guarantee to be upheld reliably during transitions.
+    The ubiquity of the Network Time Protocol (NTP) on modern operating
+    systems and availability of high quality time servers makes it possible to
+    choose a tolerance significantly higher than the maximum expected clock
+    difference.  An effort is nevertheless made to detect and report
+    misconfigured and faulty systems with clock differences greater than the
+    configured tolerance.
+   </para>
+
+   <note>
+    <para>
+     Current hardware clocks, NTP implementations and public time servers are
+     unlikely to allow the system clocks to differ more than tens or hundreds
+     of milliseconds, and systems synchronized with dedicated local time
+     servers may be considerably more accurate, but you should only consider
+     setting <varname>synchronous_replay_lease_time</varname> below the
+     default of 5 seconds (allowing up to 1.25 seconds of clock difference)
+     after researching your time synchronization infrastructure thoroughly.
+    </para>
+   </note>
+
+   <note>
+    <para>
+      While similar to synchronous commit in the sense that both involve the
+      primary server waiting for responses from standby servers, the
+      synchronous replay feature is not concerned with avoiding data loss.  A
+      primary configured for synchronous replay will drop all standbys that
+      stop responding or replay too slowly from the dynamic set that it waits
+      for, so you should consider configuring both synchronous replication and
+      synchronous replay if you need data loss avoidance guarantees and causal
+      consistency guarantees for load balancing.
+    </para>
+   </note>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous archiving in standby</title>
 
@@ -1708,7 +1825,17 @@ if (!triggered)
     so there will be a measurable delay between primary and standby. Running the
     same query nearly simultaneously on both primary and standby might therefore
     return differing results. We say that data on the standby is
-    <firstterm>eventually consistent</firstterm> with the primary.  Once the
+    <firstterm>eventually consistent</firstterm> with the primary by default.
+    The data visible to a transaction running on a standby can be
+    made <firstterm>causally consistent</firstterm> with respect to a
+    transaction that has completed on the primary by
+    setting <varname>synchronous_replay</varname> to <literal>on</literal> in
+    both transactions.  For more details,
+    see <xref linkend="synchronous-replay"/>.
+   </para>
+
+   <para>
+    Once the
     commit record for a transaction is replayed on the standby, the changes
     made by that transaction will be visible to any new snapshots taken on
     the standby.  Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3bc4de57d5a..12289b142c6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1908,6 +1908,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        </itemizedlist>
      </entry>
     </row>
+    <row>
+     <entry><structfield>sync_replay</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Synchronous replay state of this standby server.  This field will
+     be non-null only if <varname>synchronous_replay_max_lag</varname> is set.
+     If a standby is in <literal>available</literal> state, then it can
+     currently serve synchronous replay queries.  If it is not replaying fast
+     enough or not responding to keepalive messages, it will be
+     in <literal>unavailable</literal> state, and if it is currently
+     transitioning to availability it will be in <literal>joining</literal>
+     state for a short time.</entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dbaaf8e0053..67938a43b80 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5298,7 +5298,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	 * Check if the caller would like to ask standbys for immediate feedback
 	 * once this commit is applied.
 	 */
-	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
 		xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
 
 	/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5652e9ee6d0..453897f8fbc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -732,7 +732,8 @@ CREATE VIEW pg_stat_replication AS
             W.flush_lag,
             W.replay_lag,
             W.sync_priority,
-            W.sync_state
+            W.sync_state,
+            W.sync_replay
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 96ba2163878..cb8b1a15155 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3676,6 +3676,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY:
+			event_name = "SyncReplay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
@@ -3704,6 +3707,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_APPLY_DELAY:
 			event_name = "RecoveryApplyDelay";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+			event_name = "SyncReplayLeaseRevoke";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 04985c9f91d..8de1c04c0e9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1105,6 +1105,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						TimestampTz timestamp;
 						bool		reply_requested;
 
+						(void) pq_getmsgint64(&s); /* skip messageNumber */
 						end_lsn = pq_getmsgint64(&s);
 						timestamp = pq_getmsgint64(&s);
 						reply_requested = pq_getmsgbyte(&s);
@@ -1312,6 +1313,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, writepos);	/* apply */
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
+	pq_sendint64(reply_message, -1);		/* replyTo */
 
 	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 75d26817192..a595acc750f 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 					 XLogRecPtr *flushPtr,
@@ -128,6 +137,229 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Check if we can stop waiting for synchronous replay.  We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1.  All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2.  All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for.  The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+						  int *waitingFor,
+						  long *stallTimeMillis)
+{
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz stallTime = 0;
+	int i;
+
+	/* Count how many joining/available nodes we are waiting for. */
+	*waitingFor = 0;
+
+	for (i = 0; i < max_wal_senders; ++i)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		if (walsnd->pid != 0)
+		{
+			/*
+			 * We need to hold the spinlock to read LSNs, because we can't be
+			 * sure they can be read atomically.
+			 */
+			SpinLockAcquire(&walsnd->mutex);
+			if (walsnd->pid != 0)
+			{
+				switch (walsnd->syncReplayState)
+				{
+				case SYNC_REPLAY_UNAVAILABLE:
+					/* Nothing to wait for. */
+					break;
+				case SYNC_REPLAY_JOINING:
+				case SYNC_REPLAY_AVAILABLE:
+					/*
+					 * We have to wait until this standby tells us that it has
+					 * replayed the commit record.
+					 */
+					if (walsnd->apply < XactCommitLSN)
+						++*waitingFor;
+					break;
+				case SYNC_REPLAY_REVOKING:
+					/*
+					 * We have to hold up commits until this standby
+					 * acknowledges that its lease was revoked, or we know the
+					 * most recently sent lease has expired anyway, whichever
+					 * comes first.  One way or the other, we don't release
+					 * until this standby has started raising an error for
+					 * synchronous replay transactions.
+					 */
+					if (walsnd->revokingUntil > now)
+					{
+						++*waitingFor;
+						stallTime = Max(stallTime, walsnd->revokingUntil);
+					}
+					break;
+				}
+			}
+			SpinLockRelease(&walsnd->mutex);
+		}
+	}
+
+	/*
+	 * If a walsender has exited uncleanly, then it writes its revoking wait
+	 * time into a shared space before it gives up its WalSnd slot.  So we
+	 * have to wait for that too.
+	 */
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	if (WalSndCtl->revokingUntil > now)
+	{
+		long seconds;
+		int usecs;
+
+		/* Compute how long we have to wait, rounded up to nearest ms. */
+		TimestampDifference(now, WalSndCtl->revokingUntil,
+							&seconds, &usecs);
+		*stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+	}
+	else
+		*stallTimeMillis = 0;
+	LWLockRelease(SyncRepLock);
+
+	/* We are done if we are not waiting for any nodes or stalls. */
+	return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all standbys in "available" and "joining" states to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked.  By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+	long stallTimeMillis;
+	int waitingFor;
+	char *ps_display_buffer = NULL;
+
+	for (;;)
+	{
+		/* Reset latch before checking state. */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Join the queue to be woken up if any synchronous replay
+		 * joining/available standby applies XactCommitLSN or the set of
+		 * synchronous replay standbys changes (if we aren't already in the
+		 * queue).  We don't actually know if we need to wait for any peers to
+		 * reach the target LSN yet, but we have to register just in case
+		 * before checking the walsenders' state to avoid a race condition
+		 * that could occur if we did it after calling
+		 * SyncReplayCommitCanReturn.  (SyncRepWaitForLSN doesn't have
+		 * to do this because it can check the highest-seen LSN in
+		 * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+		 * lock as the queues.  We can't do that here, because there is no
+		 * single highest-seen LSN that is useful.  We must check
+		 * walsnd->apply for all relevant walsenders.  Therefore we must
+		 * register for notifications first, so that we can be notified via
+		 * our latch of any standby applying the LSN we're interested in after
+		 * we check but before we start waiting, or we could wait forever for
+		 * something that has already happened.)
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		if (MyProc->syncRepState != SYNC_REP_WAITING)
+		{
+			MyProc->waitLSN = XactCommitLSN;
+			MyProc->syncRepState = SYNC_REP_WAITING;
+			SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+			Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+		}
+		LWLockRelease(SyncRepLock);
+
+		/* Check if we're done. */
+		if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+									  &stallTimeMillis))
+		{
+			SyncRepCancelWait();
+			break;
+		}
+
+		Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+		/* If we aren't actually waiting for any standbys, leave the queue. */
+		if (waitingFor == 0)
+			SyncRepCancelWait();
+
+		/* Update the ps title. */
+		if (update_process_title)
+		{
+			char buffer[80];
+
+			/* Remember the old value if this is our first update. */
+			if (ps_display_buffer == NULL)
+			{
+				int len;
+				const char *ps_display = get_ps_display(&len);
+
+				ps_display_buffer = palloc(len + 1);
+				memcpy(ps_display_buffer, ps_display, len);
+				ps_display_buffer[len] = '\0';
+			}
+
+			snprintf(buffer, sizeof(buffer),
+					 "waiting for %d peer(s) to apply %X/%X%s",
+					 waitingFor,
+					 (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+					 stallTimeMillis > 0 ? " (revoking)" : "");
+			set_ps_display(buffer, false);
+		}
+
+		/* Check if we need to exit early due to postmaster death etc. */
+		if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+			break;
+
+		/*
+		 * If we are still waiting for peers, then we wait for any joining or
+		 * available peer to reach the LSN (or possibly stop being in one of
+		 * those states or go away).
+		 *
+		 * If not, there must be a non-zero stall time, so we wait for that to
+		 * elapse.
+		 */
+		if (waitingFor > 0)
+			WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+					  WAIT_EVENT_SYNC_REPLAY);
+		else
+			WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+					  stallTimeMillis,
+					  WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+	}
+
+	/* There is no way out of the loop that could leave us in the queue. */
+	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	MyProc->waitLSN = 0;
+
+	/* Restore the ps display. */
+	if (ps_display_buffer != NULL)
+	{
+		set_ps_display(ps_display_buffer, false);
+		pfree(ps_display_buffer);
+	}
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -149,11 +381,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
-	/* Cap the level for anything other than commit to remote flush only. */
-	if (commit)
-		mode = SyncRepWaitMode;
-	else
-		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+	/* Wait for synchronous replay, if configured. */
+	if (synchronous_replay)
+		SyncReplayWaitForLSN(lsn);
 
 	/*
 	 * Fast exit if user has not requested sync replication.
@@ -167,6 +397,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
+	/* Cap the level for anything other than commit to remote flush only. */
+	if (commit)
+		mode = SyncRepWaitMode;
+	else
+		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 	 * set.  See SyncRepUpdateSyncStandbysDefined.
@@ -227,57 +463,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
 			break;
 
-		/*
-		 * If a wait for synchronous replication is pending, we can neither
-		 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
-		 * lead the client to believe that the transaction aborted, which is
-		 * not true: it's already committed locally. The former is no good
-		 * either: the client has requested synchronous replication, and is
-		 * entitled to assume that an acknowledged commit is also replicated,
-		 * which might not be true. So in this case we issue a WARNING (which
-		 * some clients may be able to interpret) and shut off further output.
-		 * We do NOT reset ProcDiePending, so that the process will die after
-		 * the commit is cleaned up.
-		 */
-		if (ProcDiePending)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
+		/* Check if we need to break early due to cancel/shutdown/death. */
+		if (SyncRepCheckForEarlyExit())
 			break;
-		}
-
-		/*
-		 * It's unclear what to do if a query cancel interrupt arrives.  We
-		 * can't actually abort at this point, but ignoring the interrupt
-		 * altogether is not helpful, so we just terminate the wait with a
-		 * suitable warning.
-		 */
-		if (QueryCancelPending)
-		{
-			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			SyncRepCancelWait();
-			break;
-		}
-
-		/*
-		 * If the postmaster dies, we'll probably never get an
-		 * acknowledgement, because all the wal sender processes will exit. So
-		 * just bail out.
-		 */
-		if (!PostmasterIsAlive())
-		{
-			ProcDiePending = true;
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
-			break;
-		}
 
 		/*
 		 * Wait on latch.  Any condition that should wake us up will set the
@@ -399,15 +587,66 @@ SyncRepInitConfig(void)
 	}
 }
 
+/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+	char *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		found = false;
+
+	/* If the feature is disabled, then no. */
+	if (synchronous_replay_max_lag == 0)
+		return false;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(synchronous_replay_standby_names);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		list_free(elemlist);
+		/* GUC machinery will have already complained - no need to do again */
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *standby_name = (char *) lfirst(l);
+
+		if (pg_strcasecmp(standby_name, application_name) == 0 ||
+			pg_strcasecmp(standby_name, "*") == 0)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return found;
+}
+
 /*
  * Update the LSNs on each queue based upon our latest state. This
  * implements a simple policy of first-valid-sync-standby-releases-waiter.
  *
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
  * Other policies are possible, which would change what we do here and
  * perhaps also which information we store as well.
  */
 void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	XLogRecPtr	writePtr;
@@ -421,13 +660,15 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
-	 * potential sync standbys then we have nothing to do. If we are still
-	 * starting up, still running base backup or the current flush position is
-	 * still invalid, then leave quickly also.
+	 * potential sync standbys and not in a state that synchronous_replay waits
+	 * for, then we have nothing to do. If we are still starting up, still
+	 * running base backup or the current flush position is still invalid,
+	 * then leave quickly also.
 	 */
-	if (MyWalSnd->sync_standby_priority == 0 ||
-		MyWalSnd->state < WALSNDSTATE_STREAMING ||
-		XLogRecPtrIsInvalid(MyWalSnd->flush))
+	if (!am_syncreplay_blocker &&
+		(MyWalSnd->sync_standby_priority == 0 ||
+		 MyWalSnd->state < WALSNDSTATE_STREAMING ||
+		 XLogRecPtrIsInvalid(MyWalSnd->flush)))
 	{
 		announce_next_takeover = true;
 		return;
@@ -465,9 +706,10 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
+	 * managing a sync standby or a standby in synchronous replay state that
+	 * blocks then just leave.
 	 */
-	if (!got_recptr || !am_sync)
+	if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -476,24 +718,36 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
-	 * this location.
+	 * this location, for backends waiting for synchronous commit.
 	 */
-	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+	if (got_recptr && am_sync)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+			numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+			numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+			numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+		}
 	}
 
+	/*
+	 * Wake backends that are waiting for synchronous_replay, if this walsender
+	 * manages a standby that is in synchronous replay 'available' or 'joining'
+	 * state.
+	 */
+	if (am_syncreplay_blocker)
+		SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+						 MyWalSnd->apply);
+
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -991,9 +1245,8 @@ SyncRepGetStandbyPriority(void)
  * Must hold SyncRepLock.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
 {
-	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
 	PGPROC	   *thisproc = NULL;
 	int			numprocs = 0;
@@ -1010,7 +1263,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Assume the queue is ordered by LSN
 		 */
-		if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+		if (!all && lsn < proc->waitLSN)
 			return numprocs;
 
 		/*
@@ -1077,7 +1330,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
 		}
 
 		/*
@@ -1128,6 +1381,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
 }
 #endif
 
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+	/*
+	 * If a wait for synchronous replication is pending, we can neither
+	 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
+	 * lead the client to believe that the transaction aborted, which is
+	 * not true: it's already committed locally. The former is no good
+	 * either: the client has requested synchronous replication, and is
+	 * entitled to assume that an acknowledged commit is also replicated,
+	 * which might not be true. So in this case we issue a WARNING (which
+	 * some clients may be able to interpret) and shut off further output.
+	 * We do NOT reset ProcDiePending, so that the process will die after
+	 * the commit is cleaned up.
+	 */
+	if (ProcDiePending)
+	{
+		ereport(WARNING,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * It's unclear what to do if a query cancel interrupt arrives.  We
+	 * can't actually abort at this point, but ignoring the interrupt
+	 * altogether is not helpful, so we just terminate the wait with a
+	 * suitable warning.
+	 */
+	if (QueryCancelPending)
+	{
+		QueryCancelPending = false;
+		ereport(WARNING,
+				(errmsg("canceling wait for synchronous replication due to user request"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * If the postmaster dies, we'll probably never get an
+	 * acknowledgement, because all the wal sender processes will exit. So
+	 * just bail out.
+	 */
+	if (!PostmasterIsAlive())
+	{
+		ProcDiePending = true;
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ===========================================================
  * Synchronous Replication functions executed by any process
@@ -1197,6 +1508,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
 	SyncRepConfig = (SyncRepConfigData *) extra;
 }
 
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return true;
+}
+
 void
 assign_synchronous_commit(int newval, void *extra)
 {
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a39a98ff187..551b09338f8 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -57,6 +57,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -139,9 +140,10 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
 static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+								  TimestampTz *syncReplayLease);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -472,7 +474,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, -1);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -517,7 +519,7 @@ WalReceiverMain(void)
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, -1);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -575,7 +577,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, -1);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -880,6 +882,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 	XLogRecPtr	walEnd;
 	TimestampTz sendTime;
 	bool		replyRequested;
+	TimestampTz syncReplayLease;
+	int64		messageNumber;
 
 	resetStringInfo(&incoming_message);
 
@@ -899,7 +903,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				dataStart = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, NULL);
 
 				buf += hdrlen;
 				len -= hdrlen;
@@ -909,7 +913,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 		case 'k':				/* Keepalive */
 			{
 				/* copy message to StringInfo */
-				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+					sizeof(char) + sizeof(int64);
 				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -917,15 +922,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
 
 				/* read the fields */
+				messageNumber = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
 				replyRequested = pq_getmsgbyte(&incoming_message);
+				syncReplayLease = pq_getmsgint64(&incoming_message);
 
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, messageNumber);
 				break;
 			}
 		default:
@@ -1088,7 +1095,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, -1);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1106,9 +1113,12 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbearts, when approaching
  * wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should be that message's number, otherwise -1.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
@@ -1155,6 +1165,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+	pq_sendint64(&reply_message, replyTo);
 
 	/* Send it */
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1287,15 +1298,56 @@ XLogWalRcvSendHSFeedback(bool immed)
  * Update shared memory status upon receiving a message from primary.
  *
  * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary.  'syncReplayLease' points to the time until
+ * which the primary promises this standby can safely claim to be causally
+ * consistent, points to 0 if it cannot, or is NULL for no change.
  */
 static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+					  TimestampTz *syncReplayLease)
 {
 	WalRcvData *walrcv = WalRcv;
 
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
 
+	/* Sanity check for the syncReplayLease time. */
+	if (syncReplayLease != NULL && *syncReplayLease != 0)
+	{
+		/*
+		 * Deduce max_clock_skew from the syncReplayLease and sendTime since
+		 * we don't have access to the primary's GUC.  The primary already
+		 * subtracted 25% from synchronous_replay_lease_time to represent
+		 * max_clock_skew, so we have 75%.  A third of that will give us 25%.
+		 */
+		int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+		int64 max_clock_skew = diffMillis / 3;
+		if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+												   max_clock_skew))
+		{
+			/*
+			 * The primary's clock is more than max_clock_skew + network
+			 * latency ahead of the standby's clock.  (If the primary's clock
+			 * is more than max_clock_skew ahead of the standby's clock, but
+			 * by less than the network latency, then there isn't much we can
+			 * do to detect that; but it still seems useful to have this basic
+			 * sanity check for wildly misconfigured servers.)
+			 */
+			ereport(LOG,
+					(errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+					 errhint("Check your servers' NTP configuration or equivalent.")));
+
+			syncReplayLease = NULL;
+		}
+		/*
+		 * We could also try to detect cases where sendTime is more than
+		 * max_clock_skew in the past according to the standby's clock, but
+		 * that is indistinguishable from network latency/buffering, so we
+		 * could produce misleading error messages; if we do nothing, the
+		 * consequence is 'standby is not available for synchronous replay'
+		 * errors which should cause the user to investigate.
+		 */
+	}
+
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
@@ -1303,6 +1355,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+	if (syncReplayLease != NULL)
+		walrcv->syncReplayLease = *syncReplayLease;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (log_min_messages <= DEBUG2)
@@ -1340,7 +1394,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
  */
 void
 WalRcvForceReply(void)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 67b1a074cce..600f974668c 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
 #include "replication/walreceiver.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/guc.h"
 #include "utils/timestamp.h"
 
 WalRcvData *WalRcv = NULL;
@@ -376,3 +377,21 @@ GetReplicationTransferLatency(void)
 
 	return ms;
 }
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+	WalRcvData *walrcv = WalRcv;
+	TimestampTz now = GetCurrentTimestamp();
+	bool result;
+
+	SpinLockAcquire(&walrcv->mutex);
+	result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+	SpinLockRelease(&walrcv->mutex);
+
+	return result;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d46374ddce9..e9157828a3c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -168,9 +168,23 @@ static StringInfoData tmpbuf;
  */
 static TimestampTz last_reply_timestamp = 0;
 
+static TimestampTz last_keepalive_timestamp = 0;
+
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
 
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
 /*
  * While streaming WAL in Copy mode, streamingDoneSending is set to true
  * after we have sent CopyDone. We should not send any more CopyData messages
@@ -240,7 +254,7 @@ static void ProcessStandbyMessage(void);
 static void ProcessStandbyReplyMessage(void);
 static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(TimestampTz now);
 static void WalSndCheckTimeOut(TimestampTz now);
 static long WalSndComputeSleeptime(TimestampTz now);
@@ -281,6 +295,61 @@ InitWalSender(void)
 	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
+/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * We've lost contact with the standby, but it may still be alive.  We
+		 * can't let any committing synchronous_replay transactions return
+		 * control until we've stalled for long enough for a zombie standby to
+		 * start raising errors because its lease has expired.  Because our
+		 * WalSnd slot is going away, we need to use the shared
+		 * WalSndCtl->revokingUntil variable.
+		 */
+		elog(LOG,
+			 "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+			 application_name);
+
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+									   synchronous_replay_last_lease);
+		LWLockRelease(SyncRepLock);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * The standby is shutting down, so it won't be running any more
+		 * transactions.  It is therefore safe to stop waiting for it without
+		 * any kind of lease revocation protocol.
+		 */
+		elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
 /*
  * Clean up after an error.
  *
@@ -309,7 +378,10 @@ WalSndErrorCleanup(void)
 	replication_active = false;
 
 	if (got_STOPPING || got_SIGUSR2)
+	{
+		PrepareUncleanExit();
 		proc_exit(0);
+	}
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
@@ -321,6 +393,8 @@ WalSndErrorCleanup(void)
 static void
 WalSndShutdown(void)
 {
+	PrepareUncleanExit();
+
 	/*
 	 * Reset whereToSendOutput to prevent ereport from attempting to send any
 	 * more messages to the standby.
@@ -1602,6 +1676,7 @@ ProcessRepliesIfAny(void)
 		if (r < 0)
 		{
 			/* unexpected error or EOF */
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1618,6 +1693,7 @@ ProcessRepliesIfAny(void)
 		resetStringInfo(&reply_message);
 		if (pq_getmessage(&reply_message, 0))
 		{
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1667,6 +1743,7 @@ ProcessRepliesIfAny(void)
 				 * 'X' means that the standby is closing down the socket.
 				 */
 			case 'X':
+				PrepareCleanExit();
 				proc_exit(0);
 
 			default:
@@ -1764,9 +1841,11 @@ ProcessStandbyReplyMessage(void)
 				flushLag,
 				applyLag;
 	bool		clearLagTimes;
+	int64		replyTo;
 	TimestampTz now;
 
 	static bool fullyAppliedLastTime = false;
+	static TimestampTz fullyAppliedSince = 0;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1774,6 +1853,7 @@ ProcessStandbyReplyMessage(void)
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);	/* sendTime; not used ATM */
 	replyRequested = pq_getmsgbyte(&reply_message);
+	replyTo = pq_getmsgint64(&reply_message);
 
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1788,17 +1868,17 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL for at least
+	 * wal_receiver_status_interval, then let's clear the lag times that were
+	 * measured when it last wrote/flushed/applied a WAL record.  This way we
+	 * avoid displaying stale lag data until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
 	if (applyPtr == sentPtr)
 	{
-		if (fullyAppliedLastTime)
+		if (!fullyAppliedLastTime)
+			fullyAppliedSince = now;
+		else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
 			clearLagTimes = true;
 		fullyAppliedLastTime = true;
 	}
@@ -1814,8 +1894,53 @@ ProcessStandbyReplyMessage(void)
 	 * standby.
 	 */
 	{
+		int			next_sr_state = -1;
 		WalSnd	   *walsnd = MyWalSnd;
 
+		/* Handle synchronous replay state machine. */
+		if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+		{
+			bool replay_lag_acceptable;
+
+			/* Check if the lag is acceptable (includes -1 for caught up). */
+			if (applyLag < synchronous_replay_max_lag * 1000)
+				replay_lag_acceptable = true;
+			else
+				replay_lag_acceptable = false;
+
+			/* Figure out if the state needs to change. */
+			switch (walsnd->syncReplayState)
+			{
+			case SYNC_REPLAY_UNAVAILABLE:
+				/* Can we join? */
+				if (replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_JOINING;
+				break;
+			case SYNC_REPLAY_JOINING:
+				/* Are we still applying fast enough? */
+				if (replay_lag_acceptable)
+				{
+					/* Have we reached the join point yet? */
+					if (applyPtr >= synchronous_replay_joining_until)
+						next_sr_state = SYNC_REPLAY_AVAILABLE;
+				}
+				else
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			case SYNC_REPLAY_AVAILABLE:
+				/* Are we still applying fast enough? */
+				if (!replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_REVOKING;
+				break;
+			case SYNC_REPLAY_REVOKING:
+				/* Has the revocation been acknowledged or timed out? */
+				if (replyTo == synchronous_replay_revoke_msgno ||
+					now >= walsnd->revokingUntil)
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			}
+		}
+
 		SpinLockAcquire(&walsnd->mutex);
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
@@ -1826,11 +1951,55 @@ ProcessStandbyReplyMessage(void)
 			walsnd->flushLag = flushLag;
 		if (applyLag != -1 || clearLagTimes)
 			walsnd->applyLag = applyLag;
+		if (next_sr_state != -1)
+			walsnd->syncReplayState = next_sr_state;
+		if (next_sr_state == SYNC_REPLAY_REVOKING)
+			walsnd->revokingUntil = synchronous_replay_last_lease;
 		SpinLockRelease(&walsnd->mutex);
+
+		/*
+		 * Post shmem-update actions for synchronous replay state transitions.
+		 */
+		switch (next_sr_state)
+		{
+		case SYNC_REPLAY_JOINING:
+			/*
+			 * Now that we've started waiting for this standby, we need to
+			 * make sure that everything flushed before now has been applied
+			 * before we move to available and issue a lease.
+			 */
+			synchronous_replay_joining_until = GetFlushRecPtr();
+			ereport(LOG,
+					(errmsg("standby \"%s\" joining synchronous replay set...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_AVAILABLE:
+			/* Issue a new lease to the standby. */
+			WalSndKeepalive(false);
+			ereport(LOG,
+					(errmsg("standby \"%s\" is available for synchronous replay",
+							application_name)));
+			break;
+		case SYNC_REPLAY_REVOKING:
+			/* Revoke the standby's lease, and note the message number. */
+			synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+			ereport(LOG,
+					(errmsg("revoking synchronous replay lease for standby \"%s\"...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_UNAVAILABLE:
+			ereport(LOG,
+					(errmsg("standby \"%s\" is no longer available for synchronous replay",
+							application_name)));
+			break;
+		default:
+			/* No change. */
+			break;
+		}
 	}
 
 	if (!am_cascading_walsender)
-		SyncRepReleaseWaiters();
+		SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
 
 	/*
 	 * Advance our local xmin horizon when the client confirmed a flush.
@@ -2020,33 +2189,52 @@ ProcessStandbyHSFeedbackMessage(void)
  * If wal_sender_timeout is enabled we want to wake up in time to send
  * keepalives and to abort the connection if wal_sender_timeout has been
  * reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
  */
 static long
 WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;	/* 10 s */
 
-	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+	if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+		am_potential_synchronous_replay_standby)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
 		int			microsec_to_timeout;
 
-		/*
-		 * At the latest stop sleeping once wal_sender_timeout has been
-		 * reached.
-		 */
-		wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-												  wal_sender_timeout);
-
-		/*
-		 * If no ping has been sent yet, wakeup when it's time to do so.
-		 * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
-		 * the timeout passed without a response.
-		 */
-		if (!waiting_for_ping_response)
+		if (am_potential_synchronous_replay_standby)
+		{
+			/*
+			 * We need to keep replacing leases before they expire.  We'll do
+			 * that halfway through the lease time according to our clock, to
+			 * allow for the standby's clock to be ahead of the primary's by
+			 * 25% of synchronous_replay_lease_time.
+			 */
+			wakeup_time =
+				TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+											synchronous_replay_lease_time / 2);
+		}
+		else
+		{
+			/*
+			 * At the latest stop sleeping once wal_sender_timeout has been
+			 * reached.
+			 */
 			wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-													  wal_sender_timeout / 2);
+													  wal_sender_timeout);
+
+			/*
+			 * If no ping has been sent yet, wakeup when it's time to do so.
+			 * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+			 * half of the timeout passed without a response.
+			 */
+			if (!waiting_for_ping_response)
+				wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+														  wal_sender_timeout / 2);
+		}
 
 		/* Compute relative time until wakeup. */
 		TimestampDifference(now, wakeup_time,
@@ -2062,20 +2250,33 @@ WalSndComputeSleeptime(TimestampTz now)
 /*
  * Check whether there have been responses by the client within
  * wal_sender_timeout and shutdown if not.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
  */
 static void
 WalSndCheckTimeOut(TimestampTz now)
 {
 	TimestampTz timeout;
+	int allowed_time;
 
 	/* don't bail out if we're doing something that doesn't require timeouts */
 	if (last_reply_timestamp <= 0)
 		return;
 
-	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
-										  wal_sender_timeout);
+	/*
+	 * If synchronous replay support is configured, we use
+	 * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+	 * the time before an unresponsive synchronous replay standby is dropped.
+	 */
+	if (am_potential_synchronous_replay_standby)
+		allowed_time = synchronous_replay_lease_time;
+	else
+		allowed_time = wal_sender_timeout;
 
-	if (wal_sender_timeout > 0 && now >= timeout)
+	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+										  allowed_time);
+	if (allowed_time > 0 && now >= timeout)
 	{
 		/*
 		 * Since typically expiration of replication timeout means
@@ -2100,6 +2301,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
+	/* Check if we are managing a potential synchronous replay standby. */
+	am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
@@ -2265,6 +2469,7 @@ InitWalSenderSlot(void)
 			walsnd->flushLag = -1;
 			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
+			walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
@@ -3147,6 +3352,27 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+	switch (state)
+	{
+	case SYNC_REPLAY_UNAVAILABLE:
+		return "unavailable";
+	case SYNC_REPLAY_JOINING:
+		return "joining";
+	case SYNC_REPLAY_AVAILABLE:
+		return "available";
+	case SYNC_REPLAY_REVOKING:
+		return "revoking";
+	}
+	return "UNKNOWN";
+}
+
 static Interval *
 offset_to_interval(TimeOffset offset)
 {
@@ -3166,7 +3392,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	11
+#define PG_STAT_GET_WAL_SENDERS_COLS	12
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3220,6 +3446,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			priority;
 		int			pid;
 		WalSndState state;
+		SyncReplayState syncReplayState;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3232,6 +3459,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		pid = walsnd->pid;
 		sentPtr = walsnd->sentPtr;
 		state = walsnd->state;
+		syncReplayState = walsnd->syncReplayState;
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
@@ -3315,6 +3543,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[10] = CStringGetTextDatum("potential");
+
+			values[11] =
+				CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3330,21 +3561,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
   * This function is used to send a keepalive message to standby.
   * If requestReply is set, sets a flag in the message requesting the standby
   * to send a message back to us, for heartbeat purposes.
+  * Return the serial number of the message that was sent.
   */
-static void
+static int64
 WalSndKeepalive(bool requestReply)
 {
+	TimestampTz synchronous_replay_lease;
+	TimestampTz now;
+
+	static int64 message_number = 0;
+
 	elog(DEBUG2, "sending replication keepalive");
 
+	/* Grant a synchronous replay lease if appropriate. */
+	now = GetCurrentTimestamp();
+	if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+	{
+		/* No lease granted, and any earlier lease is revoked. */
+		synchronous_replay_lease = 0;
+	}
+	else
+	{
+		/*
+		 * Since this timestamp is being sent to the standby where it will be
+		 * compared against a time generated by the standby's system clock, we
+		 * must consider clock skew.  We use 25% of the lease time as max
+		 * clock skew, and we subtract that from the time we send with the
+		 * following reasoning:
+		 *
+		 * 1.  If the standby's clock is slow (ie behind the primary's) by up
+		 * to that much, then subtracting this amount will make sure the
+		 * lease doesn't survive past that time according to the primary's
+		 * clock.
+		 *
+		 * 2.  If the standby's clock is fast (ie ahead of the primary's) by
+		 * up to that much, then by subtracting this amount there won't be any
+		 * gaps between leases, since leases are reissued every time 50% of
+		 * the lease time elapses (see WalSndKeepaliveIfNecessary and
+		 * WalSndComputeSleeptime).
+		 */
+		int max_clock_skew = synchronous_replay_lease_time / 4;
+
+		/* Compute and remember the expiry time of the lease we're granting. */
+		synchronous_replay_last_lease =
+			TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+		/* Adjust the version we send for clock skew. */
+		synchronous_replay_lease =
+			TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+										-max_clock_skew);
+	}
+
 	/* construct the message... */
 	resetStringInfo(&output_message);
 	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, ++message_number);
 	pq_sendint64(&output_message, sentPtr);
-	pq_sendint64(&output_message, GetCurrentTimestamp());
+	pq_sendint64(&output_message, now);
 	pq_sendbyte(&output_message, requestReply ? 1 : 0);
+	pq_sendint64(&output_message, synchronous_replay_lease);
 
 	/* ... and send it wrapped in CopyData */
 	pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+	return message_number;
 }
 
 /*
@@ -3359,23 +3638,35 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
-	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
-		return;
-
-	if (waiting_for_ping_response)
-		return;
+	if (!am_potential_synchronous_replay_standby)
+	{
+		if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+			return;
+		if (waiting_for_ping_response)
+			return;
+	}
 
 	/*
 	 * If half of wal_sender_timeout has lapsed without receiving any reply
 	 * from the standby, send a keep-alive message to the standby requesting
 	 * an immediate reply.
+	 *
+	 * If synchronous replay has been configured, use
+	 * synchronous_replay_lease_time to control keepalive intervals rather
+	 * than wal_sender_timeout, so that we can keep replacing leases at the
+	 * right frequency.
 	 */
-	ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-											wal_sender_timeout / 2);
+	if (am_potential_synchronous_replay_standby)
+		ping_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+												synchronous_replay_lease_time / 2);
+	else
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												wal_sender_timeout / 2);
 	if (now >= ping_time)
 	{
 		WalSndKeepalive(true);
 		waiting_for_ping_response = true;
+		last_keepalive_timestamp = now;
 
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
@@ -3415,7 +3706,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 */
 	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
 	buffer_full = false;
-	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
 	{
 		if (new_write_head == LagTracker.read_heads[i])
 			buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 9871d1e7931..ff49fa34d23 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -308,6 +308,7 @@ Section: Class 40 - Transaction Rollback
 40001    E    ERRCODE_T_R_SERIALIZATION_FAILURE                              serialization_failure
 40003    E    ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN                       statement_completion_unknown
 40P01    E    ERRCODE_T_R_DEADLOCK_DETECTED                                  deadlock_detected
+40P02    E    ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE                   synchronous_replay_not_available
 
 Section: Class 42 - Syntax Error or Access Rule Violation
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1db7845d5ab..3696490c5ad 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1675,6 +1675,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+		 gettext_noop("Enables synchronous replay."),
+		 NULL
+		},
+		&synchronous_replay,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
@@ -2922,6 +2932,28 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_max_lag,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_lease_time,
+		5000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3603,6 +3635,17 @@ static struct config_string ConfigureNamesString[] =
 		check_synchronous_standby_names, assign_synchronous_standby_names, NULL
 	},
 
+	{
+		{"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("List of names of potential synchronous replay standbys."),
+			NULL,
+			GUC_LIST_INPUT
+		},
+		&synchronous_replay_standby_names,
+		"*",
+		check_synchronous_replay_standby_names, NULL, NULL
+	},
+
 	{
 		{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
 			gettext_noop("Sets default text search configuration."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39272925fb7..fb2a8ed949f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -254,6 +254,17 @@
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
+#synchronous_replay_max_lag = 0s	# maximum replication delay to tolerate from
+					# standbys before dropping them from the synchronous
+					# replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s		# how long individual leases granted to
+					# synchronous replay standbys should last; should be 4 times
+					# the max possible clock skew
+
+#synchronous_replay_standby_names = '*'	# standby servers that can join the
+					# synchronous replay set; '*' = all
+
 # - Standby Servers -
 
 # These settings are ignored on a master server.
@@ -284,6 +295,14 @@
 					# (change requires restart)
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
 
+# - All Servers -
+
+#synchronous_replay = off			# "on" in any pair of consecutive
+					# transactions guarantees that the second
+					# can see the first (even if the second
+					# is run on a standby), or will raise an
+					# error to report that the standby is
+					# unavailable for synchronous replay
 
 #------------------------------------------------------------------------------
 # QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index e58c69dbd73..73156ef9b6a 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
 #include "catalog/catalog.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -331,6 +333,17 @@ GetTransactionSnapshot(void)
 			elog(ERROR,
 				 "cannot take query snapshot during a parallel operation");
 
+		/*
+		 * In synchronous_replay mode on a standby, check if we have definitely
+		 * applied WAL for any COMMIT that returned successfully on the
+		 * primary.
+		 */
+		if (synchronous_replay && RecoveryInProgress() &&
+			!WalRcvSyncReplayAvailable())
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+					 errmsg("standby is not available for synchronous replay")));
+
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
 		 * end of xact regardless of what the caller does with it, so we must
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 53e4661d680..8d02037411a 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -117,7 +117,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	/*
@@ -150,6 +150,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	startpos = output_written_lsn;
 	last_written_lsn = output_written_lsn;
@@ -469,6 +471,8 @@ StreamLogicalLog(void)
 			 * rest.
 			 */
 			pos = 1;			/* skip msgtype 'k' */
+			pos += 8;			/* skip messageNumber */
+
 			walEnd = fe_recvint64(&copybuf[pos]);
 			output_written_lsn = Max(walEnd, output_written_lsn);
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 10768786301..a801224ad94 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -328,7 +328,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -346,6 +346,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
 	{
@@ -1016,6 +1018,7 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 	 * check if the server requested a reply, and ignore the rest.
 	 */
 	pos = 1;					/* skip msgtype 'k' */
+	pos += 8;					/* skip messageNumber */
 	pos += 8;					/* skip walEnd */
 	pos += 8;					/* skip sendTime */
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c00d055940c..119553bb727 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2903,7 +2903,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity			PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info			  PGNSP PGUID 12 1 100 0 0 f f f f t t s r 1 0 2249 "25" "{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" "{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}" _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25}" "{o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders	PGNSP PGUID 12 1 10 0 0 f f f f f t s r 0 0 2249 "" "{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}" _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" "{o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}" _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239bf..635b58e79a1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_SYNC_REPLAY
 } WaitEventIPC;
 
 /* ----------
@@ -845,7 +846,8 @@ typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
 	WAIT_EVENT_PG_SLEEP,
-	WAIT_EVENT_RECOVERY_APPLY_DELAY
+	WAIT_EVENT_RECOVERY_APPLY_DELAY,
+	WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index bc43b4e1090..6a5bfcbb9ce 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
 
 #include "access/xlogdefs.h"
 #include "utils/guc.h"
+#include "utils/timestamp.h"
 
 #define SyncRepRequested() \
 	(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
 #define SYNC_REP_WAIT_WRITE		0
 #define SYNC_REP_WAIT_FLUSH		1
 #define SYNC_REP_WAIT_APPLY		2
+#define SYNC_REP_WAIT_SYNC_REPLAY	3
 
-#define NUM_SYNC_REP_WAIT_MODE	3
+#define NUM_SYNC_REP_WAIT_MODE	4
 
 /* syncRepState */
 #define SYNC_REP_NOT_WAITING		0
@@ -36,6 +38,12 @@
 #define SYNC_REP_PRIORITY		0
 #define SYNC_REP_QUORUM		1
 
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
 
 /* called by wal sender */
 extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
 /* GUC infrastructure */
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
 extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index ea7967f6fc5..779371c380d 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -82,6 +82,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * syncReplayLease is the time until which the primary has authorized this
+	 * standby to consider itself available for synchronous_replay mode, or 0
+	 * for not authorized.
+	 */
+	TimestampTz syncReplayLease;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -299,4 +306,6 @@ extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
 
+extern bool WalRcvSyncReplayAvailable(void);
+
 #endif							/* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 4b904779361..0909a64bdad 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
 	WALSNDSTATE_STOPPING
 } WalSndState;
 
+typedef enum SyncReplayState
+{
+	SYNC_REPLAY_UNAVAILABLE = 0,
+	SYNC_REPLAY_JOINING,
+	SYNC_REPLAY_AVAILABLE,
+	SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
 /*
  * Each walsender has a WalSnd struct in shared memory.
  *
@@ -60,6 +68,10 @@ typedef struct WalSnd
 	TimeOffset	flushLag;
 	TimeOffset	applyLag;
 
+	/* Synchronous replay state for this walsender. */
+	SyncReplayState syncReplayState;
+	TimestampTz revokingUntil;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
@@ -101,6 +113,14 @@ typedef struct
 	 */
 	bool		sync_standbys_defined;
 
+	/*
+	 * Until when must commits in synchronous replay stall?  This is used to
+	 * wait for synchronous replay leases to expire when a walsender exits
+	 * uncleanly, and we must stall synchronous replay commits until we're
+	 * sure that the remote server's lease has expired.
+	 */
+	TimestampTz	revokingUntil;
+
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5acb92f30fa..58c48e2d72d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1859,9 +1859,10 @@ pg_stat_replication| SELECT s.pid,
     w.flush_lag,
     w.replay_lag,
     w.sync_priority,
-    w.sync_state
+    w.sync_state,
+    w.sync_replay
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.15.1

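For anyone skimming the patch, the walsender-side state machine in
ProcessStandbyReplyMessage() can be sketched in isolation. This is an
illustration only, not code from the patch: ReplyFacts and
next_sync_replay_state() are hypothetical names, and the four booleans
stand in for the comparisons the patch makes against applyLag,
synchronous_replay_joining_until, synchronous_replay_revoke_msgno and
revokingUntil.

```c
#include <stdbool.h>

/* Same states and ordering as the enum added to walsender_private.h. */
typedef enum
{
	SYNC_REPLAY_UNAVAILABLE = 0,
	SYNC_REPLAY_JOINING,
	SYNC_REPLAY_AVAILABLE,
	SYNC_REPLAY_REVOKING
} SyncReplayState;

/* Hypothetical summary of what one standby reply tells the walsender. */
typedef struct
{
	bool		lag_acceptable;		/* apply lag is below the configured max */
	bool		reached_join_point;	/* applyPtr has passed the join LSN */
	bool		revoke_acked;		/* reply answers the revoke keepalive */
	bool		lease_expired;		/* the last granted lease has run out */
} ReplyFacts;

/* Return the state to move to; the current state means "no change". */
SyncReplayState
next_sync_replay_state(SyncReplayState cur, ReplyFacts f)
{
	switch (cur)
	{
		case SYNC_REPLAY_UNAVAILABLE:
			/* Can we start joining? */
			return f.lag_acceptable ? SYNC_REPLAY_JOINING : cur;
		case SYNC_REPLAY_JOINING:
			/* Falling behind while joining sends us back to the start. */
			if (!f.lag_acceptable)
				return SYNC_REPLAY_UNAVAILABLE;
			return f.reached_join_point ? SYNC_REPLAY_AVAILABLE : cur;
		case SYNC_REPLAY_AVAILABLE:
			/* Too much lag: start revoking the lease. */
			return f.lag_acceptable ? cur : SYNC_REPLAY_REVOKING;
		case SYNC_REPLAY_REVOKING:
			/* Done once the standby acknowledges or the lease runs out. */
			if (f.revoke_acked || f.lease_expired)
				return SYNC_REPLAY_UNAVAILABLE;
			return cur;
	}
	return cur;
}
```

The patch itself uses -1 to mean "no change" and performs the follow-up
actions (recording the join point, sending keepalives, logging) after
releasing the spinlock; the sketch only covers the transition rules.
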
#2Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#1)
Re: Synchronous replay take III

On Thu, Mar 1, 2018 at 2:39 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Moved to next CF based on
/messages/by-id/24193.1519882945@sss.pgh.pa.us.

--
Thomas Munro
http://www.enterprisedb.com

#3Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#2)
Re: Synchronous replay take III

On Thu, Mar 01, 2018 at 06:55:18PM +1300, Thomas Munro wrote:

On Thu, Mar 1, 2018 at 2:39 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Moved to next CF based on
/messages/by-id/24193.1519882945@sss.pgh.pa.us

Thanks, Thomas. This looks like the right move to me.
--
Michael

#4Adam Brusselback
adambrusselback@gmail.com
In reply to: Michael Paquier (#3)
Re: Synchronous replay take III

Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

-Adam

#5Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Adam Brusselback (#4)
1 attachment(s)
Re: Synchronous replay take III

On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
<adambrusselback@gmail.com> wrote:

Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

Rebased. Moved to September. I still need to provide a TAP test and
explain that weirdness reported by Dmitry Dolgov, but I didn't get to
that in time for the bonus early commitfest that we're now in.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Synchronous-replay-mode-for-avoiding-stale-reads--v6.patch (application/octet-stream)
From 824c8afbd80345847b3a0dc648e1a4cadd01cee6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Synchronous replay mode for avoiding stale reads on hot
 standbys.

While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability.  When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.

Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.

To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02.  It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.

Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.

Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
             Joel Jacobson, Heikki Linnakangas, Michael Paquier,
             Robert Haas, Ants Aasma
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  87 +++
 doc/src/sgml/high-availability.sgml           | 139 ++++-
 doc/src/sgml/monitoring.sgml                  |  12 +
 src/backend/access/transam/xact.c             |   2 +-
 src/backend/catalog/system_views.sql          |   3 +-
 src/backend/postmaster/pgstat.c               |   6 +
 src/backend/replication/logical/worker.c      |   2 +
 src/backend/replication/syncrep.c             | 502 +++++++++++++++---
 src/backend/replication/walreceiver.c         |  82 ++-
 src/backend/replication/walreceiverfuncs.c    |  19 +
 src/backend/replication/walsender.c           | 367 +++++++++++--
 src/backend/utils/errcodes.txt                |   1 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/backend/utils/misc/postgresql.conf.sample |  19 +
 src/backend/utils/time/snapmgr.c              |  13 +
 src/bin/pg_basebackup/pg_recvlogical.c        |   6 +-
 src/bin/pg_basebackup/receivelog.c            |   5 +-
 src/include/catalog/pg_proc.dat               |   6 +-
 src/include/pgstat.h                          |   6 +-
 src/include/replication/syncrep.h             |  16 +-
 src/include/replication/walreceiver.h         |   9 +
 src/include/replication/walsender_private.h   |  20 +
 src/test/regress/expected/rules.out           |   5 +-
 23 files changed, 1216 insertions(+), 154 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5b913f00c1d..69543c6427b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3017,6 +3017,36 @@ include_dir 'conf.d'
      across the cluster without problems if that is required.
     </para>
 
+    <sect2 id="runtime-config-replication-all">
+     <title>All Servers</title>
+     <para>
+      These parameters can be set on the primary or any standby.
+     </para>
+     <variablelist>
+      <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+       <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables causal consistency between transactions run on different
+         servers.  A transaction that is run on a standby
+         with <varname>synchronous_replay</varname> set to <literal>on</literal> is
+         guaranteed either to see the effects of all completed transactions
+         run on the primary with the setting on, or to receive an error
+         "standby is not available for synchronous replay".  Note that both
+         transactions involved in a causal dependency (a write on the primary
+         followed by a read on any server which must see the write) must be
+         run with the setting on.  See <xref linkend="synchronous-replay"/> for
+         more details.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>     
+    </sect2>
+
     <sect2 id="runtime-config-replication-sender">
      <title>Sending Servers</title>
 
@@ -3331,6 +3361,63 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><varname>synchronous_replay_max_lag</varname>
+      (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_max_lag</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum replay lag the primary will tolerate from a
+        standby before dropping it from the synchronous replay set.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_lease_time</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the duration of 'leases' sent by the primary server to
+        standbys granting them the right to run synchronous replay queries for
+        a limited time.  This affects the rate at which replacement leases
+        must be sent and the wait time if contact is lost with a standby.
+        This must be set to a value which is at least 4 times the maximum
+        possible difference in system clocks between the primary and standby
+        servers, as described in <xref linkend="synchronous-replay"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous_replay_standby_names">
+      <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>synchronous_replay_standby_names</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies a comma-separated list of standby names that can support
+        <firstterm>synchronous replay</firstterm>, as described in
+        <xref linkend="synchronous-replay"/>.  Follows the same convention
+        as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</literal></link>.
+        The default is <literal>*</literal>, matching all standbys.
+       </para>
+       <para>
+        This setting has no effect if <varname>synchronous_replay_max_lag</varname>
+        is not set.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 934eb9052d9..dfbd8c71907 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1158,11 +1158,12 @@ primary_slot_name = 'node_a_slot'
    </para>
 
    <para>
-    Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
-    cause each commit to wait until the current synchronous standbys report
-    that they have replayed the transaction, making it visible to user
-    queries.  In simple cases, this allows for load balancing with causal
-    consistency.
+    Setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> will cause each commit to wait until
+    the current synchronous standbys report that they have replayed the
+    transaction, making it visible to user queries.  In simple cases, this
+    allows for load balancing with causal consistency.  See also
+    <xref linkend="synchronous-replay"/>.
    </para>
 
    <para>
@@ -1360,6 +1361,122 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="synchronous-replay">
+   <title>Synchronous replay</title>
+   <indexterm>
+    <primary>synchronous replay</primary>
+    <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+    The synchronous replay feature allows read-only queries to run on hot
+    standby servers without exposing stale data to the client, providing a
+    form of causal consistency.  Transactions can run on any standby with the
+    following guarantee about the visibility of preceding transactions: If you
+    set <varname>synchronous_replay</varname> to <literal>on</literal> in any
+    pair of consecutive transactions tx1, tx2 where tx2 begins after tx1
+    successfully returns, then tx2 will either see tx1 or fail with a new
+    error "standby is not available for synchronous replay", no matter which
+    server it runs on.  Although the guarantee is expressed in terms of two
+    individual transactions, the GUC can also be set at session, role or
+    system level to make the guarantee generally, allowing for load balancing
+    of applications that were not designed with load balancing in mind.
+   </para>
+
+   <para>
+    In order to enable the
+    feature, <varname>synchronous_replay_max_lag</varname> must be set to a
+    non-zero value on the primary server.  The
+    GUC <varname>synchronous_replay_standby_names</varname> can be used to
+    limit the set of standbys that can join the dynamic set of synchronous
+    replay standbys by providing a comma-separated list of application names.
+    By default, all standbys are candidates, if the feature is enabled.
+   </para>
+
+   <para>
+    The current set of servers that the primary considers to be available for
+    synchronous replay can be seen in
+    the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</literal></link>
+    view.  Administrators, applications and load balancing middleware can use
+    this view to discover standbys that can currently handle synchronous
+    replay transactions without raising the error.  Since that information is
+    only an instantaneous snapshot, clients should still be prepared for
+    the error to be raised at any time, and consider redirecting transactions
+    to another standby.
+   </para>
+
+   <para>
+    The advantages of the synchronous replay feature over simply
+    setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> are:
+    <orderedlist>
+      <listitem>
+       <para>
+        It provides certainty about exactly which standbys can see a
+        transaction.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It places a configurable limit on how much replay lag (and therefore
+        delay at commit time) the primary tolerates from standbys before it
+        drops them from the dynamic set of standbys it waits for.
+       </para>   
+      </listitem>
+      <listitem>
+       <para>
+        It upholds the synchronous replay guarantee during the transitions that
+        occur when standbys are added to or removed from the set of standbys,
+        including scenarios where contact has been lost between the primary
+        and standbys but the standby is still alive and running client
+        queries.
+       </para>
+      </listitem>
+    </orderedlist>
+   </para>
+
+   <para>
+    The protocol used to uphold the guarantee even in the case of network
+    failure depends on the system clocks of the primary and standby servers
+    being synchronized, with an allowance for a difference up to one quarter
+    of <varname>synchronous_replay_lease_time</varname>.  For example,
+    if <varname>synchronous_replay_lease_time</varname> is set
+    to <literal>5s</literal>, then the clocks must not be more than 1.25
+    seconds apart for the guarantee to be upheld reliably during transitions.
+    The ubiquity of the Network Time Protocol (NTP) on modern operating
+    systems and availability of high quality time servers makes it possible to
+    choose a tolerance significantly higher than the maximum expected clock
+    difference.  An effort is nevertheless made to detect and report
+    misconfigured and faulty systems with clock differences greater than the
+    configured tolerance.
+   </para>
+
+   <note>
+    <para>
+     Current hardware clocks, NTP implementations and public time servers are
+     unlikely to allow the system clocks to differ more than tens or hundreds
+     of milliseconds, and systems synchronized with dedicated local time
+     servers may be considerably more accurate, but you should only consider
+     setting <varname>synchronous_replay_lease_time</varname> below the
+     default of 5 seconds (allowing up to 1.25 seconds of clock difference)
+     after researching your time synchronization infrastructure thoroughly.
+    </para>  
+   </note>
+
+   <note>
+    <para>
+      While similar to synchronous commit in the sense that both involve the
+      primary server waiting for responses from standby servers, the
+      synchronous replay feature is not concerned with avoiding data loss.  A
+      primary configured for synchronous replay will drop all standbys that
+      stop responding or replay too slowly from the dynamic set that it waits
+      for, so you should consider configuring both synchronous replication and
+      synchronous replay if you need data loss avoidance guarantees and causal
+      consistency guarantees for load balancing.
+    </para>
+   </note>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous archiving in standby</title>
 
@@ -1708,7 +1825,17 @@ if (!triggered)
     so there will be a measurable delay between primary and standby. Running the
     same query nearly simultaneously on both primary and standby might therefore
     return differing results. We say that data on the standby is
-    <firstterm>eventually consistent</firstterm> with the primary.  Once the
+    <firstterm>eventually consistent</firstterm> with the primary by default.
+    The data visible to a transaction running on a standby can be
+    made <firstterm>causally consistent</firstterm> with respect to a
+    transaction that has completed on the primary by
+    setting <varname>synchronous_replay</varname> to <literal>on</literal> in
+    both transactions.  For more details,
+    see <xref linkend="synchronous-replay"/>.
+   </para>
+
+   <para>
+    Once the    
     commit record for a transaction is replayed on the standby, the changes
     made by that transaction will be visible to any new snapshots taken on
     the standby.  Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c2adb22dff9..a14811af44b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1908,6 +1908,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        </itemizedlist>
      </entry>
     </row>
+    <row>
+     <entry><structfield>sync_replay</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Synchronous replay state of this standby server.  This field will
+     be non-null only if <varname>synchronous_replay_max_lag</varname> is set.
+     If a standby is in <literal>available</literal> state, then it can
+     currently serve synchronous replay queries.  If it is not replaying fast
+     enough or not responding to keepalive messages, it will be
+     in <literal>unavailable</literal> state, and if it is currently
+     transitioning to availability it will be in <literal>joining</literal>
+     state for a short time.</entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8e6aef332cb..9ec5733a90a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5272,7 +5272,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	 * Check if the caller would like to ask standbys for immediate feedback
 	 * once this commit is applied.
 	 */
-	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
 		xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
 
 	/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8cd8bf40ac4..099f758566d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -734,7 +734,8 @@ CREATE VIEW pg_stat_replication AS
             W.flush_lag,
             W.replay_lag,
             W.sync_priority,
-            W.sync_state
+            W.sync_state,
+            W.sync_replay
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e77c0..5bf0bb7a67e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY:
+			event_name = "SyncReplay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
@@ -3711,6 +3714,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_APPLY_DELAY:
 			event_name = "RecoveryApplyDelay";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+			event_name = "SyncReplayLeaseRevoke";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0d2b795e392..09bb87f07a9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1192,6 +1192,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						TimestampTz timestamp;
 						bool		reply_requested;
 
+						(void) pq_getmsgint64(&s); /* skip messageNumber */
 						end_lsn = pq_getmsgint64(&s);
 						timestamp = pq_getmsgint64(&s);
 						reply_requested = pq_getmsgbyte(&s);
@@ -1399,6 +1400,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, writepos);	/* apply */
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
+	pq_sendint64(reply_message, -1);		/* replyTo */
 
 	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 75d26817192..a595acc750f 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 					 XLogRecPtr *flushPtr,
@@ -128,6 +137,229 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Check if we can stop waiting for synchronous replay.  We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1.  All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2.  All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for.  The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+						  int *waitingFor,
+						  long *stallTimeMillis)
+{
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz stallTime = 0;
+	int i;
+
+	/* Count how many joining/available nodes we are waiting for. */
+	*waitingFor = 0;
+
+	for (i = 0; i < max_wal_senders; ++i)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		if (walsnd->pid != 0)
+		{
+			/*
+			 * We need to hold the spinlock to read LSNs, because we can't be
+			 * sure they can be read atomically.
+			 */
+			SpinLockAcquire(&walsnd->mutex);
+			if (walsnd->pid != 0)
+			{
+				switch (walsnd->syncReplayState)
+				{
+				case SYNC_REPLAY_UNAVAILABLE:
+					/* Nothing to wait for. */
+					break;
+				case SYNC_REPLAY_JOINING:
+				case SYNC_REPLAY_AVAILABLE:
+					/*
+					 * We have to wait until this standby tells us that it has
+					 * replayed the commit record.
+					 */
+					if (walsnd->apply < XactCommitLSN)
+						++*waitingFor;
+					break;
+				case SYNC_REPLAY_REVOKING:
+					/*
+					 * We have to hold up commits until this standby
+					 * acknowledges that its lease was revoked, or we know the
+					 * most recently sent lease has expired anyway, whichever
+					 * comes first.  One way or the other, we don't release
+					 * until this standby has started raising an error for
+					 * synchronous replay transactions.
+					 */
+					if (walsnd->revokingUntil > now)
+					{
+						++*waitingFor;
+						stallTime = Max(stallTime, walsnd->revokingUntil);
+					}
+					break;
+				}
+			}
+			SpinLockRelease(&walsnd->mutex);
+		}
+	}
+
+	/*
+	 * If a walsender has exited uncleanly, then it writes its revoking wait
+	 * time into a shared space before it gives up its WalSnd slot.  So we
+	 * have to wait for that too.
+	 */
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	if (WalSndCtl->revokingUntil > now)
+	{
+		long seconds;
+		int usecs;
+
+		/* Compute how long we have to wait, rounded up to nearest ms. */
+		TimestampDifference(now, WalSndCtl->revokingUntil,
+							&seconds, &usecs);
+		*stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+	}
+	else
+		*stallTimeMillis = 0;
+	LWLockRelease(SyncRepLock);
+
+	/* We are done if we are not waiting for any nodes or stalls. */
+	return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all standbys in "available" and "joining" states to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked.  By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+	long stallTimeMillis;
+	int waitingFor;
+	char *ps_display_buffer = NULL;
+
+	for (;;)
+	{
+		/* Reset latch before checking state. */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Join the queue to be woken up if any synchronous replay
+		 * joining/available standby applies XactCommitLSN or the set of
+		 * synchronous replay standbys changes (if we aren't already in the
+		 * queue).  We don't actually know if we need to wait for any peers to
+		 * reach the target LSN yet, but we have to register just in case
+		 * before checking the walsenders' state to avoid a race condition
+		 * that could occur if we did it after calling
+		 * SyncReplayCommitCanReturn.  (SyncRepWaitForLSN doesn't have
+		 * to do this because it can check the highest-seen LSN in
+		 * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+		 * lock as the queues.  We can't do that here, because there is no
+		 * single highest-seen LSN that is useful.  We must check
+		 * walsnd->apply for all relevant walsenders.  Therefore we must
+		 * register for notifications first, so that we can be notified via
+		 * our latch of any standby applying the LSN we're interested in after
+		 * we check but before we start waiting, or we could wait forever for
+		 * something that has already happened.)
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		if (MyProc->syncRepState != SYNC_REP_WAITING)
+		{
+			MyProc->waitLSN = XactCommitLSN;
+			MyProc->syncRepState = SYNC_REP_WAITING;
+			SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+			Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+		}
+		LWLockRelease(SyncRepLock);
+
+		/* Check if we're done. */
+		if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+									  &stallTimeMillis))
+		{
+			SyncRepCancelWait();
+			break;
+		}
+
+		Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+		/* If we aren't actually waiting for any standbys, leave the queue. */
+		if (waitingFor == 0)
+			SyncRepCancelWait();
+
+		/* Update the ps title. */
+		if (update_process_title)
+		{
+			char buffer[80];
+
+			/* Remember the old value if this is our first update. */
+			if (ps_display_buffer == NULL)
+			{
+				int len;
+				const char *ps_display = get_ps_display(&len);
+
+				ps_display_buffer = palloc(len + 1);
+				memcpy(ps_display_buffer, ps_display, len);
+				ps_display_buffer[len] = '\0';
+			}
+
+			snprintf(buffer, sizeof(buffer),
+					 "waiting for %d peer(s) to apply %X/%X%s",
+					 waitingFor,
+					 (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+					 stallTimeMillis > 0 ? " (revoking)" : "");
+			set_ps_display(buffer, false);
+		}
+
+		/* Check if we need to exit early due to postmaster death etc. */
+		if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+			break;
+
+		/*
+		 * If we are still waiting for peers, then we wait for any joining or
+		 * available peer to reach the LSN (or possibly stop being in one of
+		 * those states or go away).
+		 *
+		 * If not, there must be a non-zero stall time, so we wait for that to
+		 * elapse.
+		 */
+		if (waitingFor > 0)
+			WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+					  WAIT_EVENT_SYNC_REPLAY);
+		else
+			WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+					  stallTimeMillis,
+					  WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+	}
+
+	/* There is no way out of the loop that could leave us in the queue. */
+	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	MyProc->waitLSN = 0;
+
+	/* Restore the ps display. */
+	if (ps_display_buffer != NULL)
+	{
+		set_ps_display(ps_display_buffer, false);
+		pfree(ps_display_buffer);
+	}
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -149,11 +381,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
-	/* Cap the level for anything other than commit to remote flush only. */
-	if (commit)
-		mode = SyncRepWaitMode;
-	else
-		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+	/* Wait for synchronous replay, if configured. */
+	if (synchronous_replay)
+		SyncReplayWaitForLSN(lsn);
 
 	/*
 	 * Fast exit if user has not requested sync replication.
@@ -167,6 +397,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
+	/* Cap the level for anything other than commit to remote flush only. */
+	if (commit)
+		mode = SyncRepWaitMode;
+	else
+		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 	 * set.  See SyncRepUpdateSyncStandbysDefined.
@@ -227,57 +463,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
 			break;
 
-		/*
-		 * If a wait for synchronous replication is pending, we can neither
-		 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
-		 * lead the client to believe that the transaction aborted, which is
-		 * not true: it's already committed locally. The former is no good
-		 * either: the client has requested synchronous replication, and is
-		 * entitled to assume that an acknowledged commit is also replicated,
-		 * which might not be true. So in this case we issue a WARNING (which
-		 * some clients may be able to interpret) and shut off further output.
-		 * We do NOT reset ProcDiePending, so that the process will die after
-		 * the commit is cleaned up.
-		 */
-		if (ProcDiePending)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
+		/* Check if we need to break early due to cancel/shutdown/death. */
+		if (SyncRepCheckForEarlyExit())
 			break;
-		}
-
-		/*
-		 * It's unclear what to do if a query cancel interrupt arrives.  We
-		 * can't actually abort at this point, but ignoring the interrupt
-		 * altogether is not helpful, so we just terminate the wait with a
-		 * suitable warning.
-		 */
-		if (QueryCancelPending)
-		{
-			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			SyncRepCancelWait();
-			break;
-		}
-
-		/*
-		 * If the postmaster dies, we'll probably never get an
-		 * acknowledgement, because all the wal sender processes will exit. So
-		 * just bail out.
-		 */
-		if (!PostmasterIsAlive())
-		{
-			ProcDiePending = true;
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
-			break;
-		}
 
 		/*
 		 * Wait on latch.  Any condition that should wake us up will set the
@@ -399,15 +587,66 @@ SyncRepInitConfig(void)
 	}
 }
 
+/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+	char *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		found = false;
+
+	/* If the feature is disabled, then no. */
+	if (synchronous_replay_max_lag == 0)
+		return false;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(synchronous_replay_standby_names);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		list_free(elemlist);
+		/* GUC machinery will have already complained - no need to do again */
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *standby_name = (char *) lfirst(l);
+
+		if (pg_strcasecmp(standby_name, application_name) == 0 ||
+			pg_strcasecmp(standby_name, "*") == 0)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return found;
+}
+
 /*
  * Update the LSNs on each queue based upon our latest state. This
  * implements a simple policy of first-valid-sync-standby-releases-waiter.
  *
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
  * Other policies are possible, which would change what we do here and
  * perhaps also which information we store as well.
  */
 void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	XLogRecPtr	writePtr;
@@ -421,13 +660,15 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
-	 * potential sync standbys then we have nothing to do. If we are still
-	 * starting up, still running base backup or the current flush position is
-	 * still invalid, then leave quickly also.
+	 * potential sync standbys and not in a state that synchronous_replay waits
+	 * for, then we have nothing to do. If we are still starting up, still
+	 * running base backup or the current flush position is still invalid,
+	 * then leave quickly also.
 	 */
-	if (MyWalSnd->sync_standby_priority == 0 ||
-		MyWalSnd->state < WALSNDSTATE_STREAMING ||
-		XLogRecPtrIsInvalid(MyWalSnd->flush))
+	if (!am_syncreplay_blocker &&
+		(MyWalSnd->sync_standby_priority == 0 ||
+		 MyWalSnd->state < WALSNDSTATE_STREAMING ||
+		 XLogRecPtrIsInvalid(MyWalSnd->flush)))
 	{
 		announce_next_takeover = true;
 		return;
@@ -465,9 +706,10 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
+	 * managing a sync standby or a standby in synchronous replay state that
+	 * blocks then just leave.
 	 */
-	if (!got_recptr || !am_sync)
+	if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -476,24 +718,36 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
-	 * this location.
+	 * this location, for backends waiting for synchronous commit.
 	 */
-	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+	if (got_recptr && am_sync)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+			numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+			numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+			numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+		}
 	}
 
+	/*
+	 * Wake backends that are waiting for synchronous_replay, if this walsender
+	 * manages a standby that is in synchronous replay 'available' or 'joining'
+	 * state.
+	 */
+	if (am_syncreplay_blocker)
+		SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+						 MyWalSnd->apply);
+
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -991,9 +1245,8 @@ SyncRepGetStandbyPriority(void)
  * Must hold SyncRepLock.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
 {
-	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
 	PGPROC	   *thisproc = NULL;
 	int			numprocs = 0;
@@ -1010,7 +1263,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Assume the queue is ordered by LSN
 		 */
-		if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+		if (!all && lsn < proc->waitLSN)
 			return numprocs;
 
 		/*
@@ -1077,7 +1330,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
 		}
 
 		/*
@@ -1128,6 +1381,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
 }
 #endif
 
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+	/*
+	 * If a wait for synchronous replication is pending, we can neither
+	 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
+	 * lead the client to believe that the transaction aborted, which is
+	 * not true: it's already committed locally. The former is no good
+	 * either: the client has requested synchronous replication, and is
+	 * entitled to assume that an acknowledged commit is also replicated,
+	 * which might not be true. So in this case we issue a WARNING (which
+	 * some clients may be able to interpret) and shut off further output.
+	 * We do NOT reset ProcDiePending, so that the process will die after
+	 * the commit is cleaned up.
+	 */
+	if (ProcDiePending)
+	{
+		ereport(WARNING,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * It's unclear what to do if a query cancel interrupt arrives.  We
+	 * can't actually abort at this point, but ignoring the interrupt
+	 * altogether is not helpful, so we just terminate the wait with a
+	 * suitable warning.
+	 */
+	if (QueryCancelPending)
+	{
+		QueryCancelPending = false;
+		ereport(WARNING,
+				(errmsg("canceling wait for synchronous replication due to user request"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * If the postmaster dies, we'll probably never get an
+	 * acknowledgement, because all the wal sender processes will exit. So
+	 * just bail out.
+	 */
+	if (!PostmasterIsAlive())
+	{
+		ProcDiePending = true;
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ===========================================================
  * Synchronous Replication functions executed by any process
@@ -1197,6 +1508,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
 	SyncRepConfig = (SyncRepConfigData *) extra;
 }
 
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return true;
+}
+
 void
 assign_synchronous_commit(int newval, void *extra)
 {
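
The matching rule used by SyncReplayPotentialStandby() above can be sketched standalone as follows. This is only an illustration under simplifying assumptions, not the patch's code: the names sketch_equal_ci() and sketch_name_in_list() are hypothetical, and unlike SplitIdentifierString() the sketch does not handle quoted identifiers. It shows only the case-insensitive comparison and the '*' wildcard.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: case-insensitive compare of a[0..alen) against b. */
static bool
sketch_equal_ci(const char *a, size_t alen, const char *b)
{
	size_t		i;

	if (alen != strlen(b))
		return false;
	for (i = 0; i < alen; i++)
	{
		if (tolower((unsigned char) a[i]) != tolower((unsigned char) b[i]))
			return false;
	}
	return true;
}

/*
 * Hypothetical stand-in for the matching loop: true if 'name' appears in the
 * comma-separated 'list' (ignoring case), or if the list contains '*'.
 * Quoted identifiers are NOT handled here.
 */
static bool
sketch_name_in_list(const char *list, const char *name)
{
	const char *p = list;

	assert(list != NULL && name != NULL);

	while (*p != '\0')
	{
		const char *start;
		const char *end;
		size_t		len;

		/* Skip whitespace before the element. */
		while (isspace((unsigned char) *p))
			p++;
		start = p;

		/* Scan to the end of the element. */
		while (*p != '\0' && *p != ',')
			p++;
		end = p;

		/* Trim whitespace after the element. */
		while (end > start && isspace((unsigned char) end[-1]))
			end--;
		len = (size_t) (end - start);

		if (len == 1 && *start == '*')
			return true;		/* wildcard matches any standby */
		if (len > 0 && sketch_equal_ci(start, len, name))
			return true;

		if (*p == ',')
			p++;
	}
	return false;
}
```

For example, a list of "replica1, Replica2" would match an application_name of "replica2", while a list of "*" would match any name.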
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 987bb84683c..7d2cbee0331 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -58,6 +58,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -140,9 +141,10 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
 static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+								  TimestampTz *syncReplayLease);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -486,7 +488,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, -1);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -531,7 +533,7 @@ WalReceiverMain(void)
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, -1);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -589,7 +591,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, -1);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -894,6 +896,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 	XLogRecPtr	walEnd;
 	TimestampTz sendTime;
 	bool		replyRequested;
+	TimestampTz syncReplayLease;
+	int64		messageNumber;
 
 	resetStringInfo(&incoming_message);
 
@@ -913,7 +917,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				dataStart = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, NULL);
 
 				buf += hdrlen;
 				len -= hdrlen;
@@ -923,7 +927,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 		case 'k':				/* Keepalive */
 			{
 				/* copy message to StringInfo */
-				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+					sizeof(char) + sizeof(int64);
 				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -931,15 +936,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
 
 				/* read the fields */
+				messageNumber = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
 				replyRequested = pq_getmsgbyte(&incoming_message);
+				syncReplayLease = pq_getmsgint64(&incoming_message);
 
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, messageNumber);
 				break;
 			}
 		default:
@@ -1102,7 +1109,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, -1);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1120,9 +1127,12 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbearts, when approaching
  * wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should include the message number, otherwise -1.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
@@ -1169,6 +1179,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+	pq_sendint64(&reply_message, replyTo);
 
 	/* Send it */
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1301,15 +1312,56 @@ XLogWalRcvSendHSFeedback(bool immed)
  * Update shared memory status upon receiving a message from primary.
  *
  * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary.  'syncReplayLease' points to the time until
+ * which the primary promises that this standby can safely claim to be causally
+ * consistent, points to 0 if it cannot, or is NULL for no change.
  */
 static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+					  TimestampTz *syncReplayLease)
 {
 	WalRcvData *walrcv = WalRcv;
 
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
 
+	/* Sanity check for the syncReplayLease time. */
+	if (syncReplayLease != NULL && *syncReplayLease != 0)
+	{
+		/*
+		 * Deduce max_clock_skew from the syncReplayLease and sendTime since
+		 * we don't have access to the primary's GUC.  The primary already
+		 * subtracted 25% from synchronous_replay_lease_time to represent
+		 * max_clock_skew, so we have 75%.  A third of that will give us 25%.
+		 */
+		int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+		int64 max_clock_skew = diffMillis / 3;
+		if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+												   max_clock_skew))
+		{
+			/*
+			 * The primary's clock is more than max_clock_skew + network
+			 * latency ahead of the standby's clock.  (If the primary's clock
+			 * is more than max_clock_skew ahead of the standby's clock, but
+			 * by less than the network latency, then there isn't much we can
+			 * do to detect that; but it still seems useful to have this basic
+			 * sanity check for wildly misconfigured servers.)
+			 */
+			ereport(LOG,
+					(errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+					 errhint("Check your servers' NTP configuration or equivalent.")));
+
+			syncReplayLease = NULL;
+		}
+		/*
+		 * We could also try to detect cases where sendTime is more than
+		 * max_clock_skew in the past according to the standby's clock, but
+		 * that is indistinguishable from network latency/buffering, so we
+		 * could produce misleading error messages; if we do nothing, the
+		 * consequence is 'standby is not available for synchronous replay'
+		 * errors which should cause the user to investigate.
+		 */
+	}
+
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
@@ -1317,6 +1369,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+	if (syncReplayLease != NULL)
+		walrcv->syncReplayLease = *syncReplayLease;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (log_min_messages <= DEBUG2)
@@ -1354,7 +1408,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
  */
 void
 WalRcvForceReply(void)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 67b1a074cce..600f974668c 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
 #include "replication/walreceiver.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/guc.h"
 #include "utils/timestamp.h"
 
 WalRcvData *WalRcv = NULL;
@@ -376,3 +377,21 @@ GetReplicationTransferLatency(void)
 
 	return ms;
 }
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+	WalRcvData *walrcv = WalRcv;
+	TimestampTz now = GetCurrentTimestamp();
+	bool result;
+
+	SpinLockAcquire(&walrcv->mutex);
+	result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+	SpinLockRelease(&walrcv->mutex);
+
+	return result;
+}
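
The lease arithmetic in the walreceiver changes above can be sketched standalone as follows. This is a minimal illustration under stated assumptions, not the patch's code: all sketch_* names are hypothetical, timestamps are plain microsecond counters standing in for TimestampTz, and the primary is assumed to stamp each lease with its send time plus 75% of synchronous_replay_lease_time, as the comments in the patch describe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Microseconds, standing in for PostgreSQL's TimestampTz. */
typedef int64_t sketch_ts;

/*
 * Primary side (assumed): the lease expires at send time plus the lease
 * time, minus 25% of the lease time reserved for clock skew.
 */
static sketch_ts
sketch_lease_expiry(sketch_ts send_time, int64_t lease_time_ms)
{
	assert(lease_time_ms >= 0);
	return send_time + (lease_time_ms * 1000) * 3 / 4;
}

/*
 * Standby side: the standby can't see the primary's GUC, so it recovers the
 * 25% skew allowance (in ms) as one third of the 75% that was sent.
 */
static int64_t
sketch_max_clock_skew_ms(sketch_ts lease, sketch_ts send_time)
{
	int64_t		diff_ms = (lease - send_time) / 1000;

	return diff_ms / 3;
}

/* Sanity check: is the primary's clock implausibly far ahead of ours? */
static bool
sketch_primary_clock_too_far_ahead(sketch_ts send_time,
								   sketch_ts receipt_time,
								   int64_t max_skew_ms)
{
	return send_time > receipt_time + max_skew_ms * 1000;
}

/* A lease of 0 means "no lease"; otherwise it is valid until it expires. */
static bool
sketch_lease_valid(sketch_ts lease, sketch_ts now)
{
	return lease != 0 && now <= lease;
}
```

With a lease time of 5000 ms, the lease covers 3750 ms past the send time, and the standby deduces a skew allowance of 1250 ms, which is 25% of the configured lease time.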
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e47ddca6bca..f5b27be6dba 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -168,9 +168,23 @@ static StringInfoData tmpbuf;
  */
 static TimestampTz last_reply_timestamp = 0;
 
+static TimestampTz last_keepalive_timestamp = 0;
+
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
 
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
 /*
  * While streaming WAL in Copy mode, streamingDoneSending is set to true
  * after we have sent CopyDone. We should not send any more CopyData messages
@@ -240,7 +254,7 @@ static void ProcessStandbyMessage(void);
 static void ProcessStandbyReplyMessage(void);
 static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(TimestampTz now);
 static void WalSndCheckTimeOut(TimestampTz now);
 static long WalSndComputeSleeptime(TimestampTz now);
@@ -281,6 +295,61 @@ InitWalSender(void)
 	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
+/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * We've lost contact with the standby, but it may still be alive.  We
+		 * can't let any committing synchronous_replay transactions return
+		 * control until we've stalled for long enough for a zombie standby to
+		 * start raising errors because its lease has expired.  Because our
+		 * WalSnd slot is going away, we need to use the shared
+		 * WalSndCtl->revokingUntil variable.
+		 */
+		elog(LOG,
+			 "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+			 application_name);
+
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+									   synchronous_replay_last_lease);
+		LWLockRelease(SyncRepLock);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * The standby is shutting down, so it won't be running any more
+		 * transactions.  It is therefore safe to stop waiting for it without
+		 * any kind of lease revocation protocol.
+		 */
+		elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
 /*
  * Clean up after an error.
  *
@@ -309,7 +378,10 @@ WalSndErrorCleanup(void)
 	replication_active = false;
 
 	if (got_STOPPING || got_SIGUSR2)
+	{
+		PrepareUncleanExit();
 		proc_exit(0);
+	}
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
@@ -321,6 +393,8 @@ WalSndErrorCleanup(void)
 static void
 WalSndShutdown(void)
 {
+	PrepareUncleanExit();
+
 	/*
 	 * Reset whereToSendOutput to prevent ereport from attempting to send any
 	 * more messages to the standby.
@@ -1602,6 +1676,7 @@ ProcessRepliesIfAny(void)
 		if (r < 0)
 		{
 			/* unexpected error or EOF */
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1618,6 +1693,7 @@ ProcessRepliesIfAny(void)
 		resetStringInfo(&reply_message);
 		if (pq_getmessage(&reply_message, 0))
 		{
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1667,6 +1743,7 @@ ProcessRepliesIfAny(void)
 				 * 'X' means that the standby is closing down the socket.
 				 */
 			case 'X':
+				PrepareCleanExit();
 				proc_exit(0);
 
 			default:
@@ -1764,9 +1841,11 @@ ProcessStandbyReplyMessage(void)
 				flushLag,
 				applyLag;
 	bool		clearLagTimes;
+	int64		replyTo;
 	TimestampTz now;
 
 	static bool fullyAppliedLastTime = false;
+	static TimestampTz fullyAppliedSince = 0;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1774,6 +1853,7 @@ ProcessStandbyReplyMessage(void)
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);	/* sendTime; not used ATM */
 	replyRequested = pq_getmsgbyte(&reply_message);
+	replyTo = pq_getmsgint64(&reply_message);
 
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1788,17 +1868,17 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL for at least
+	 * wal_receiver_status_interval, then let's clear the lag times that were
+	 * measured when it last wrote/flushed/applied a WAL record.  This way we
+	 * avoid displaying stale lag data until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
 	if (applyPtr == sentPtr)
 	{
-		if (fullyAppliedLastTime)
+		if (!fullyAppliedLastTime)
+			fullyAppliedSince = now;
+		else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
 			clearLagTimes = true;
 		fullyAppliedLastTime = true;
 	}
@@ -1814,8 +1894,53 @@ ProcessStandbyReplyMessage(void)
 	 * standby.
 	 */
 	{
+		int			next_sr_state = -1;
 		WalSnd	   *walsnd = MyWalSnd;
 
+		/* Handle synchronous replay state machine. */
+		if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+		{
+			bool replay_lag_acceptable;
+
+			/* Check if the lag is acceptable (includes -1 for caught up). */
+			if (applyLag < synchronous_replay_max_lag * 1000)
+				replay_lag_acceptable = true;
+			else
+				replay_lag_acceptable = false;
+
+			/* Figure out if the state needs to change. */
+			switch (walsnd->syncReplayState)
+			{
+			case SYNC_REPLAY_UNAVAILABLE:
+				/* Can we join? */
+				if (replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_JOINING;
+				break;
+			case SYNC_REPLAY_JOINING:
+				/* Are we still applying fast enough? */
+				if (replay_lag_acceptable)
+				{
+					/* Have we reached the join point yet? */
+					if (applyPtr >= synchronous_replay_joining_until)
+						next_sr_state = SYNC_REPLAY_AVAILABLE;
+				}
+				else
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			case SYNC_REPLAY_AVAILABLE:
+				/* Are we still applying fast enough? */
+				if (!replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_REVOKING;
+				break;
+			case SYNC_REPLAY_REVOKING:
+				/* Has the revocation been acknowledged or timed out? */
+				if (replyTo == synchronous_replay_revoke_msgno ||
+					now >= walsnd->revokingUntil)
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			}
+		}
+
 		SpinLockAcquire(&walsnd->mutex);
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
@@ -1826,11 +1951,55 @@ ProcessStandbyReplyMessage(void)
 			walsnd->flushLag = flushLag;
 		if (applyLag != -1 || clearLagTimes)
 			walsnd->applyLag = applyLag;
+		if (next_sr_state != -1)
+			walsnd->syncReplayState = next_sr_state;
+		if (next_sr_state == SYNC_REPLAY_REVOKING)
+			walsnd->revokingUntil = synchronous_replay_last_lease;
 		SpinLockRelease(&walsnd->mutex);
+
+		/*
+		 * Post shmem-update actions for synchronous replay state transitions.
+		 */
+		switch (next_sr_state)
+		{
+		case SYNC_REPLAY_JOINING:
+			/*
+			 * Now that we've started waiting for this standby, we need to
+			 * make sure that everything flushed before now has been applied
+			 * before we move to available and issue a lease.
+			 */
+			synchronous_replay_joining_until = GetFlushRecPtr();
+			ereport(LOG,
+					(errmsg("standby \"%s\" joining synchronous replay set...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_AVAILABLE:
+			/* Issue a new lease to the standby. */
+			WalSndKeepalive(false);
+			ereport(LOG,
+					(errmsg("standby \"%s\" is available for synchronous replay",
+							application_name)));
+			break;
+		case SYNC_REPLAY_REVOKING:
+			/* Revoke the standby's lease, and note the message number. */
+			synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+			ereport(LOG,
+					(errmsg("revoking synchronous replay lease for standby \"%s\"...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_UNAVAILABLE:
+			ereport(LOG,
+					(errmsg("standby \"%s\" is no longer available for synchronous replay",
+							application_name)));
+			break;
+		default:
+			/* No change. */
+			break;
+		}
 	}
 
 	if (!am_cascading_walsender)
-		SyncRepReleaseWaiters();
+		SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
 
 	/*
 	 * Advance our local xmin horizon when the client confirmed a flush.
@@ -2020,33 +2189,52 @@ ProcessStandbyHSFeedbackMessage(void)
  * If wal_sender_timeout is enabled we want to wake up in time to send
  * keepalives and to abort the connection if wal_sender_timeout has been
  * reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
  */
 static long
 WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;	/* 10 s */
 
-	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+	if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+		am_potential_synchronous_replay_standby)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
 		int			microsec_to_timeout;
 
-		/*
-		 * At the latest stop sleeping once wal_sender_timeout has been
-		 * reached.
-		 */
-		wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-												  wal_sender_timeout);
-
-		/*
-		 * If no ping has been sent yet, wakeup when it's time to do so.
-		 * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
-		 * the timeout passed without a response.
-		 */
-		if (!waiting_for_ping_response)
+		if (am_potential_synchronous_replay_standby)
+		{
+			/*
+			 * We need to keep replacing leases before they expire.  We'll do
+			 * that halfway through the lease time according to our clock, to
+			 * allow for the standby's clock to be ahead of the primary's by
+			 * 25% of synchronous_replay_lease_time.
+			 */
+			wakeup_time =
+				TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+											synchronous_replay_lease_time / 2);
+		}
+		else
+		{
+			/*
+			 * At the latest stop sleeping once wal_sender_timeout has been
+			 * reached.
+			 */
 			wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-													  wal_sender_timeout / 2);
+													  wal_sender_timeout);
+
+			/*
+			 * If no ping has been sent yet, wakeup when it's time to do so.
+			 * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+			 * half of the timeout passed without a response.
+			 */
+			if (!waiting_for_ping_response)
+				wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+														  wal_sender_timeout / 2);
+		}
 
 		/* Compute relative time until wakeup. */
 		TimestampDifference(now, wakeup_time,
@@ -2062,20 +2250,33 @@ WalSndComputeSleeptime(TimestampTz now)
 /*
  * Check whether there have been responses by the client within
  * wal_sender_timeout and shutdown if not.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
  */
 static void
 WalSndCheckTimeOut(TimestampTz now)
 {
 	TimestampTz timeout;
+	int allowed_time;
 
 	/* don't bail out if we're doing something that doesn't require timeouts */
 	if (last_reply_timestamp <= 0)
 		return;
 
-	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
-										  wal_sender_timeout);
+	/*
+	 * If synchronous replay support is configured, we use
+	 * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+	 * the time before an unresponsive synchronous replay standby is dropped.
+	 */
+	if (am_potential_synchronous_replay_standby)
+		allowed_time = synchronous_replay_lease_time;
+	else
+		allowed_time = wal_sender_timeout;
 
-	if (wal_sender_timeout > 0 && now >= timeout)
+	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+										  allowed_time);
+	if (allowed_time > 0 && now >= timeout)
 	{
 		/*
 		 * Since typically expiration of replication timeout means
@@ -2100,6 +2301,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
+	/* Check if we are managing a potential synchronous replay standby. */
+	am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
@@ -2265,6 +2469,7 @@ InitWalSenderSlot(void)
 			walsnd->flushLag = -1;
 			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
+			walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
@@ -3147,6 +3352,27 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+	switch (state)
+	{
+	case SYNC_REPLAY_UNAVAILABLE:
+		return "unavailable";
+	case SYNC_REPLAY_JOINING:
+		return "joining";
+	case SYNC_REPLAY_AVAILABLE:
+		return "available";
+	case SYNC_REPLAY_REVOKING:
+		return "revoking";
+	}
+	return "UNKNOWN";
+}
+
 static Interval *
 offset_to_interval(TimeOffset offset)
 {
@@ -3166,7 +3392,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	11
+#define PG_STAT_GET_WAL_SENDERS_COLS	12
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3220,6 +3446,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			priority;
 		int			pid;
 		WalSndState state;
+		SyncReplayState syncReplayState;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3232,6 +3459,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		pid = walsnd->pid;
 		sentPtr = walsnd->sentPtr;
 		state = walsnd->state;
+		syncReplayState = walsnd->syncReplayState;
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
@@ -3315,6 +3543,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[10] = CStringGetTextDatum("potential");
+
+			values[11] =
+				CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3330,21 +3561,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
   * This function is used to send a keepalive message to standby.
   * If requestReply is set, sets a flag in the message requesting the standby
   * to send a message back to us, for heartbeat purposes.
+  * Return the serial number of the message that was sent.
   */
-static void
+static int64
 WalSndKeepalive(bool requestReply)
 {
+	TimestampTz synchronous_replay_lease;
+	TimestampTz now;
+
+	static int64 message_number = 0;
+
 	elog(DEBUG2, "sending replication keepalive");
 
+	/* Grant a synchronous replay lease if appropriate. */
+	now = GetCurrentTimestamp();
+	if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+	{
+		/* No lease granted, and any earlier lease is revoked. */
+		synchronous_replay_lease = 0;
+	}
+	else
+	{
+		/*
+		 * Since this timestamp is being sent to the standby where it will be
+		 * compared against a time generated by the standby's system clock, we
+		 * must consider clock skew.  We use 25% of the lease time as max
+		 * clock skew, and we subtract that from the time we send with the
+		 * following reasoning:
+		 *
+		 * 1.  If the standby's clock is slow (ie behind the primary's) by up
+		 * to that much, then subtracting this amount makes sure that the
+		 * lease doesn't survive past that time according to the primary's
+		 * clock.
+		 *
+		 * 2.  If the standby's clock is fast (ie ahead of the primary's) by
+		 * up to that much, then by subtracting this amount there won't be any
+		 * gaps between leases, since leases are reissued every time 50% of
+		 * the lease time elapses (see WalSndKeepaliveIfNecessary and
+		 * WalSndComputeSleepTime).
+		 */
+		int max_clock_skew = synchronous_replay_lease_time / 4;
+
+		/* Compute and remember the expiry time of the lease we're granting. */
+		synchronous_replay_last_lease =
+			TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+		/* Adjust the version we send for clock skew. */
+		synchronous_replay_lease =
+			TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+										-max_clock_skew);
+	}
+
 	/* construct the message... */
 	resetStringInfo(&output_message);
 	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, ++message_number);
 	pq_sendint64(&output_message, sentPtr);
-	pq_sendint64(&output_message, GetCurrentTimestamp());
+	pq_sendint64(&output_message, now);
 	pq_sendbyte(&output_message, requestReply ? 1 : 0);
+	pq_sendint64(&output_message, synchronous_replay_lease);
 
 	/* ... and send it wrapped in CopyData */
 	pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+	return message_number;
 }
 
 /*
@@ -3359,23 +3638,35 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
-	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
-		return;
-
-	if (waiting_for_ping_response)
-		return;
+	if (!am_potential_synchronous_replay_standby)
+	{
+		if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+			return;
+		if (waiting_for_ping_response)
+			return;
+	}
 
 	/*
 	 * If half of wal_sender_timeout has lapsed without receiving any reply
 	 * from the standby, send a keep-alive message to the standby requesting
 	 * an immediate reply.
+	 *
+	 * If synchronous replay has been configured, use
+	 * synchronous_replay_lease_time to control keepalive intervals rather
+	 * than wal_sender_timeout, so that we can keep replacing leases at the
+	 * right frequency.
 	 */
-	ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-											wal_sender_timeout / 2);
+	if (am_potential_synchronous_replay_standby)
+		ping_time = TimestampTzPlusMilliseconds(last_keepalive_timestamp,
+												synchronous_replay_lease_time / 2);
+	else
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												wal_sender_timeout / 2);
 	if (now >= ping_time)
 	{
 		WalSndKeepalive(true);
 		waiting_for_ping_response = true;
+		last_keepalive_timestamp = now;
 
 		/* Try to flush pending output to the client */
 		if (pq_flush_if_writable() != 0)
@@ -3415,7 +3706,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 */
 	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
 	buffer_full = false;
-	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
 	{
 		if (new_write_head == LagTracker.read_heads[i])
 			buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index e2976600e84..4c3ab824ca6 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -308,6 +308,7 @@ Section: Class 40 - Transaction Rollback
 40001    E    ERRCODE_T_R_SERIALIZATION_FAILURE                              serialization_failure
 40003    E    ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN                       statement_completion_unknown
 40P01    E    ERRCODE_T_R_DEADLOCK_DETECTED                                  deadlock_detected
+40P02    E    ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE                   synchronous_replay_not_available
 
 Section: Class 42 - Syntax Error or Access Rule Violation
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b05fb209bba..4910b7b8bc3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1716,6 +1716,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+		 gettext_noop("Enables synchronous replay."),
+		 NULL
+		},
+		&synchronous_replay,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
@@ -3056,6 +3066,28 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_max_lag,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_lease_time,
+		5000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3777,6 +3809,17 @@ static struct config_string ConfigureNamesString[] =
 		check_synchronous_standby_names, assign_synchronous_standby_names, NULL
 	},
 
+	{
+		{"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("List of names of potential synchronous replay standbys."),
+			NULL,
+			GUC_LIST_INPUT
+		},
+		&synchronous_replay_standby_names,
+		"*",
+		check_synchronous_replay_standby_names, NULL, NULL
+	},
+
 	{
 		{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
 			gettext_noop("Sets default text search configuration."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9e39baf4668..bbfed85a7e6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -256,6 +256,17 @@
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
+#synchronous_replay_max_lag = 0s	# maximum replication delay to tolerate from
+					# standbys before dropping them from the synchronous
+					# replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s		# how long individual leases granted to
+					# synchronous replay standbys should last; should be 4 times
+					# the max possible clock skew
+
+#synchronous_replay_standby_names = '*'	# standby servers that can join the
+					# synchronous replay set; '*' = all
+
 # - Standby Servers -
 
 # These settings are ignored on a master server.
@@ -286,6 +297,14 @@
 					# (change requires restart)
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
 
+# - All Servers -
+
+#synchronous_replay = off			# "on" in any pair of consecutive
+					# transactions guarantees that the second
+					# can see the first (even if the second
+					# is run on a standby), or will raise an
+					# error to report that the standby is
+					# unavailable for synchronous replay
 
 #------------------------------------------------------------------------------
 # QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 4b45d3cccd2..10ca64af47b 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
 #include "catalog/catalog.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -331,6 +333,17 @@ GetTransactionSnapshot(void)
 			elog(ERROR,
 				 "cannot take query snapshot during a parallel operation");
 
+		/*
+		 * In synchronous_replay mode on a standby, check if we have definitely
+		 * applied WAL for any COMMIT that returned successfully on the
+		 * primary.
+		 */
+		if (synchronous_replay && RecoveryInProgress() &&
+			!WalRcvSyncReplayAvailable())
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+					 errmsg("standby is not available for synchronous replay")));
+
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
 		 * end of xact regardless of what the caller does with it, so we must
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index ef85c9af4c7..cfe651c636f 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -118,7 +118,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	/*
@@ -151,6 +151,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	startpos = output_written_lsn;
 	last_written_lsn = output_written_lsn;
@@ -470,6 +472,8 @@ StreamLogicalLog(void)
 			 * rest.
 			 */
 			pos = 1;			/* skip msgtype 'k' */
+			pos += 8;			/* skip messageNumber */
+
 			walEnd = fe_recvint64(&copybuf[pos]);
 			output_written_lsn = Max(walEnd, output_written_lsn);
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 10768786301..a801224ad94 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -328,7 +328,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -346,6 +346,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
 	{
@@ -1016,6 +1018,7 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 	 * check if the server requested a reply, and ignore the rest.
 	 */
 	pos = 1;					/* skip msgtype 'k' */
+	pos += 8;					/* skip messageNumber */
 	pos += 8;					/* skip walEnd */
 	pos += 8;					/* skip sendTime */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 40d54ed0302..f746cd27e42 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5181,9 +5181,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,text}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239bf..635b58e79a1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_SYNC_REPLAY
 } WaitEventIPC;
 
 /* ----------
@@ -845,7 +846,8 @@ typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
 	WAIT_EVENT_PG_SLEEP,
-	WAIT_EVENT_RECOVERY_APPLY_DELAY
+	WAIT_EVENT_RECOVERY_APPLY_DELAY,
+	WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index bc43b4e1090..6a5bfcbb9ce 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
 
 #include "access/xlogdefs.h"
 #include "utils/guc.h"
+#include "utils/timestamp.h"
 
 #define SyncRepRequested() \
 	(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
 #define SYNC_REP_WAIT_WRITE		0
 #define SYNC_REP_WAIT_FLUSH		1
 #define SYNC_REP_WAIT_APPLY		2
+#define SYNC_REP_WAIT_SYNC_REPLAY	3
 
-#define NUM_SYNC_REP_WAIT_MODE	3
+#define NUM_SYNC_REP_WAIT_MODE	4
 
 /* syncRepState */
 #define SYNC_REP_NOT_WAITING		0
@@ -36,6 +38,12 @@
 #define SYNC_REP_PRIORITY		0
 #define SYNC_REP_QUORUM		1
 
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
 
 /* called by wal sender */
 extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
 /* GUC infrastructure */
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
 extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 5913b580c2b..58709e2e9be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * syncReplayLease is the time until which the primary has authorized this
+	 * standby to consider itself available for synchronous_replay mode, or 0
+	 * for not authorized.
+	 */
+	TimestampTz syncReplayLease;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -313,4 +320,6 @@ extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
 
+extern bool WalRcvSyncReplayAvailable(void);
+
 #endif							/* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 4b904779361..0909a64bdad 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
 	WALSNDSTATE_STOPPING
 } WalSndState;
 
+typedef enum SyncReplayState
+{
+	SYNC_REPLAY_UNAVAILABLE = 0,
+	SYNC_REPLAY_JOINING,
+	SYNC_REPLAY_AVAILABLE,
+	SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
 /*
  * Each walsender has a WalSnd struct in shared memory.
  *
@@ -60,6 +68,10 @@ typedef struct WalSnd
 	TimeOffset	flushLag;
 	TimeOffset	applyLag;
 
+	/* Synchronous replay state for this walsender. */
+	SyncReplayState syncReplayState;
+	TimestampTz revokingUntil;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
@@ -101,6 +113,14 @@ typedef struct
 	 */
 	bool		sync_standbys_defined;
 
+	/*
+	 * Until when must commits in synchronous replay stall?  This is used to
+	 * wait for synchronous replay leases to expire when a walsender exits
+	 * uncleanly, and we must stall synchronous replay commits until we're
+	 * sure that the remote server's lease has expired.
+	 */
+	TimestampTz	revokingUntil;
+
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ae0cd253d5f..51569d1b8fa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1861,9 +1861,10 @@ pg_stat_replication| SELECT s.pid,
     w.flush_lag,
     w.replay_lag,
     w.sync_priority,
-    w.sync_state
+    w.sync_state,
+    w.sync_replay
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.17.0
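
The clock-skew comment in WalSndKeepalive() above can be checked with a little arithmetic. What follows is a standalone sketch, not code from the patch: the names timestamp_us, grant_lease and standby_lease_valid are made up for illustration, and the standby-side comparison is only an assumption about how WalRcvSyncReplayAvailable() uses the lease, since that function's body is not part of this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Microseconds, like PostgreSQL's TimestampTz (illustrative typedef). */
typedef int64_t timestamp_us;

/* Same idea as TimestampTzPlusMilliseconds(): add milliseconds to a timestamp. */
static timestamp_us
plus_ms(timestamp_us t, int64_t ms)
{
	return t + ms * 1000;
}

typedef struct
{
	timestamp_us primary_expiry;	/* what the primary remembers */
	timestamp_us sent_to_standby;	/* what goes into the keepalive */
} lease;

/* Mirror of the lease computation in WalSndKeepalive(). */
static lease
grant_lease(timestamp_us primary_now, int lease_time_ms)
{
	int			max_clock_skew_ms = lease_time_ms / 4;
	lease		l;

	l.primary_expiry = plus_ms(primary_now, lease_time_ms);
	l.sent_to_standby = plus_ms(l.primary_expiry, -max_clock_skew_ms);
	return l;
}

/*
 * ASSUMPTION: the standby treats a lease of 0 as "not authorized" and
 * otherwise compares the lease against its own clock.
 */
static bool
standby_lease_valid(timestamp_us standby_now, timestamp_us lease_value)
{
	return lease_value != 0 && standby_now < lease_value;
}
```

With a lease time of 5000 ms granted at primary time 0, the primary remembers an expiry of 5,000,000 microseconds and sends 3,750,000. A standby whose clock is 1.25 s slow stops trusting the lease exactly when the primary's clock reaches 5,000,000 (case 1 in the comment). A standby whose clock is 1.25 s fast stops trusting it when the primary's clock reaches 2,500,000, which is when the replacement lease is due (case 2).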

#6Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#5)
1 attachment(s)
Re: Synchronous replay take III

On Mon, Jul 2, 2018 at 12:39 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
<adambrusselback@gmail.com> wrote:

Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

Rebased.

Rebased.

--
Thomas Munro
http://www.enterprisedb.com
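
The commit message in the attached patch says that clients need some intelligence to route read-only transactions and to avoid servers that have recently raised error 40P02. A minimal sketch of that bookkeeping might look like the following. Everything here is hypothetical (the standby struct, pick_standby, note_error, the back-off interval and the choice to fall back to the primary); only the SQLSTATE value 40P02 comes from the patch. In a libpq client the SQLSTATE string would typically come from PQresultErrorField(res, PG_DIAG_SQLSTATE).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* SQLSTATE added by the patch in errcodes.txt. */
#define SQLSTATE_SYNC_REPLAY_NOT_AVAILABLE "40P02"

/* Hypothetical per-standby routing state kept by the client or middleware. */
typedef struct
{
	const char *name;			/* label only, e.g. a connection string */
	long		avoid_until;	/* time (seconds) before which we skip it */
} standby;

static bool
is_sync_replay_unavailable(const char *sqlstate)
{
	return sqlstate != NULL &&
		strcmp(sqlstate, SQLSTATE_SYNC_REPLAY_NOT_AVAILABLE) == 0;
}

/*
 * Return the index of the first standby that is not being avoided at time
 * 'now', or -1 if none is usable (the caller could then use the primary).
 */
static int
pick_standby(const standby *pool, size_t n, long now)
{
	for (size_t i = 0; i < n; i++)
	{
		if (pool[i].avoid_until <= now)
			return (int) i;
	}
	return -1;
}

/* Remember to avoid a standby for 'backoff' seconds after it raises 40P02. */
static void
note_error(standby *s, const char *sqlstate, long now, long backoff)
{
	if (is_sync_replay_unavailable(sqlstate))
		s->avoid_until = now + backoff;
}
```

A caller would run the read-only transaction on pool[pick_standby(...)], call note_error() with the SQLSTATE of any failure, and retry elsewhere. Other errors, such as 40001, leave the routing state untouched.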

Attachments:

0001-Synchronous-replay-mode-for-avoiding-stale-reads--v7.patch (application/octet-stream)
From e7204e43938e8e40d4deaf019ae85cbf42ba787f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Synchronous replay mode for avoiding stale reads on hot
 standbys.

While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability.  When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.

Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.

To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02.  It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.

Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.

Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
             Joel Jacobson, Heikki Linnakangas, Michael Paquier,
             Robert Haas, Ants Aasma
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  87 +++
 doc/src/sgml/high-availability.sgml           | 139 ++++-
 doc/src/sgml/monitoring.sgml                  |  12 +
 src/backend/access/transam/xact.c             |   2 +-
 src/backend/catalog/system_views.sql          |   3 +-
 src/backend/postmaster/pgstat.c               |   6 +
 src/backend/replication/logical/worker.c      |   2 +
 src/backend/replication/syncrep.c             | 502 +++++++++++++++---
 src/backend/replication/walreceiver.c         |  82 ++-
 src/backend/replication/walreceiverfuncs.c    |  19 +
 src/backend/replication/walsender.c           | 364 +++++++++++--
 src/backend/utils/errcodes.txt                |   1 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/backend/utils/misc/postgresql.conf.sample |  19 +
 src/backend/utils/time/snapmgr.c              |  13 +
 src/bin/pg_basebackup/pg_recvlogical.c        |   6 +-
 src/bin/pg_basebackup/receivelog.c            |   5 +-
 src/include/catalog/pg_proc.dat               |   6 +-
 src/include/pgstat.h                          |   6 +-
 src/include/replication/syncrep.h             |  16 +-
 src/include/replication/walreceiver.h         |   9 +
 src/include/replication/walsender_private.h   |  20 +
 src/test/regress/expected/rules.out           |   5 +-
 23 files changed, 1213 insertions(+), 154 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f11b8f724cd..67eac514ff5 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3017,6 +3017,36 @@ include_dir 'conf.d'
      across the cluster without problems if that is required.
     </para>
 
+    <sect2 id="runtime-config-replication-all">
+     <title>All Servers</title>
+     <para>
+      These parameters can be set on the primary or any standby.
+     </para>
+     <variablelist>
+      <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+       <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables causal consistency between transactions run on different
+         servers.  A transaction that is run on a standby
+         with <varname>synchronous_replay</varname> set to <literal>on</literal> is
+         guaranteed either to see the effects of all completed transactions
+         run on the primary with the setting on, or to receive an error
+         "standby is not available for synchronous replay".  Note that both
+         transactions involved in a causal dependency (a write on the primary
+         followed by a read on any server which must see the write) must be
+         run with the setting on.  See <xref linkend="synchronous-replay"/> for
+         more details.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </sect2>
+
     <sect2 id="runtime-config-replication-sender">
      <title>Sending Servers</title>
 
@@ -3335,6 +3365,63 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><varname>synchronous_replay_max_lag</varname>
+      (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_max_lag</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum replay lag the primary will tolerate from a
+        standby before dropping it from the synchronous replay set.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_lease_time</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the duration of 'leases' sent by the primary server to
+        standbys granting them the right to run synchronous replay queries for
+        a limited time.  This affects the rate at which replacement leases
+        must be sent and the wait time if contact is lost with a standby.
+        This must be set to a value which is at least 4 times the maximum
+        possible difference in system clocks between the primary and standby
+        servers, as described in <xref linkend="synchronous-replay"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous_replay_standby_names">
+      <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>synchronous_replay_standby_names</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies a comma-separated list of standby names that can support
+        <firstterm>synchronous replay</firstterm>, as described in
+        <xref linkend="synchronous-replay"/>.  Follows the same convention
+        as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</literal></link>.
+        The default is <literal>*</literal>, matching all standbys.
+       </para>
+       <para>
+        This setting has no effect if <varname>synchronous_replay_max_lag</varname>
+        is not set.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6f57362df7f..5d503061373 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1158,11 +1158,12 @@ primary_slot_name = 'node_a_slot'
    </para>
 
    <para>
-    Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
-    cause each commit to wait until the current synchronous standbys report
-    that they have replayed the transaction, making it visible to user
-    queries.  In simple cases, this allows for load balancing with causal
-    consistency.
+    Setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> will cause each commit to wait until
+    the current synchronous standbys report that they have replayed the
+    transaction, making it visible to user queries.  In simple cases, this
+    allows for load balancing with causal consistency.  See also
+    <xref linkend="synchronous-replay"/>.
    </para>
 
    <para>
@@ -1360,6 +1361,122 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="synchronous-replay">
+   <title>Synchronous replay</title>
+   <indexterm>
+    <primary>synchronous replay</primary>
+    <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+    The synchronous replay feature allows read-only queries to run on hot
+    standby servers without exposing stale data to the client, providing a
+    form of causal consistency.  Transactions can run on any standby with the
+    following guarantee about the visibility of preceding transactions: If you
+    set <varname>synchronous_replay</varname> to <literal>on</literal> in any
+    pair of consecutive transactions tx1, tx2 where tx2 begins after tx1
+    successfully returns, then tx2 will either see tx1 or fail with a new
+    error "standby is not available for synchronous replay", no matter which
+    server it runs on.  Although the guarantee is expressed in terms of two
+    individual transactions, the GUC can also be set at session, role or
+    system level to make the guarantee generally, allowing for load balancing
+    of applications that were not designed with load balancing in mind.
+   </para>
+
+   <para>
+    In order to enable the
+    feature, <varname>synchronous_replay_max_lag</varname> must be set to a
+    non-zero value on the primary server.  The
+    GUC <varname>synchronous_replay_standby_names</varname> can be used to
+    limit the set of standbys that can join the dynamic set of synchronous
+    replay standbys by providing a comma-separated list of application names.
+    By default, all standbys are candidates, if the feature is enabled.
+   </para>
+
+   <para>
+    The current set of servers that the primary considers to be available for
+    synchronous replay can be seen in
+    the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</literal></link>
+    view.  Administrators, applications and load balancing middleware can use
+    this view to discover standbys that can currently handle synchronous
+    replay transactions without raising the error.  Since that information is
+    only an instantaneous snapshot, clients should still be prepared for
+    the error to be raised at any time, and consider redirecting transactions
+    to another standby.
+   </para>
+
+   <para>
+    The advantages of the synchronous replay feature over simply
+    setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> are:
+    <orderedlist>
+      <listitem>
+       <para>
+        It provides certainty about exactly which standbys can see a
+        transaction.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It places a configurable limit on how much replay lag (and therefore
+        delay at commit time) the primary tolerates from standbys before it
+        drops them from the dynamic set of standbys it waits for.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It upholds the synchronous replay guarantee during the transitions that
+        occur when standbys are added to or removed from the set of standbys,
+        including scenarios where contact has been lost between the primary
+        and a standby but the standby is still alive and running client
+        queries.
+       </para>
+      </listitem>
+    </orderedlist>
+   </para>
+
+   <para>
+    The protocol used to uphold the guarantee even in the case of network
+    failure depends on the system clocks of the primary and standby servers
+    being synchronized, with an allowance for a difference up to one quarter
+    of <varname>synchronous_replay_lease_time</varname>.  For example,
+    if <varname>synchronous_replay_lease_time</varname> is set
+    to <literal>5s</literal>, then the clocks must not be more than 1.25
+    seconds apart for the guarantee to be upheld reliably during transitions.
+    The ubiquity of the Network Time Protocol (NTP) on modern operating
+    systems and availability of high quality time servers makes it possible to
+    choose a tolerance significantly higher than the maximum expected clock
+    difference.  An effort is nevertheless made to detect and report
+    misconfigured and faulty systems with clock differences greater than the
+    configured tolerance.
+   </para>
+
+   <note>
+    <para>
+     Current hardware clocks, NTP implementations and public time servers are
+     unlikely to allow the system clocks to differ by more than tens or hundreds
+     of milliseconds, and systems synchronized with dedicated local time
+     servers may be considerably more accurate, but you should only consider
+     setting <varname>synchronous_replay_lease_time</varname> below the
+     default of 5 seconds (allowing up to 1.25 seconds of clock difference)
+     after researching your time synchronization infrastructure thoroughly.
+    </para>
+   </note>
+
+   <note>
+    <para>
+      While similar to synchronous commit in the sense that both involve the
+      primary server waiting for responses from standby servers, the
+      synchronous replay feature is not concerned with avoiding data loss.  A
+      primary configured for synchronous replay will drop all standbys that
+      stop responding or replay too slowly from the dynamic set that it waits
+      for, so you should consider configuring both synchronous replication and
+      synchronous replay if you need data loss avoidance guarantees and causal
+      consistency guarantees for load balancing.
+    </para>
+   </note>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous archiving in standby</title>
 
@@ -1708,7 +1825,17 @@ if (!triggered)
     so there will be a measurable delay between primary and standby. Running the
     same query nearly simultaneously on both primary and standby might therefore
     return differing results. We say that data on the standby is
-    <firstterm>eventually consistent</firstterm> with the primary.  Once the
+    <firstterm>eventually consistent</firstterm> with the primary by default.
+    The data visible to a transaction running on a standby can be
+    made <firstterm>causally consistent</firstterm> with respect to a
+    transaction that has completed on the primary by
+    setting <varname>synchronous_replay</varname> to <literal>on</literal> in
+    both transactions.  For more details,
+    see <xref linkend="synchronous-replay"/>.
+   </para>
+
+   <para>
+    Once the
     commit record for a transaction is replayed on the standby, the changes
     made by that transaction will be visible to any new snapshots taken on
     the standby.  Snapshots may be taken at the start of each query or at the
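The clock tolerance rule in the documentation hunk above (one quarter of
synchronous_replay_lease_time) can be sketched as a small calculation. This
is only an illustration of the arithmetic, not code from the patch: the
helper names max_clock_skew_ms and clocks_within_tolerance are hypothetical,
and times are assumed to be plain millisecond counts.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): the documented tolerance is
 * one quarter of the lease time.
 */
static int64_t
max_clock_skew_ms(int64_t lease_time_ms)
{
	assert(lease_time_ms >= 0);
	return lease_time_ms / 4;
}

/*
 * Hypothetical check: returns 1 if two clock readings differ by no more
 * than the tolerance implied by the lease time, else 0.
 */
static int
clocks_within_tolerance(int64_t primary_now_ms, int64_t standby_now_ms,
						int64_t lease_time_ms)
{
	int64_t		diff = primary_now_ms - standby_now_ms;

	if (diff < 0)
		diff = -diff;
	return diff <= max_clock_skew_ms(lease_time_ms);
}
```

With the default lease time of 5s this gives a tolerance of 1250 ms, matching
the 1.25 seconds quoted in the documentation text.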
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0484cfa77ad..6e98f0eac9a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1912,6 +1912,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        </itemizedlist>
      </entry>
     </row>
+    <row>
+     <entry><structfield>sync_replay</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Synchronous replay state of this standby server.  This field will
+     be non-null only if <varname>synchronous_replay_max_lag</varname> is set.
+     If a standby is in <literal>available</literal> state, then it can
+     currently serve synchronous replay queries.  If it is not replaying fast
+     enough or not responding to keepalive messages, it will be
+     in <literal>unavailable</literal> state, and if it is currently
+     transitioning to availability it will be in <literal>joining</literal>
+     state for a short time.</entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 875be180fe4..e7d808277a8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5255,7 +5255,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	 * Check if the caller would like to ask standbys for immediate feedback
 	 * once this commit is applied.
 	 */
-	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
 		xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
 
 	/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 72515524199..e99bc5bfb8e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -734,7 +734,8 @@ CREATE VIEW pg_stat_replication AS
             W.flush_lag,
             W.replay_lag,
             W.sync_priority,
-            W.sync_state
+            W.sync_state,
+            W.sync_replay
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8a5b2b3b420..d2e1da33725 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY:
+			event_name = "SyncReplay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
@@ -3711,6 +3714,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_APPLY_DELAY:
 			event_name = "RecoveryApplyDelay";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+			event_name = "SyncReplayLeaseRevoke";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2054abe6532..e921444a682 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1194,6 +1194,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						TimestampTz timestamp;
 						bool		reply_requested;
 
+						(void) pq_getmsgint64(&s); /* skip messageNumber */
 						end_lsn = pq_getmsgint64(&s);
 						timestamp = pq_getmsgint64(&s);
 						reply_requested = pq_getmsgbyte(&s);
@@ -1401,6 +1402,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, writepos);	/* apply */
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
+	pq_sendint64(reply_message, -1);		/* replyTo */
 
 	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 75d26817192..a595acc750f 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 					 XLogRecPtr *flushPtr,
@@ -128,6 +137,229 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Check if we can stop waiting for synchronous replay.  We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1.  All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2.  All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for.  The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+						  int *waitingFor,
+						  long *stallTimeMillis)
+{
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz stallTime = 0;
+	int i;
+
+	/* Count how many joining/available nodes we are waiting for. */
+	*waitingFor = 0;
+
+	for (i = 0; i < max_wal_senders; ++i)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		if (walsnd->pid != 0)
+		{
+			/*
+			 * We need to hold the spinlock to read LSNs, because we can't be
+			 * sure they can be read atomically.
+			 */
+			SpinLockAcquire(&walsnd->mutex);
+			if (walsnd->pid != 0)
+			{
+				switch (walsnd->syncReplayState)
+				{
+				case SYNC_REPLAY_UNAVAILABLE:
+					/* Nothing to wait for. */
+					break;
+				case SYNC_REPLAY_JOINING:
+				case SYNC_REPLAY_AVAILABLE:
+					/*
+					 * We have to wait until this standby tells us that it has
+					 * replayed the commit record.
+					 */
+					if (walsnd->apply < XactCommitLSN)
+						++*waitingFor;
+					break;
+				case SYNC_REPLAY_REVOKING:
+					/*
+					 * We have to hold up commits until this standby
+					 * acknowledges that its lease was revoked, or we know the
+					 * most recently sent lease has expired anyway, whichever
+					 * comes first.  One way or the other, we don't release
+					 * until this standby has started raising an error for
+					 * synchronous replay transactions.
+					 */
+					if (walsnd->revokingUntil > now)
+					{
+						++*waitingFor;
+						stallTime = Max(stallTime, walsnd->revokingUntil);
+					}
+					break;
+				}
+			}
+			SpinLockRelease(&walsnd->mutex);
+		}
+	}
+
+	/*
+	 * If a walsender has exited uncleanly, then it writes its revoking wait
+	 * time into a shared space before it gives up its WalSnd slot.  So we
+	 * have to wait for that too.
+	 */
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	if (WalSndCtl->revokingUntil > now)
+	{
+		long seconds;
+		int usecs;
+
+		/* Compute how long we have to wait, rounded up to nearest ms. */
+		TimestampDifference(now, WalSndCtl->revokingUntil,
+							&seconds, &usecs);
+		*stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+	}
+	else
+		*stallTimeMillis = 0;
+	LWLockRelease(SyncRepLock);
+
+	/* We are done if we are not waiting for any nodes or stalls. */
+	return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all standbys in "available" and "joining" states to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked.  By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+	long stallTimeMillis;
+	int waitingFor;
+	char *ps_display_buffer = NULL;
+
+	for (;;)
+	{
+		/* Reset latch before checking state. */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Join the queue to be woken up if any synchronous replay
+		 * joining/available standby applies XactCommitLSN or the set of
+		 * synchronous replay standbys changes (if we aren't already in the
+		 * queue).  We don't actually know if we need to wait for any peers to
+		 * reach the target LSN yet, but we have to register just in case
+		 * before checking the walsenders' state to avoid a race condition
+		 * that could occur if we did it after calling
+		 * SyncReplayCommitCanReturn.  (SyncRepWaitForLSN doesn't have
+		 * to do this because it can check the highest-seen LSN in
+		 * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+		 * lock as the queues.  We can't do that here, because there is no
+		 * single highest-seen LSN that is useful.  We must check
+		 * walsnd->apply for all relevant walsenders.  Therefore we must
+		 * register for notifications first, so that we can be notified via
+		 * our latch of any standby applying the LSN we're interested in after
+		 * we check but before we start waiting, or we could wait forever for
+		 * something that has already happened.)
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		if (MyProc->syncRepState != SYNC_REP_WAITING)
+		{
+			MyProc->waitLSN = XactCommitLSN;
+			MyProc->syncRepState = SYNC_REP_WAITING;
+			SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+			Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+		}
+		LWLockRelease(SyncRepLock);
+
+		/* Check if we're done. */
+		if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+									  &stallTimeMillis))
+		{
+			SyncRepCancelWait();
+			break;
+		}
+
+		Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+		/* If we aren't actually waiting for any standbys, leave the queue. */
+		if (waitingFor == 0)
+			SyncRepCancelWait();
+
+		/* Update the ps title. */
+		if (update_process_title)
+		{
+			char buffer[80];
+
+			/* Remember the old value if this is our first update. */
+			if (ps_display_buffer == NULL)
+			{
+				int len;
+				const char *ps_display = get_ps_display(&len);
+
+				ps_display_buffer = palloc(len + 1);
+				memcpy(ps_display_buffer, ps_display, len);
+				ps_display_buffer[len] = '\0';
+			}
+
+			snprintf(buffer, sizeof(buffer),
+					 "waiting for %d peer(s) to apply %X/%X%s",
+					 waitingFor,
+					 (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+					 stallTimeMillis > 0 ? " (revoking)" : "");
+			set_ps_display(buffer, false);
+		}
+
+		/* Check if we need to exit early due to postmaster death etc. */
+		if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+			break;
+
+		/*
+		 * If we are still waiting for peers, then we wait for any joining or
+		 * available peer to reach the LSN (or possibly stop being in one of
+		 * those states or go away).
+		 *
+		 * If not, there must be a non-zero stall time, so we wait for that to
+		 * elapse.
+		 */
+		if (waitingFor > 0)
+			WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+					  WAIT_EVENT_SYNC_REPLAY);
+		else
+			WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+					  stallTimeMillis,
+					  WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+	}
+
+	/* There is no way out of the loop that could leave us in the queue. */
+	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	MyProc->waitLSN = 0;
+
+	/* Restore the ps display. */
+	if (ps_display_buffer != NULL)
+	{
+		set_ps_display(ps_display_buffer, false);
+		pfree(ps_display_buffer);
+	}
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -149,11 +381,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
-	/* Cap the level for anything other than commit to remote flush only. */
-	if (commit)
-		mode = SyncRepWaitMode;
-	else
-		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+	/* Wait for synchronous replay, if configured. */
+	if (synchronous_replay)
+		SyncReplayWaitForLSN(lsn);
 
 	/*
 	 * Fast exit if user has not requested sync replication.
@@ -167,6 +397,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
+	/* Cap the level for anything other than commit to remote flush only. */
+	if (commit)
+		mode = SyncRepWaitMode;
+	else
+		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 	 * set.  See SyncRepUpdateSyncStandbysDefined.
@@ -227,57 +463,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
 			break;
 
-		/*
-		 * If a wait for synchronous replication is pending, we can neither
-		 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
-		 * lead the client to believe that the transaction aborted, which is
-		 * not true: it's already committed locally. The former is no good
-		 * either: the client has requested synchronous replication, and is
-		 * entitled to assume that an acknowledged commit is also replicated,
-		 * which might not be true. So in this case we issue a WARNING (which
-		 * some clients may be able to interpret) and shut off further output.
-		 * We do NOT reset ProcDiePending, so that the process will die after
-		 * the commit is cleaned up.
-		 */
-		if (ProcDiePending)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
+		/* Check if we need to break early due to cancel/shutdown/death. */
+		if (SyncRepCheckForEarlyExit())
 			break;
-		}
-
-		/*
-		 * It's unclear what to do if a query cancel interrupt arrives.  We
-		 * can't actually abort at this point, but ignoring the interrupt
-		 * altogether is not helpful, so we just terminate the wait with a
-		 * suitable warning.
-		 */
-		if (QueryCancelPending)
-		{
-			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			SyncRepCancelWait();
-			break;
-		}
-
-		/*
-		 * If the postmaster dies, we'll probably never get an
-		 * acknowledgement, because all the wal sender processes will exit. So
-		 * just bail out.
-		 */
-		if (!PostmasterIsAlive())
-		{
-			ProcDiePending = true;
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
-			break;
-		}
 
 		/*
 		 * Wait on latch.  Any condition that should wake us up will set the
@@ -399,15 +587,66 @@ SyncRepInitConfig(void)
 	}
 }
 
+/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+	char *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		found = false;
+
+	/* If the feature is disabled, then no. */
+	if (synchronous_replay_max_lag == 0)
+		return false;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(synchronous_replay_standby_names);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		list_free(elemlist);
+		/* GUC machinery will have already complained - no need to do again */
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *standby_name = (char *) lfirst(l);
+
+		if (pg_strcasecmp(standby_name, application_name) == 0 ||
+			pg_strcasecmp(standby_name, "*") == 0)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return found;
+}
+
 /*
  * Update the LSNs on each queue based upon our latest state. This
  * implements a simple policy of first-valid-sync-standby-releases-waiter.
  *
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
  * Other policies are possible, which would change what we do here and
  * perhaps also which information we store as well.
  */
 void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	XLogRecPtr	writePtr;
@@ -421,13 +660,15 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
-	 * potential sync standbys then we have nothing to do. If we are still
-	 * starting up, still running base backup or the current flush position is
-	 * still invalid, then leave quickly also.
+	 * potential sync standbys and not in a state that synchronous_replay waits
+	 * for, then we have nothing to do. If we are still starting up, still
+	 * running base backup or the current flush position is still invalid,
+	 * then leave quickly also.
 	 */
-	if (MyWalSnd->sync_standby_priority == 0 ||
-		MyWalSnd->state < WALSNDSTATE_STREAMING ||
-		XLogRecPtrIsInvalid(MyWalSnd->flush))
+	if (!am_syncreplay_blocker &&
+		(MyWalSnd->sync_standby_priority == 0 ||
+		 MyWalSnd->state < WALSNDSTATE_STREAMING ||
+		 XLogRecPtrIsInvalid(MyWalSnd->flush)))
 	{
 		announce_next_takeover = true;
 		return;
@@ -465,9 +706,10 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
+	 * managing a sync standby or a standby in synchronous replay state that
+	 * blocks then just leave.
 	 */
-	if (!got_recptr || !am_sync)
+	if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -476,24 +718,36 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
-	 * this location.
+	 * this location, for backends waiting for synchronous commit.
 	 */
-	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+	if (got_recptr && am_sync)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+			numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+			numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+			numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+		}
 	}
 
+	/*
+	 * Wake backends that are waiting for synchronous_replay, if this walsender
+	 * manages a standby that is in synchronous replay 'available' or 'joining'
+	 * state.
+	 */
+	if (am_syncreplay_blocker)
+		SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+						 MyWalSnd->apply);
+
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -991,9 +1245,8 @@ SyncRepGetStandbyPriority(void)
  * Must hold SyncRepLock.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
 {
-	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
 	PGPROC	   *thisproc = NULL;
 	int			numprocs = 0;
@@ -1010,7 +1263,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Assume the queue is ordered by LSN
 		 */
-		if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+		if (!all && lsn < proc->waitLSN)
 			return numprocs;
 
 		/*
@@ -1077,7 +1330,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
 		}
 
 		/*
@@ -1128,6 +1381,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
 }
 #endif
 
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+	/*
+	 * If a wait for synchronous replication is pending, we can neither
+	 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
+	 * lead the client to believe that the transaction aborted, which is
+	 * not true: it's already committed locally. The former is no good
+	 * either: the client has requested synchronous replication, and is
+	 * entitled to assume that an acknowledged commit is also replicated,
+	 * which might not be true. So in this case we issue a WARNING (which
+	 * some clients may be able to interpret) and shut off further output.
+	 * We do NOT reset ProcDiePending, so that the process will die after
+	 * the commit is cleaned up.
+	 */
+	if (ProcDiePending)
+	{
+		ereport(WARNING,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * It's unclear what to do if a query cancel interrupt arrives.  We
+	 * can't actually abort at this point, but ignoring the interrupt
+	 * altogether is not helpful, so we just terminate the wait with a
+	 * suitable warning.
+	 */
+	if (QueryCancelPending)
+	{
+		QueryCancelPending = false;
+		ereport(WARNING,
+				(errmsg("canceling wait for synchronous replication due to user request"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * If the postmaster dies, we'll probably never get an
+	 * acknowledgement, because all the wal sender processes will exit. So
+	 * just bail out.
+	 */
+	if (!PostmasterIsAlive())
+	{
+		ProcDiePending = true;
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ===========================================================
  * Synchronous Replication functions executed by any process
@@ -1197,6 +1508,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
 	SyncRepConfig = (SyncRepConfigData *) extra;
 }
 
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return true;
+}
+
 void
 assign_synchronous_commit(int newval, void *extra)
 {
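For readers following the syncrep.c changes above, the per-standby decision
made by SyncReplayCommitCanReturn can be sketched in isolation. This is a
simplified model, not the patch's code: the Sketch* types and sketch_*
functions are hypothetical stand-ins for WalSnd and its states, locking is
omitted, and the extra stall for walsenders that exited uncleanly
(WalSndCtl->revokingUntil) is left out. The second helper reproduces the
patch's rounding of a (seconds, microseconds) difference up to whole
milliseconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's synchronous replay states. */
typedef enum
{
	SKETCH_UNAVAILABLE,
	SKETCH_JOINING,
	SKETCH_AVAILABLE,
	SKETCH_REVOKING
} SketchState;

/* Hypothetical, lock-free stand-in for the fields read from WalSnd. */
typedef struct
{
	SketchState state;
	uint64_t	apply_lsn;		/* last LSN the standby reports as applied */
	int64_t		revoking_until; /* time at which a revoked lease expires */
} SketchStandby;

/*
 * Returns true if no standby blocks the commit: joining/available standbys
 * must have applied commit_lsn, and revoking standbys must be past their
 * revoking_until time.  *waiting_for is set to the number of blockers.
 */
static bool
sketch_commit_can_return(const SketchStandby *standbys, int n,
						 uint64_t commit_lsn, int64_t now, int *waiting_for)
{
	*waiting_for = 0;

	for (int i = 0; i < n; i++)
	{
		switch (standbys[i].state)
		{
			case SKETCH_UNAVAILABLE:
				/* Nothing to wait for. */
				break;
			case SKETCH_JOINING:
			case SKETCH_AVAILABLE:
				if (standbys[i].apply_lsn < commit_lsn)
					++*waiting_for;
				break;
			case SKETCH_REVOKING:
				if (standbys[i].revoking_until > now)
					++*waiting_for;
				break;
		}
	}
	return *waiting_for == 0;
}

/* Round a (seconds, microseconds) interval up to whole milliseconds. */
static long
sketch_round_up_ms(long seconds, int usecs)
{
	return seconds * 1000 + (usecs + 999) / 1000;
}
```

The sketch shows why an 'unavailable' standby never holds up a commit, while
a 'revoking' one does until its lease can no longer be in effect.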
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 6f4b3538ac4..291144ee600 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -58,6 +58,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -140,9 +141,10 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
 static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+								  TimestampTz *syncReplayLease);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -480,7 +482,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, -1);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -525,7 +527,7 @@ WalReceiverMain(void)
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, -1);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -583,7 +585,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, -1);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -882,6 +884,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 	XLogRecPtr	walEnd;
 	TimestampTz sendTime;
 	bool		replyRequested;
+	TimestampTz syncReplayLease;
+	int64		messageNumber;
 
 	resetStringInfo(&incoming_message);
 
@@ -901,7 +905,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				dataStart = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, NULL);
 
 				buf += hdrlen;
 				len -= hdrlen;
@@ -911,7 +915,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 		case 'k':				/* Keepalive */
 			{
 				/* copy message to StringInfo */
-				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+					sizeof(char) + sizeof(int64);
 				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -919,15 +924,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
 
 				/* read the fields */
+				messageNumber = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
 				replyRequested = pq_getmsgbyte(&incoming_message);
+				syncReplayLease = pq_getmsgint64(&incoming_message);
 
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, messageNumber);
 				break;
 			}
 		default:
@@ -1090,7 +1097,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, -1);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1108,9 +1115,12 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbearts, when approaching
  * wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should be that message's number, otherwise -1.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
@@ -1157,6 +1167,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+	pq_sendint64(&reply_message, replyTo);
 
 	/* Send it */
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1289,15 +1300,56 @@ XLogWalRcvSendHSFeedback(bool immed)
  * Update shared memory status upon receiving a message from primary.
  *
  * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary.  'syncReplayLease' points to the time until
+ * which the primary promises this standby may serve synchronous replay
+ * reads, points to 0 if it may not, or is NULL for no change.
  */
 static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+					  TimestampTz *syncReplayLease)
 {
 	WalRcvData *walrcv = WalRcv;
 
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
 
+	/* Sanity check for the syncReplayLease time. */
+	if (syncReplayLease != NULL && *syncReplayLease != 0)
+	{
+		/*
+		 * Deduce max_clock_skew from the syncReplayLease and sendTime since
+		 * we don't have access to the primary's GUC.  The primary already
+		 * subtracted 25% from synchronous_replay_lease_time to represent
+		 * max_clock_skew, so we have 75%.  A third of that will give us 25%.
+		 */
+		int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+		int64 max_clock_skew = diffMillis / 3;
+		if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+												   max_clock_skew))
+		{
+			/*
+			 * The primary's clock is more than max_clock_skew + network
+			 * latency ahead of the standby's clock.  (If the primary's clock
+			 * is more than max_clock_skew ahead of the standby's clock, but
+			 * by less than the network latency, then there isn't much we can
+			 * do to detect that; but it still seems useful to have this basic
+			 * sanity check for wildly misconfigured servers.)
+			 */
+			ereport(LOG,
+					(errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+					 errhint("Check your servers' NTP configuration or equivalent.")));
+
+			syncReplayLease = NULL;
+		}
+		/*
+		 * We could also try to detect cases where sendTime is more than
+		 * max_clock_skew in the past according to the standby's clock, but
+		 * that is indistinguishable from network latency/buffering, so we
+		 * could produce misleading error messages; if we do nothing, the
+		 * consequence is 'standby is not available for synchronous replay'
+		 * errors which should cause the user to investigate.
+		 */
+	}
+
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
@@ -1305,6 +1357,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+	if (syncReplayLease != NULL)
+		walrcv->syncReplayLease = *syncReplayLease;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (log_min_messages <= DEBUG2)
@@ -1342,7 +1396,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
  */
 void
 WalRcvForceReply(void)
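
Note on the keepalive wire format: with this patch the 'k' message body grows from 17 to 33 bytes (messageNumber, walEnd, sendTime, replyRequested, syncReplayLease, in that order). A minimal standalone sketch of decoding that layout follows. It assumes the fields are sent in network byte order, as pq_sendint64 does; the names KeepaliveFields, read_be64 and decode_keepalive are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Body length after the 'k' type byte, mirroring hdrlen in the patch. */
#define KEEPALIVE_BODY_LEN (8 + 8 + 8 + 1 + 8)

/* Hypothetical container for the decoded fields (not part of the patch). */
typedef struct KeepaliveFields
{
	int64_t		message_number;
	uint64_t	wal_end;
	int64_t		send_time;
	int			reply_requested;
	int64_t		sync_replay_lease;	/* 0 means no lease / lease revoked */
} KeepaliveFields;

/* Read a big-endian (network order) 64-bit integer. */
static uint64_t
read_be64(const unsigned char *p)
{
	uint64_t	v = 0;
	int			i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Decode a keepalive body; returns 0 on success, -1 on a length mismatch. */
static int
decode_keepalive(const unsigned char *buf, size_t len, KeepaliveFields *out)
{
	assert(buf != NULL && out != NULL);

	if (len != KEEPALIVE_BODY_LEN)
		return -1;

	out->message_number = (int64_t) read_be64(buf);
	out->wal_end = read_be64(buf + 8);
	out->send_time = (int64_t) read_be64(buf + 16);
	out->reply_requested = buf[24] != 0;
	out->sync_replay_lease = (int64_t) read_be64(buf + 25);
	return 0;
}
```

Any frontend that parses keepalives by fixed offsets has to skip the extra leading 8 bytes, which is what the pg_recvlogical.c and receivelog.c hunks in this patch do.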
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 67b1a074cce..600f974668c 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
 #include "replication/walreceiver.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/guc.h"
 #include "utils/timestamp.h"
 
 WalRcvData *WalRcv = NULL;
@@ -376,3 +377,21 @@ GetReplicationTransferLatency(void)
 
 	return ms;
 }
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+	WalRcvData *walrcv = WalRcv;
+	TimestampTz now = GetCurrentTimestamp();
+	bool result;
+
+	SpinLockAcquire(&walrcv->mutex);
+	result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+	SpinLockRelease(&walrcv->mutex);
+
+	return result;
+}
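
Note on the standby-side lease handling: the two checks above (the clock sanity check in ProcessWalSndrMessage and the expiry test in WalRcvSyncReplayAvailable) can be tried out in isolation with the sketch below. It is only an illustration under stated assumptions: timestamps are plain microsecond counts standing in for TimestampTz, and the names TimestampUs, lease_passes_clock_check and lease_is_valid are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampUs;	/* microseconds, standing in for TimestampTz */

/*
 * Returns false if the primary's send time is further ahead of our receipt
 * time than the skew allowance deduced from the lease.  The lease sent on
 * the wire is sendTime + 75% of the lease time, so a third of the difference
 * gives the 25% allowance.
 */
static bool
lease_passes_clock_check(TimestampUs lease, TimestampUs send_time,
						 TimestampUs receipt_time)
{
	int64_t		diff_ms;
	int64_t		max_clock_skew_ms;

	assert(lease != 0);			/* 0 means "no lease", nothing to check */

	diff_ms = (lease - send_time) / 1000;
	max_clock_skew_ms = diff_ms / 3;
	return send_time <= receipt_time + max_clock_skew_ms * 1000;
}

/* A lease is usable only if one was granted and it has not yet expired. */
static bool
lease_is_valid(TimestampUs lease, TimestampUs now)
{
	return lease != 0 && now <= lease;
}
```

With the default synchronous_replay_lease_time of 5s, the lease on the wire is sendTime + 3750ms, so the deduced allowance is 1250ms: a primary whose clock appears 1s ahead passes the check, while one that appears 2s ahead is rejected and the previously stored lease is left unchanged.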
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 370429d746c..261dede5087 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -173,6 +173,18 @@ static TimestampTz last_reply_timestamp = 0;
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
 
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
 /*
  * While streaming WAL in Copy mode, streamingDoneSending is set to true
  * after we have sent CopyDone. We should not send any more CopyData messages
@@ -242,7 +254,7 @@ static void ProcessStandbyMessage(void);
 static void ProcessStandbyReplyMessage(void);
 static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(void);
 static void WalSndCheckTimeOut(void);
 static long WalSndComputeSleeptime(TimestampTz now);
@@ -285,6 +297,61 @@ InitWalSender(void)
 	memset(&LagTracker, 0, sizeof(LagTracker));
 }
 
+/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * We've lost contact with the standby, but it may still be alive.  We
+		 * can't let any committing synchronous_replay transactions return
+		 * control until we've stalled for long enough for a zombie standby to
+		 * start raising errors because its lease has expired.  Because our
+		 * WalSnd slot is going away, we need to use the shared
+		 * WalSndCtl->revokingUntil variable.
+		 */
+		elog(LOG,
+			 "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+			 application_name);
+
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+									   synchronous_replay_last_lease);
+		LWLockRelease(SyncRepLock);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * The standby is shutting down, so it won't be running any more
+		 * transactions.  It is therefore safe to stop waiting for it without
+		 * any kind of lease revocation protocol.
+		 */
+		elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
 /*
  * Clean up after an error.
  *
@@ -313,7 +380,10 @@ WalSndErrorCleanup(void)
 	replication_active = false;
 
 	if (got_STOPPING || got_SIGUSR2)
+	{
+		PrepareUncleanExit();
 		proc_exit(0);
+	}
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
@@ -325,6 +395,8 @@ WalSndErrorCleanup(void)
 static void
 WalSndShutdown(void)
 {
+	PrepareUncleanExit();
+
 	/*
 	 * Reset whereToSendOutput to prevent ereport from attempting to send any
 	 * more messages to the standby.
@@ -1612,6 +1684,7 @@ ProcessRepliesIfAny(void)
 		if (r < 0)
 		{
 			/* unexpected error or EOF */
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1628,6 +1701,7 @@ ProcessRepliesIfAny(void)
 		resetStringInfo(&reply_message);
 		if (pq_getmessage(&reply_message, 0))
 		{
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1677,6 +1751,7 @@ ProcessRepliesIfAny(void)
 				 * 'X' means that the standby is closing down the socket.
 				 */
 			case 'X':
+				PrepareCleanExit();
 				proc_exit(0);
 
 			default:
@@ -1774,9 +1849,11 @@ ProcessStandbyReplyMessage(void)
 				flushLag,
 				applyLag;
 	bool		clearLagTimes;
+	int64		replyTo;
 	TimestampTz now;
 
 	static bool fullyAppliedLastTime = false;
+	static TimestampTz fullyAppliedSince = 0;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1784,6 +1861,7 @@ ProcessStandbyReplyMessage(void)
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);	/* sendTime; not used ATM */
 	replyRequested = pq_getmsgbyte(&reply_message);
+	replyTo = pq_getmsgint64(&reply_message);
 
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1798,17 +1876,17 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL for at least
+	 * wal_receiver_status_interval, then let's clear the lag times that were
+	 * measured when it last wrote/flushed/applied a WAL record.  This way we
+	 * avoid displaying stale lag data until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
 	if (applyPtr == sentPtr)
 	{
-		if (fullyAppliedLastTime)
+		if (!fullyAppliedLastTime)
+			fullyAppliedSince = now;
+		else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
 			clearLagTimes = true;
 		fullyAppliedLastTime = true;
 	}
@@ -1824,8 +1902,53 @@ ProcessStandbyReplyMessage(void)
 	 * standby.
 	 */
 	{
+		int			next_sr_state = -1;
 		WalSnd	   *walsnd = MyWalSnd;
 
+		/* Handle synchronous replay state machine. */
+		if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+		{
+			bool replay_lag_acceptable;
+
+			/* Check if the lag is acceptable (includes -1 for caught up). */
+			if (applyLag < synchronous_replay_max_lag * 1000)
+				replay_lag_acceptable = true;
+			else
+				replay_lag_acceptable = false;
+
+			/* Figure out whether the state needs to change. */
+			switch (walsnd->syncReplayState)
+			{
+			case SYNC_REPLAY_UNAVAILABLE:
+				/* Can we join? */
+				if (replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_JOINING;
+				break;
+			case SYNC_REPLAY_JOINING:
+				/* Are we still applying fast enough? */
+				if (replay_lag_acceptable)
+				{
+					/* Have we reached the join point yet? */
+					if (applyPtr >= synchronous_replay_joining_until)
+						next_sr_state = SYNC_REPLAY_AVAILABLE;
+				}
+				else
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			case SYNC_REPLAY_AVAILABLE:
+				/* Are we still applying fast enough? */
+				if (!replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_REVOKING;
+				break;
+			case SYNC_REPLAY_REVOKING:
+				/* Has the revocation been acknowledged or timed out? */
+				if (replyTo == synchronous_replay_revoke_msgno ||
+					now >= walsnd->revokingUntil)
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			}
+		}
+
 		SpinLockAcquire(&walsnd->mutex);
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
@@ -1836,11 +1959,55 @@ ProcessStandbyReplyMessage(void)
 			walsnd->flushLag = flushLag;
 		if (applyLag != -1 || clearLagTimes)
 			walsnd->applyLag = applyLag;
+		if (next_sr_state != -1)
+			walsnd->syncReplayState = next_sr_state;
+		if (next_sr_state == SYNC_REPLAY_REVOKING)
+			walsnd->revokingUntil = synchronous_replay_last_lease;
 		SpinLockRelease(&walsnd->mutex);
+
+		/*
+		 * Post shmem-update actions for synchronous replay state transitions.
+		 */
+		switch (next_sr_state)
+		{
+		case SYNC_REPLAY_JOINING:
+			/*
+			 * Now that we've started waiting for this standby, we need to
+			 * make sure that everything flushed before now has been applied
+			 * before we move to available and issue a lease.
+			 */
+			synchronous_replay_joining_until = GetFlushRecPtr();
+			ereport(LOG,
+					(errmsg("standby \"%s\" joining synchronous replay set...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_AVAILABLE:
+			/* Issue a new lease to the standby. */
+			WalSndKeepalive(false);
+			ereport(LOG,
+					(errmsg("standby \"%s\" is available for synchronous replay",
+							application_name)));
+			break;
+		case SYNC_REPLAY_REVOKING:
+			/* Revoke the standby's lease, and note the message number. */
+			synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+			ereport(LOG,
+					(errmsg("revoking synchronous replay lease for standby \"%s\"...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_UNAVAILABLE:
+			ereport(LOG,
+					(errmsg("standby \"%s\" is no longer available for synchronous replay",
+							application_name)));
+			break;
+		default:
+			/* No change. */
+			break;
+		}
 	}
 
 	if (!am_cascading_walsender)
-		SyncRepReleaseWaiters();
+		SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
 
 	/*
 	 * Advance our local xmin horizon when the client confirmed a flush.
@@ -2030,33 +2197,52 @@ ProcessStandbyHSFeedbackMessage(void)
  * If wal_sender_timeout is enabled we want to wake up in time to send
  * keepalives and to abort the connection if wal_sender_timeout has been
  * reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
  */
 static long
 WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;	/* 10 s */
 
-	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+	if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+		am_potential_synchronous_replay_standby)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
 		int			microsec_to_timeout;
 
-		/*
-		 * At the latest stop sleeping once wal_sender_timeout has been
-		 * reached.
-		 */
-		wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-												  wal_sender_timeout);
-
-		/*
-		 * If no ping has been sent yet, wakeup when it's time to do so.
-		 * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
-		 * the timeout passed without a response.
-		 */
-		if (!waiting_for_ping_response)
+		if (am_potential_synchronous_replay_standby)
+		{
+			/*
+			 * We need to keep replacing leases before they expire.  We'll do
+			 * that halfway through the lease time according to our clock, to
+			 * allow for the standby's clock to be ahead of the primary's by
+			 * 25% of synchronous_replay_lease_time.
+			 */
+			wakeup_time =
+				TimestampTzPlusMilliseconds(last_reply_timestamp,
+											synchronous_replay_lease_time / 2);
+		}
+		else
+		{
+			/*
+			 * At the latest stop sleeping once wal_sender_timeout has been
+			 * reached.
+			 */
 			wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-													  wal_sender_timeout / 2);
+													  wal_sender_timeout);
+
+			/*
+			 * If no ping has been sent yet, wakeup when it's time to do so.
+			 * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+			 * half of the timeout passed without a response.
+			 */
+			if (!waiting_for_ping_response)
+				wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+														  wal_sender_timeout / 2);
+		}
 
 		/* Compute relative time until wakeup. */
 		TimestampDifference(now, wakeup_time,
@@ -2080,20 +2266,33 @@ WalSndComputeSleeptime(TimestampTz now)
  * message every standby_message_timeout = wal_sender_timeout/6 = 10s.  We
  * could eliminate that problem by recognizing timeout expiration at
  * wal_sender_timeout/2 after the keepalive.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
  */
 static void
 WalSndCheckTimeOut(void)
 {
 	TimestampTz timeout;
+	int allowed_time;
 
 	/* don't bail out if we're doing something that doesn't require timeouts */
 	if (last_reply_timestamp <= 0)
 		return;
 
-	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
-										  wal_sender_timeout);
+	/*
+	 * If synchronous replay is configured, we use
+	 * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+	 * the time before an unresponsive synchronous replay standby is dropped.
+	 */
+	if (am_potential_synchronous_replay_standby)
+		allowed_time = synchronous_replay_lease_time;
+	else
+		allowed_time = wal_sender_timeout;
 
-	if (wal_sender_timeout > 0 && last_processing >= timeout)
+	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+										  allowed_time);
+	if (allowed_time > 0 && last_processing >= timeout)
 	{
 		/*
 		 * Since typically expiration of replication timeout means
@@ -2118,6 +2317,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
+	/* Check if we are managing a potential synchronous replay standby. */
+	am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
@@ -2283,6 +2485,7 @@ InitWalSenderSlot(void)
 			walsnd->flushLag = -1;
 			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
+			walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
@@ -3183,6 +3386,27 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+	switch (state)
+	{
+	case SYNC_REPLAY_UNAVAILABLE:
+		return "unavailable";
+	case SYNC_REPLAY_JOINING:
+		return "joining";
+	case SYNC_REPLAY_AVAILABLE:
+		return "available";
+	case SYNC_REPLAY_REVOKING:
+		return "revoking";
+	}
+	return "UNKNOWN";
+}
+
 static Interval *
 offset_to_interval(TimeOffset offset)
 {
@@ -3202,7 +3426,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	11
+#define PG_STAT_GET_WAL_SENDERS_COLS	12
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3256,6 +3480,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			priority;
 		int			pid;
 		WalSndState state;
+		SyncReplayState syncReplayState;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3268,6 +3493,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		pid = walsnd->pid;
 		sentPtr = walsnd->sentPtr;
 		state = walsnd->state;
+		syncReplayState = walsnd->syncReplayState;
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
@@ -3351,6 +3577,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[10] = CStringGetTextDatum("potential");
+
+			values[11] =
+				CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3366,21 +3595,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
   * This function is used to send a keepalive message to standby.
   * If requestReply is set, sets a flag in the message requesting the standby
   * to send a message back to us, for heartbeat purposes.
+  * Return the serial number of the message that was sent.
   */
-static void
+static int64
 WalSndKeepalive(bool requestReply)
 {
+	TimestampTz synchronous_replay_lease;
+	TimestampTz now;
+
+	static int64 message_number = 0;
+
 	elog(DEBUG2, "sending replication keepalive");
 
+	/* Grant a synchronous replay lease if appropriate. */
+	now = GetCurrentTimestamp();
+	if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+	{
+		/* No lease granted, and any earlier lease is revoked. */
+		synchronous_replay_lease = 0;
+	}
+	else
+	{
+		/*
+		 * Since this timestamp is being sent to the standby where it will be
+		 * compared against a time generated by the standby's system clock, we
+		 * must consider clock skew.  We use 25% of the lease time as max
+		 * clock skew, and we subtract that from the time we send with the
+		 * following reasoning:
+		 *
+		 * 1.  If the standby's clock is slow (ie behind the primary's) by up
+		 * to that much, then subtracting this amount makes sure the lease
+		 * doesn't survive past that time according to the primary's
+		 * clock.
+		 *
+		 * 2.  If the standby's clock is fast (ie ahead of the primary's) by
+		 * up to that much, then even after subtracting this amount there
+		 * won't be any gaps between leases, since leases are reissued every
+		 * time 50% of the lease time elapses (see WalSndKeepaliveIfNecessary
+		 * and WalSndComputeSleeptime).
+		 */
+		int max_clock_skew = synchronous_replay_lease_time / 4;
+
+		/* Compute and remember the expiry time of the lease we're granting. */
+		synchronous_replay_last_lease =
+			TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+		/* Adjust the version we send for clock skew. */
+		synchronous_replay_lease =
+			TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+										-max_clock_skew);
+	}
+
 	/* construct the message... */
 	resetStringInfo(&output_message);
 	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, ++message_number);
 	pq_sendint64(&output_message, sentPtr);
-	pq_sendint64(&output_message, GetCurrentTimestamp());
+	pq_sendint64(&output_message, now);
 	pq_sendbyte(&output_message, requestReply ? 1 : 0);
+	pq_sendint64(&output_message, synchronous_replay_lease);
 
 	/* ... and send it wrapped in CopyData */
 	pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+	return message_number;
 }
 
 /*
@@ -3395,19 +3672,30 @@ WalSndKeepaliveIfNecessary(void)
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
-	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
-		return;
-
-	if (waiting_for_ping_response)
-		return;
+	if (!am_potential_synchronous_replay_standby)
+	{
+		if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+			return;
+		if (waiting_for_ping_response)
+			return;
+	}
 
 	/*
 	 * If half of wal_sender_timeout has lapsed without receiving any reply
 	 * from the standby, send a keep-alive message to the standby requesting
 	 * an immediate reply.
+	 *
+	 * If synchronous replay has been configured, use
+	 * synchronous_replay_lease_time to control keepalive intervals rather
+	 * than wal_sender_timeout, so that we can keep replacing leases at the
+	 * right frequency.
 	 */
-	ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-											wal_sender_timeout / 2);
+	if (am_potential_synchronous_replay_standby)
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												synchronous_replay_lease_time / 2);
+	else
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												wal_sender_timeout / 2);
 	if (last_processing >= ping_time)
 	{
 		WalSndKeepalive(true);
@@ -3451,7 +3739,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 */
 	new_write_head = (LagTracker.write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
 	buffer_full = false;
-	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
 	{
 		if (new_write_head == LagTracker.read_heads[i])
 			buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 29efb3f6efc..9b10f9200bf 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -308,6 +308,7 @@ Section: Class 40 - Transaction Rollback
 40001    E    ERRCODE_T_R_SERIALIZATION_FAILURE                              serialization_failure
 40003    E    ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN                       statement_completion_unknown
 40P01    E    ERRCODE_T_R_DEADLOCK_DETECTED                                  deadlock_detected
+40P02    E    ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE                   synchronous_replay_not_available
 
 Section: Class 42 - Syntax Error or Access Rule Violation
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e9f542cfedd..cd80ef5a20b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1724,6 +1724,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+		 gettext_noop("Enables synchronous replay."),
+		 NULL
+		},
+		&synchronous_replay,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
@@ -3064,6 +3074,28 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_max_lag,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_lease_time,
+		5000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3800,6 +3832,17 @@ static struct config_string ConfigureNamesString[] =
 		check_synchronous_standby_names, assign_synchronous_standby_names, NULL
 	},
 
+	{
+		{"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("List of names of potential synchronous replay standbys."),
+			NULL,
+			GUC_LIST_INPUT
+		},
+		&synchronous_replay_standby_names,
+		"*",
+		check_synchronous_replay_standby_names, NULL, NULL
+	},
+
 	{
 		{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
 			gettext_noop("Sets default text search configuration."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521f..21c25b55ed2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -255,6 +255,17 @@
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
+#synchronous_replay_max_lag = 0s	# maximum replication delay to tolerate from
+					# standbys before dropping them from the synchronous
+					# replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s		# how long individual leases granted to
+					# synchronous replay standbys should last; should be 4 times
+					# the max possible clock skew
+
+#synchronous_replay_standby_names = '*'	# standby servers that can join the
+					# synchronous replay set; '*' = all
+
 # - Standby Servers -
 
 # These settings are ignored on a master server.
@@ -285,6 +296,14 @@
 					# (change requires restart)
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
 
+# - All Servers -
+
+#synchronous_replay = off			# "on" in any pair of consecutive
+					# transactions guarantees that the second
+					# can see the first (even if the second
+					# is run on a standby), or will raise an
+					# error to report that the standby is
+					# unavailable for synchronous replay
 
 #------------------------------------------------------------------------------
 # QUERY TUNING
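
Note on choosing synchronous_replay_lease_time: the "4 times the max possible clock skew" guidance follows from the arithmetic in WalSndKeepalive. The sketch below works through it for a single lease. It is only an illustration: times are plain microsecond counts, and LeaseGrant, compute_lease_grant and expiry_seen_in_primary_time are hypothetical names, not functions in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of one lease grant; absolute times in microseconds. */
typedef struct LeaseGrant
{
	int			max_clock_skew_ms;		/* 25% of the lease time */
	int			renewal_interval_ms;	/* leases are replaced at 50% */
	int64_t		expiry_on_primary;		/* what the primary remembers */
	int64_t		expiry_sent;			/* what the standby is told */
} LeaseGrant;

/* Compute the lease figures for a grant issued at 'now_us'. */
static LeaseGrant
compute_lease_grant(int64_t now_us, int lease_time_ms)
{
	LeaseGrant	g;

	assert(lease_time_ms >= 0);

	g.max_clock_skew_ms = lease_time_ms / 4;
	g.renewal_interval_ms = lease_time_ms / 2;
	g.expiry_on_primary = now_us + (int64_t) lease_time_ms * 1000;
	g.expiry_sent = g.expiry_on_primary -
		(int64_t) g.max_clock_skew_ms * 1000;
	return g;
}

/*
 * Primary-clock time at which a standby whose clock is offset by
 * 'standby_offset_ms' (positive = fast) sees the lease expire.
 */
static int64_t
expiry_seen_in_primary_time(const LeaseGrant *g, int standby_offset_ms)
{
	return g->expiry_sent - (int64_t) standby_offset_ms * 1000;
}
```

With the default of 5s, the standby is told the lease expires 3750ms after it was issued. A standby whose clock is 1250ms fast stops trusting it at 2500ms of primary time, which is exactly when the replacement lease is due, and a standby whose clock is 1250ms slow stops at 5000ms, which is when the primary considers the lease gone. Skew beyond a quarter of the lease time breaks one of those two properties.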
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29d..944cc7d4949 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
 #include "catalog/catalog.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -331,6 +333,17 @@ GetTransactionSnapshot(void)
 			elog(ERROR,
 				 "cannot take query snapshot during a parallel operation");
 
+		/*
+		 * In synchronous_replay mode on a standby, check if we have definitely
+		 * applied WAL for any COMMIT that returned successfully on the
+		 * primary.
+		 */
+		if (synchronous_replay && RecoveryInProgress() &&
+			!WalRcvSyncReplayAvailable())
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+					 errmsg("standby is not available for synchronous replay")));
+
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
 		 * end of xact regardless of what the caller does with it, so we must
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index a242e0be88b..2101243d155 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -118,7 +118,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	/*
@@ -151,6 +151,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	startpos = output_written_lsn;
 	last_written_lsn = output_written_lsn;
@@ -470,6 +472,8 @@ StreamLogicalLog(void)
 			 * rest.
 			 */
 			pos = 1;			/* skip msgtype 'k' */
+			pos += 8;			/* skip messageNumber */
+
 			walEnd = fe_recvint64(&copybuf[pos]);
 			output_written_lsn = Max(walEnd, output_written_lsn);
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 10768786301..a801224ad94 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -328,7 +328,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -346,6 +346,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
 	{
@@ -1016,6 +1018,7 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 	 * check if the server requested a reply, and ignore the rest.
 	 */
 	pos = 1;					/* skip msgtype 'k' */
+	pos += 8;					/* skip messageNumber */
 	pos += 8;					/* skip walEnd */
 	pos += 8;					/* skip sendTime */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 860571440a5..0e3b8aef730 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5181,9 +5181,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,text}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d59c24ae238..fefd074f434 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_SYNC_REPLAY
 } WaitEventIPC;
 
 /* ----------
@@ -845,7 +846,8 @@ typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
 	WAIT_EVENT_PG_SLEEP,
-	WAIT_EVENT_RECOVERY_APPLY_DELAY
+	WAIT_EVENT_RECOVERY_APPLY_DELAY,
+	WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index bc43b4e1090..6a5bfcbb9ce 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
 
 #include "access/xlogdefs.h"
 #include "utils/guc.h"
+#include "utils/timestamp.h"
 
 #define SyncRepRequested() \
 	(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
 #define SYNC_REP_WAIT_WRITE		0
 #define SYNC_REP_WAIT_FLUSH		1
 #define SYNC_REP_WAIT_APPLY		2
+#define SYNC_REP_WAIT_SYNC_REPLAY	3
 
-#define NUM_SYNC_REP_WAIT_MODE	3
+#define NUM_SYNC_REP_WAIT_MODE	4
 
 /* syncRepState */
 #define SYNC_REP_NOT_WAITING		0
@@ -36,6 +38,12 @@
 #define SYNC_REP_PRIORITY		0
 #define SYNC_REP_QUORUM		1
 
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
 
 /* called by wal sender */
 extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
 /* GUC infrastructure */
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
 extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 5913b580c2b..58709e2e9be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * syncReplayLease is the time until which the primary has authorized this
+	 * standby to consider itself available for synchronous_replay mode, or 0
+	 * for not authorized.
+	 */
+	TimestampTz syncReplayLease;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -313,4 +320,6 @@ extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
 
+extern bool WalRcvSyncReplayAvailable(void);
+
 #endif							/* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 4b904779361..0909a64bdad 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
 	WALSNDSTATE_STOPPING
 } WalSndState;
 
+typedef enum SyncReplayState
+{
+	SYNC_REPLAY_UNAVAILABLE = 0,
+	SYNC_REPLAY_JOINING,
+	SYNC_REPLAY_AVAILABLE,
+	SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
 /*
  * Each walsender has a WalSnd struct in shared memory.
  *
@@ -60,6 +68,10 @@ typedef struct WalSnd
 	TimeOffset	flushLag;
 	TimeOffset	applyLag;
 
+	/* Synchronous replay state for this walsender. */
+	SyncReplayState syncReplayState;
+	TimestampTz revokingUntil;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
@@ -101,6 +113,14 @@ typedef struct
 	 */
 	bool		sync_standbys_defined;
 
+	/*
+	 * Until when must commits in synchronous replay stall?  This is used to
+	 * wait for synchronous replay leases to expire when a walsender exits
+	 * uncleanly, and we must stall synchronous replay commits until we're
+	 * sure that the remote server's lease has expired.
+	 */
+	TimestampTz	revokingUntil;
+
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 078129f251b..5b6a17e59be 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1861,9 +1861,10 @@ pg_stat_replication| SELECT s.pid,
     w.flush_lag,
     w.replay_lag,
     w.sync_priority,
-    w.sync_state
+    w.sync_state,
+    w.sync_replay
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.17.0
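
The frontend changes in the patch above (pg_recvlogical.c and receivelog.c)
grow the standby reply buffer from 1 + 8 + 8 + 8 + 8 + 1 to
1 + 8 + 8 + 8 + 8 + 1 + 8 bytes and append a replyTo field, sending -1 when
the reply does not answer a particular message.  A minimal sketch of that
layout follows.  The helper names put_int64_be() and build_standby_reply()
are hypothetical and not part of the patch; put_int64_be() only stands in for
fe_sendint64(), on the assumption that it writes the value in network byte
order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* msgtype + write + flush + apply + sendTime + replyRequested + replyTo */
#define REPLY_MSG_SIZE (1 + 8 + 8 + 8 + 8 + 1 + 8)

/* Hypothetical stand-in for fe_sendint64(): store a 64-bit value big-endian. */
static void
put_int64_be(int64_t value, char *buf)
{
	uint64_t	u = (uint64_t) value;
	int			i;

	for (i = 7; i >= 0; i--)
	{
		buf[i] = (char) (u & 0xFF);
		u >>= 8;
	}
}

/* Fill replybuf with one standby reply and return the number of bytes used. */
static size_t
build_standby_reply(char *replybuf, size_t bufsize,
					int64_t write_lsn, int64_t flush_lsn, int64_t apply_lsn,
					int64_t send_time, int reply_requested, int64_t reply_to)
{
	size_t		len = 0;

	assert(bufsize >= REPLY_MSG_SIZE);

	replybuf[len] = 'r';						/* msgtype */
	len += 1;
	put_int64_be(write_lsn, &replybuf[len]);	/* write */
	len += 8;
	put_int64_be(flush_lsn, &replybuf[len]);	/* flush */
	len += 8;
	put_int64_be(apply_lsn, &replybuf[len]);	/* apply */
	len += 8;
	put_int64_be(send_time, &replybuf[len]);	/* sendTime */
	len += 8;
	replybuf[len] = reply_requested ? 1 : 0;	/* replyRequested */
	len += 1;
	put_int64_be(reply_to, &replybuf[len]);		/* replyTo, -1 if none */
	len += 8;

	return len;
}
```

A caller that has nothing to acknowledge would pass -1 as reply_to, as the
patched sendFeedback() functions do.  The keepalive parsers in the same patch
skip the 8-byte messageNumber that now follows the 'k' message type.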

#7Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#6)
1 attachment(s)
Re: Synchronous replay take III

On Mon, Sep 24, 2018 at 10:39 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Jul 2, 2018 at 12:39 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Mar 3, 2018 at 2:11 AM, Adam Brusselback
<adambrusselback@gmail.com> wrote:

Thanks Thomas, appreciate the rebase and the work you've done on this.
I should have some time to test this out over the weekend.

Rebased.

Rebased.

Rebased.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Synchronous-replay-mode-for-avoiding-stale-reads--v8.patch (application/octet-stream)
From 80507673a86dc1222219cc65cc51047b70d899c5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Synchronous replay mode for avoiding stale reads on hot
 standbys.

While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability.  When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.

Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.

To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02.  It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.

Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.

Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
             Joel Jacobson, Heikki Linnakangas, Michael Paquier,
             Robert Haas, Ants Aasma
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  87 +++
 doc/src/sgml/high-availability.sgml           | 139 ++++-
 doc/src/sgml/monitoring.sgml                  |  12 +
 src/backend/access/transam/xact.c             |   2 +-
 src/backend/catalog/system_views.sql          |   3 +-
 src/backend/postmaster/pgstat.c               |   6 +
 src/backend/replication/logical/worker.c      |   2 +
 src/backend/replication/syncrep.c             | 502 +++++++++++++++---
 src/backend/replication/walreceiver.c         |  82 ++-
 src/backend/replication/walreceiverfuncs.c    |  19 +
 src/backend/replication/walsender.c           | 364 +++++++++++--
 src/backend/utils/errcodes.txt                |   1 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/backend/utils/misc/postgresql.conf.sample |  19 +
 src/backend/utils/time/snapmgr.c              |  13 +
 src/bin/pg_basebackup/pg_recvlogical.c        |   6 +-
 src/bin/pg_basebackup/receivelog.c            |   5 +-
 src/include/catalog/pg_proc.dat               |   6 +-
 src/include/pgstat.h                          |   6 +-
 src/include/replication/syncrep.h             |  16 +-
 src/include/replication/walreceiver.h         |   9 +
 src/include/replication/walsender_private.h   |  20 +
 src/test/regress/expected/rules.out           |   5 +-
 23 files changed, 1213 insertions(+), 154 deletions(-)
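
The documentation added by this patch states that
synchronous_replay_lease_time must be at least 4 times the maximum possible
clock difference between primary and standby, so a 5s lease tolerates up to
1.25s of difference.  The arithmetic can be sketched as below.  Both function
names are hypothetical and do not appear in the patch; this only illustrates
the documented rule and is not the check the server performs.

```c
#include <assert.h>
#include <stdbool.h>

/* Largest clock difference (in ms) the documented rule allows for a lease. */
static int
max_tolerated_clock_skew_ms(int lease_time_ms)
{
	assert(lease_time_ms >= 0);
	return lease_time_ms / 4;
}

/* True if the lease time is at least 4x the worst-case clock difference. */
static bool
lease_time_is_safe(int lease_time_ms, int worst_case_skew_ms)
{
	return worst_case_skew_ms <= max_tolerated_clock_skew_ms(lease_time_ms);
}
```

With the default lease of 5000 ms this allows 1250 ms of clock difference.
A 1000 ms lease would allow only 250 ms, which is why the documentation
advises researching your time synchronization before lowering the setting.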

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7554cba3f96..b7ea981ebe0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3018,6 +3018,36 @@ include_dir 'conf.d'
      across the cluster without problems if that is required.
     </para>
 
+    <sect2 id="runtime-config-replication-all">
+     <title>All Servers</title>
+     <para>
+      These parameters can be set on the primary or any standby.
+     </para>
+     <variablelist>
+      <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+       <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables causal consistency between transactions run on different
+         servers.  A transaction that is run on a standby
+         with <varname>synchronous_replay</varname> set to <literal>on</literal> is
+         guaranteed either to see the effects of all completed transactions
+         run on the primary with the setting on, or to receive an error
+         "standby is not available for synchronous replay".  Note that both
+         transactions involved in a causal dependency (a write on the primary
+         followed by a read on any server which must see the write) must be
+         run with the setting on.  See <xref linkend="synchronous-replay"/> for
+         more details.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>     
+    </sect2>
+
     <sect2 id="runtime-config-replication-sender">
      <title>Sending Servers</title>
 
@@ -3336,6 +3366,63 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><varname>synchronous_replay_max_lag</varname>
+      (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_max_lag</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum replay lag the primary will tolerate from a
+        standby before dropping it from the synchronous replay set.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_lease_time</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the duration of 'leases' sent by the primary server to
+        standbys granting them the right to run synchronous replay queries for
+        a limited time.  This affects the rate at which replacement leases
+        must be sent and the wait time if contact is lost with a standby.
+        This must be set to a value which is at least 4 times the maximum
+        possible difference in system clocks between the primary and standby
+        servers, as described in <xref linkend="synchronous-replay"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous_replay_standby_names">
+      <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>synchronous_replay_standby_names</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies a comma-separated list of standby names that can support
+        <firstterm>synchronous replay</firstterm>, as described in
+        <xref linkend="synchronous-replay"/>.  Follows the same convention
+        as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</literal></link>.
+        The default is <literal>*</literal>, matching all standbys.
+       </para>
+       <para>
+        This setting has no effect if <varname>synchronous_replay_max_lag</varname>
+        is not set.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index ebcb3daaed6..78d6abe3a48 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1158,11 +1158,12 @@ primary_slot_name = 'node_a_slot'
    </para>
 
    <para>
-    Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
-    cause each commit to wait until the current synchronous standbys report
-    that they have replayed the transaction, making it visible to user
-    queries.  In simple cases, this allows for load balancing with causal
-    consistency.
+    Setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> will cause each commit to wait until
+    the current synchronous standbys report that they have replayed the
+    transaction, making it visible to user queries.  In simple cases, this
+    allows for load balancing with causal consistency.  See also
+    <xref linkend="synchronous-replay"/>.
    </para>
 
    <para>
@@ -1360,6 +1361,122 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="synchronous-replay">
+   <title>Synchronous replay</title>
+   <indexterm>
+    <primary>synchronous replay</primary>
+    <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+    The synchronous replay feature allows read-only queries to run on hot
+    standby servers without exposing stale data to the client, providing a
+    form of causal consistency.  Transactions can run on any standby with the
+    following guarantee about the visibility of preceding transactions: If you
+    set <varname>synchronous_replay</varname> to <literal>on</literal> in any
+    pair of consecutive transactions tx1, tx2 where tx2 begins after tx1
+    successfully returns, then tx2 will either see tx1 or fail with a new
+    error "standby is not available for synchronous replay", no matter which
+    server it runs on.  Although the guarantee is expressed in terms of two
+    individual transactions, the GUC can also be set at session, role or
+    system level to make the guarantee generally, allowing for load balancing
+    of applications that were not designed with load balancing in mind.
+   </para>
+
+   <para>
+    In order to enable the
+    feature, <varname>synchronous_replay_max_lag</varname> must be set to a
+    non-zero value on the primary server.  The
+    GUC <varname>synchronous_replay_standby_names</varname> can be used to
+    limit the set of standbys that can join the dynamic set of synchronous
+    replay standbys by providing a comma-separated list of application names.
+    By default, all standbys are candidates, if the feature is enabled.
+   </para>
+
+   <para>
+    The current set of servers that the primary considers to be available for
+    synchronous replay can be seen in
+    the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</literal></link>
+    view.  Administrators, applications and load balancing middleware can use
+    this view to discover standbys that can currently handle synchronous
+    replay transactions without raising the error.  Since that information is
+    only an instantaneous snapshot, clients should still be prepared for
+    the error to be raised at any time, and consider redirecting transactions
+    to another standby.
+   </para>
+
+   <para>
+    The advantages of the synchronous replay feature over simply
+    setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> are:
+    <orderedlist>
+      <listitem>
+       <para>
+        It provides certainty about exactly which standbys can see a
+        transaction.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It places a configurable limit on how much replay lag (and therefore
+        delay at commit time) the primary tolerates from standbys before it
+        drops them from the dynamic set of standbys it waits for.
+       </para>   
+      </listitem>
+      <listitem>
+       <para>
+        It upholds the synchronous replay guarantee during the transitions that
+        occur when standbys are added to or removed from the set of standbys,
+        including scenarios where contact has been lost between the primary
+        and standbys but the standby is still alive and running client
+        queries.
+       </para>
+      </listitem>
+    </orderedlist>
+   </para>
+
+   <para>
+    The protocol used to uphold the guarantee even in the case of network
+    failure depends on the system clocks of the primary and standby servers
+    being synchronized, with an allowance for a difference up to one quarter
+    of <varname>synchronous_replay_lease_time</varname>.  For example,
+    if <varname>synchronous_replay_lease_time</varname> is set
+    to <literal>5s</literal>, then the clocks must not be more than 1.25
+    seconds apart for the guarantee to be upheld reliably during transitions.
+    The ubiquity of the Network Time Protocol (NTP) on modern operating
+    systems and availability of high quality time servers makes it possible to
+    choose a tolerance significantly higher than the maximum expected clock
+    difference.  An effort is nevertheless made to detect and report
+    misconfigured and faulty systems with clock differences greater than the
+    configured tolerance.
+   </para>
+
+   <note>
+    <para>
+     Current hardware clocks, NTP implementations and public time servers are
+     unlikely to allow the system clocks to differ by more than tens or
+     hundreds of milliseconds, and systems synchronized with dedicated local
+     time servers may be considerably more accurate, but you should only
+     consider setting <varname>synchronous_replay_lease_time</varname> below
+     the default of 5 seconds (allowing up to 1.25 seconds of clock difference)
+     after researching your time synchronization infrastructure thoroughly.
+    </para>  
+   </note>
+
+   <note>
+    <para>
+      While similar to synchronous commit in the sense that both involve the
+      primary server waiting for responses from standby servers, the
+      synchronous replay feature is not concerned with avoiding data loss.  A
+      primary configured for synchronous replay will drop all standbys that
+      stop responding or replay too slowly from the dynamic set that it waits
+      for, so you should consider configuring both synchronous replication and
+      synchronous replay if you need data loss avoidance guarantees and causal
+      consistency guarantees for load balancing.
+    </para>
+   </note>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous archiving in standby</title>
 
@@ -1708,7 +1825,17 @@ if (!triggered)
     so there will be a measurable delay between primary and standby. Running the
     same query nearly simultaneously on both primary and standby might therefore
     return differing results. We say that data on the standby is
-    <firstterm>eventually consistent</firstterm> with the primary.  Once the
+    <firstterm>eventually consistent</firstterm> with the primary by default.
+    The data visible to a transaction running on a standby can be
+    made <firstterm>causally consistent</firstterm> with respect to a
+    transaction that has completed on the primary by
+    setting <varname>synchronous_replay</varname> to <literal>on</literal> in
+    both transactions.  For more details,
+    see <xref linkend="synchronous-replay"/>.
+   </para>
+
+   <para>
+    Once the    
     commit record for a transaction is replayed on the standby, the changes
     made by that transaction will be visible to any new snapshots taken on
     the standby.  Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0484cfa77ad..6e98f0eac9a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1912,6 +1912,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        </itemizedlist>
      </entry>
     </row>
+    <row>
+     <entry><structfield>sync_replay</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Synchronous replay state of this standby server.  This field will
+     be non-null only if <varname>synchronous_replay_max_lag</varname> is set.
+     If a standby is in <literal>available</literal> state, then it can
+     currently serve synchronous replay queries.  If it is not replaying fast
+     enough or not responding to keepalive messages, it will be
+     in <literal>unavailable</literal> state, and if it is currently
+     transitioning to availability it will be in <literal>joining</literal>
+     state for a short time.</entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8c1621d949c..5abb5a32f55 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5293,7 +5293,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	 * Check if the caller would like to ask standbys for immediate feedback
 	 * once this commit is applied.
 	 */
-	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
 		xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
 
 	/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a03b005f73e..c387e52b5c6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -734,7 +734,8 @@ CREATE VIEW pg_stat_replication AS
             W.flush_lag,
             W.replay_lag,
             W.sync_priority,
-            W.sync_state
+            W.sync_state,
+            W.sync_replay
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8de603d1933..e41d1237851 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3675,6 +3675,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY:
+			event_name = "SyncReplay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
@@ -3703,6 +3706,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_APPLY_DELAY:
 			event_name = "RecoveryApplyDelay";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+			event_name = "SyncReplayLeaseRevoke";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 277da69fa6c..6bda3569c69 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1195,6 +1195,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						TimestampTz timestamp;
 						bool		reply_requested;
 
+						(void) pq_getmsgint64(&s); /* skip messageNumber */
 						end_lsn = pq_getmsgint64(&s);
 						timestamp = pq_getmsgint64(&s);
 						reply_requested = pq_getmsgbyte(&s);
@@ -1402,6 +1403,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, writepos);	/* apply */
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
+	pq_sendint64(reply_message, -1);		/* replyTo */
 
 	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index af5ad5fe66f..722bcc7c6c4 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 					 XLogRecPtr *flushPtr,
@@ -128,6 +137,229 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Check if we can stop waiting for synchronous replay.  We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1.  All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2.  All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for.  The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+						  int *waitingFor,
+						  long *stallTimeMillis)
+{
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz stallTime = 0;
+	int i;
+
+	/* Count how many joining/available nodes we are waiting for. */
+	*waitingFor = 0;
+
+	for (i = 0; i < max_wal_senders; ++i)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		if (walsnd->pid != 0)
+		{
+			/*
+			 * We need to hold the spinlock to read LSNs, because we can't be
+			 * sure they can be read atomically.
+			 */
+			SpinLockAcquire(&walsnd->mutex);
+			if (walsnd->pid != 0)
+			{
+				switch (walsnd->syncReplayState)
+				{
+				case SYNC_REPLAY_UNAVAILABLE:
+					/* Nothing to wait for. */
+					break;
+				case SYNC_REPLAY_JOINING:
+				case SYNC_REPLAY_AVAILABLE:
+					/*
+					 * We have to wait until this standby tells us that it has
+					 * replayed the commit record.
+					 */
+					if (walsnd->apply < XactCommitLSN)
+						++*waitingFor;
+					break;
+				case SYNC_REPLAY_REVOKING:
+					/*
+					 * We have to hold up commits until this standby
+					 * acknowledges that its lease was revoked, or we know the
+					 * most recently sent lease has expired anyway, whichever
+					 * comes first.  One way or the other, we don't release
+					 * until this standby has started raising an error for
+					 * synchronous replay transactions.
+					 */
+					if (walsnd->revokingUntil > now)
+					{
+						++*waitingFor;
+						stallTime = Max(stallTime, walsnd->revokingUntil);
+					}
+					break;
+				}
+			}
+			SpinLockRelease(&walsnd->mutex);
+		}
+	}
+
+	/*
+	 * If a walsender has exited uncleanly, then it writes its revoking wait
+	 * time into a shared space before it gives up its WalSnd slot.  So we
+	 * have to wait for that too.
+	 */
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	if (WalSndCtl->revokingUntil > now)
+	{
+		long seconds;
+		int usecs;
+
+		/* Compute how long we have to wait, rounded up to nearest ms. */
+		TimestampDifference(now, WalSndCtl->revokingUntil,
+							&seconds, &usecs);
+		*stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+	}
+	else
+		*stallTimeMillis = 0;
+	LWLockRelease(SyncRepLock);
+
+	/* We are done if we are not waiting for any nodes or stalls. */
+	return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all "available" and "joining" standbys to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked.  By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+	long stallTimeMillis;
+	int waitingFor;
+	char *ps_display_buffer = NULL;
+
+	for (;;)
+	{
+		/* Reset latch before checking state. */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Join the queue to be woken up if any synchronous replay
+		 * joining/available standby applies XactCommitLSN or the set of
+		 * synchronous replay standbys changes (if we aren't already in the
+		 * queue).  We don't actually know if we need to wait for any peers to
+		 * reach the target LSN yet, but we have to register just in case
+		 * before checking the walsenders' state to avoid a race condition
+		 * that could occur if we did it after calling
+		 * SyncReplayCommitCanReturn.  (SyncRepWaitForLSN doesn't have
+		 * to do this because it can check the highest-seen LSN in
+		 * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+		 * lock as the queues.  We can't do that here, because there is no
+		 * single highest-seen LSN that is useful.  We must check
+		 * walsnd->apply for all relevant walsenders.  Therefore we must
+		 * register for notifications first, so that we can be notified via
+		 * our latch of any standby applying the LSN we're interested in after
+		 * we check but before we start waiting, or we could wait forever for
+		 * something that has already happened.)
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		if (MyProc->syncRepState != SYNC_REP_WAITING)
+		{
+			MyProc->waitLSN = XactCommitLSN;
+			MyProc->syncRepState = SYNC_REP_WAITING;
+			SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+			Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+		}
+		LWLockRelease(SyncRepLock);
+
+		/* Check if we're done. */
+		if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+									  &stallTimeMillis))
+		{
+			SyncRepCancelWait();
+			break;
+		}
+
+		Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+		/* If we aren't actually waiting for any standbys, leave the queue. */
+		if (waitingFor == 0)
+			SyncRepCancelWait();
+
+		/* Update the ps title. */
+		if (update_process_title)
+		{
+			char buffer[80];
+
+			/* Remember the old value if this is our first update. */
+			if (ps_display_buffer == NULL)
+			{
+				int len;
+				const char *ps_display = get_ps_display(&len);
+
+				ps_display_buffer = palloc(len + 1);
+				memcpy(ps_display_buffer, ps_display, len);
+				ps_display_buffer[len] = '\0';
+			}
+
+			snprintf(buffer, sizeof(buffer),
+					 "waiting for %d peer(s) to apply %X/%X%s",
+					 waitingFor,
+					 (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+					 stallTimeMillis > 0 ? " (revoking)" : "");
+			set_ps_display(buffer, false);
+		}
+
+		/* Check if we need to exit early due to postmaster death etc. */
+		if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+			break;
+
+		/*
+		 * If we are still waiting for peers, then we wait for any joining or
+		 * available peer to reach the LSN (or possibly stop being in one of
+		 * those states or go away).
+		 *
+		 * If not, there must be a non-zero stall time, so we wait for that to
+		 * elapse.
+		 */
+		if (waitingFor > 0)
+			WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+					  WAIT_EVENT_SYNC_REPLAY);
+		else
+			WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+					  stallTimeMillis,
+					  WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+	}
+
+	/* There is no way out of the loop that could leave us in the queue. */
+	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	MyProc->waitLSN = 0;
+
+	/* Restore the ps display. */
+	if (ps_display_buffer != NULL)
+	{
+		set_ps_display(ps_display_buffer, false);
+		pfree(ps_display_buffer);
+	}
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -149,11 +381,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
-	/* Cap the level for anything other than commit to remote flush only. */
-	if (commit)
-		mode = SyncRepWaitMode;
-	else
-		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+	/* Wait for synchronous replay, if configured. */
+	if (synchronous_replay)
+		SyncReplayWaitForLSN(lsn);
 
 	/*
 	 * Fast exit if user has not requested sync replication.
@@ -167,6 +397,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
+	/* Cap the level for anything other than commit to remote flush only. */
+	if (commit)
+		mode = SyncRepWaitMode;
+	else
+		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 	 * set.  See SyncRepUpdateSyncStandbysDefined.
@@ -227,57 +463,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
 			break;
 
-		/*
-		 * If a wait for synchronous replication is pending, we can neither
-		 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
-		 * lead the client to believe that the transaction aborted, which is
-		 * not true: it's already committed locally. The former is no good
-		 * either: the client has requested synchronous replication, and is
-		 * entitled to assume that an acknowledged commit is also replicated,
-		 * which might not be true. So in this case we issue a WARNING (which
-		 * some clients may be able to interpret) and shut off further output.
-		 * We do NOT reset ProcDiePending, so that the process will die after
-		 * the commit is cleaned up.
-		 */
-		if (ProcDiePending)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
+		/* Check if we need to break early due to cancel/shutdown/death. */
+		if (SyncRepCheckForEarlyExit())
 			break;
-		}
-
-		/*
-		 * It's unclear what to do if a query cancel interrupt arrives.  We
-		 * can't actually abort at this point, but ignoring the interrupt
-		 * altogether is not helpful, so we just terminate the wait with a
-		 * suitable warning.
-		 */
-		if (QueryCancelPending)
-		{
-			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			SyncRepCancelWait();
-			break;
-		}
-
-		/*
-		 * If the postmaster dies, we'll probably never get an
-		 * acknowledgment, because all the wal sender processes will exit. So
-		 * just bail out.
-		 */
-		if (!PostmasterIsAlive())
-		{
-			ProcDiePending = true;
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
-			break;
-		}
 
 		/*
 		 * Wait on latch.  Any condition that should wake us up will set the
@@ -399,15 +587,66 @@ SyncRepInitConfig(void)
 	}
 }
 
+/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+	char *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		found = false;
+
+	/* If the feature is disabled, then no. */
+	if (synchronous_replay_max_lag == 0)
+		return false;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(synchronous_replay_standby_names);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		list_free(elemlist);
+		/* GUC machinery will have already complained - no need to do again */
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *standby_name = (char *) lfirst(l);
+
+		if (pg_strcasecmp(standby_name, application_name) == 0 ||
+			pg_strcasecmp(standby_name, "*") == 0)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return found;
+}
+
 /*
  * Update the LSNs on each queue based upon our latest state. This
  * implements a simple policy of first-valid-sync-standby-releases-waiter.
  *
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
  * Other policies are possible, which would change what we do here and
  * perhaps also which information we store as well.
  */
 void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	XLogRecPtr	writePtr;
@@ -421,13 +660,15 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
-	 * potential sync standbys then we have nothing to do. If we are still
-	 * starting up, still running base backup or the current flush position is
-	 * still invalid, then leave quickly also.
+	 * potential sync standbys and not in a state that synchronous_replay waits
+	 * for, then we have nothing to do. If we are still starting up, still
+	 * running base backup or the current flush position is still invalid,
+	 * then leave quickly also.
 	 */
-	if (MyWalSnd->sync_standby_priority == 0 ||
-		MyWalSnd->state < WALSNDSTATE_STREAMING ||
-		XLogRecPtrIsInvalid(MyWalSnd->flush))
+	if (!am_syncreplay_blocker &&
+		(MyWalSnd->sync_standby_priority == 0 ||
+		 MyWalSnd->state < WALSNDSTATE_STREAMING ||
+		 XLogRecPtrIsInvalid(MyWalSnd->flush)))
 	{
 		announce_next_takeover = true;
 		return;
@@ -465,9 +706,10 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
+	 * managing a sync standby or a standby in synchronous replay state that
+	 * blocks then just leave.
 	 */
-	if (!got_recptr || !am_sync)
+	if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -476,24 +718,36 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
-	 * this location.
+	 * this location, for backends waiting for synchronous commit.
 	 */
-	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+	if (got_recptr && am_sync)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+			numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+			numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+			numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+		}
 	}
 
+	/*
+	 * Wake backends that are waiting for synchronous_replay, if this walsender
+	 * manages a standby that is in synchronous replay 'available' or 'joining'
+	 * state.
+	 */
+	if (am_syncreplay_blocker)
+		SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+						 MyWalSnd->apply);
+
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -991,9 +1245,8 @@ SyncRepGetStandbyPriority(void)
  * Must hold SyncRepLock.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
 {
-	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
 	PGPROC	   *thisproc = NULL;
 	int			numprocs = 0;
@@ -1010,7 +1263,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Assume the queue is ordered by LSN
 		 */
-		if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+		if (!all && lsn < proc->waitLSN)
 			return numprocs;
 
 		/*
@@ -1077,7 +1330,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
 		}
 
 		/*
@@ -1128,6 +1381,64 @@ SyncRepQueueIsOrderedByLSN(int mode)
 }
 #endif
 
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+	/*
+	 * If a wait for synchronous replication is pending, we can neither
+	 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
+	 * lead the client to believe that the transaction aborted, which is
+	 * not true: it's already committed locally. The former is no good
+	 * either: the client has requested synchronous replication, and is
+	 * entitled to assume that an acknowledged commit is also replicated,
+	 * which might not be true. So in this case we issue a WARNING (which
+	 * some clients may be able to interpret) and shut off further output.
+	 * We do NOT reset ProcDiePending, so that the process will die after
+	 * the commit is cleaned up.
+	 */
+	if (ProcDiePending)
+	{
+		ereport(WARNING,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * It's unclear what to do if a query cancel interrupt arrives.  We
+	 * can't actually abort at this point, but ignoring the interrupt
+	 * altogether is not helpful, so we just terminate the wait with a
+	 * suitable warning.
+	 */
+	if (QueryCancelPending)
+	{
+		QueryCancelPending = false;
+		ereport(WARNING,
+				(errmsg("canceling wait for synchronous replication due to user request"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * If the postmaster dies, we'll probably never get an
+	 * acknowledgment, because all the wal sender processes will exit. So
+	 * just bail out.
+	 */
+	if (!PostmasterIsAlive())
+	{
+		ProcDiePending = true;
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ===========================================================
  * Synchronous Replication functions executed by any process
@@ -1197,6 +1508,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
 	SyncRepConfig = (SyncRepConfigData *) extra;
 }
 
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return true;
+}
+
 void
 assign_synchronous_commit(int newval, void *extra)
 {
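Note on the syncrep.c changes above: the decision made by
SyncReplayCommitCanReturn() can be modelled outside the server.  What
follows is only a sketch, not code from the patch.  The Model* names are
hypothetical stand-ins for WalSnd, its syncReplayState, and
WalSndCtl->revokingUntil; timestamps are plain microsecond integers; and
all locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's states; not the real definitions. */
typedef enum
{
	MODEL_UNAVAILABLE,
	MODEL_JOINING,
	MODEL_AVAILABLE,
	MODEL_REVOKING
} ModelState;

typedef struct
{
	bool		in_use;			/* models walsnd->pid != 0 */
	ModelState	state;			/* models walsnd->syncReplayState */
	uint64_t	apply_lsn;		/* models walsnd->apply */
	int64_t		revoking_until; /* models walsnd->revokingUntil (usec) */
} ModelStandby;

/*
 * Returns true if a commit at commit_lsn may return control.
 * shared_revoking_until models WalSndCtl->revokingUntil (usec).
 */
static bool
model_commit_can_return(const ModelStandby *standbys, int n,
						uint64_t commit_lsn, int64_t now,
						int64_t shared_revoking_until,
						int *waiting_for, long *stall_ms)
{
	int			i;

	assert(n >= 0);
	*waiting_for = 0;

	for (i = 0; i < n; i++)
	{
		const ModelStandby *s = &standbys[i];

		if (!s->in_use)
			continue;

		switch (s->state)
		{
			case MODEL_UNAVAILABLE:
				/* Nothing to wait for. */
				break;
			case MODEL_JOINING:
			case MODEL_AVAILABLE:
				/* Must replay the commit record first. */
				if (s->apply_lsn < commit_lsn)
					++*waiting_for;
				break;
			case MODEL_REVOKING:
				/* Must acknowledge the revocation or let the lease expire. */
				if (s->revoking_until > now)
					++*waiting_for;
				break;
		}
	}

	/* Stall left over from an unclean walsender exit, rounded up to ms. */
	if (shared_revoking_until > now)
		*stall_ms = (long) ((shared_revoking_until - now + 999) / 1000);
	else
		*stall_ms = 0;

	return *waiting_for == 0 && *stall_ms == 0;
}
```

In this model a standby that is behind the commit LSN and a standby that
is mid-revocation each count as a node to wait for, while a pending
shared deadline alone keeps the commit stalled for the remaining time.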
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 6f4b3538ac4..291144ee600 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -58,6 +58,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -140,9 +141,10 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
 static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+								  TimestampTz *syncReplayLease);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -480,7 +482,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, -1);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -525,7 +527,7 @@ WalReceiverMain(void)
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, -1);
 					}
 				}
 				if (rc & WL_POSTMASTER_DEATH)
@@ -583,7 +585,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, -1);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -882,6 +884,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 	XLogRecPtr	walEnd;
 	TimestampTz sendTime;
 	bool		replyRequested;
+	TimestampTz syncReplayLease;
+	int64		messageNumber;
 
 	resetStringInfo(&incoming_message);
 
@@ -901,7 +905,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				dataStart = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, NULL);
 
 				buf += hdrlen;
 				len -= hdrlen;
@@ -911,7 +915,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 		case 'k':				/* Keepalive */
 			{
 				/* copy message to StringInfo */
-				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+					sizeof(char) + sizeof(int64);
 				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -919,15 +924,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
 
 				/* read the fields */
+				messageNumber = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
 				replyRequested = pq_getmsgbyte(&incoming_message);
+				syncReplayLease = pq_getmsgint64(&incoming_message);
 
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, messageNumber);
 				break;
 			}
 		default:
@@ -1090,7 +1097,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, -1);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1108,9 +1115,12 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbearts, when approaching
  * wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should be that message's number, otherwise -1.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
@@ -1157,6 +1167,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+	pq_sendint64(&reply_message, replyTo);
 
 	/* Send it */
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1289,15 +1300,56 @@ XLogWalRcvSendHSFeedback(bool immed)
  * Update shared memory status upon receiving a message from primary.
  *
  * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary.  'syncReplayLease' points to the time until
+ * which the primary promises this standby can safely claim to be causally
+ * consistent, points to 0 if it cannot, or is NULL for no change.
  */
 static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+					  TimestampTz *syncReplayLease)
 {
 	WalRcvData *walrcv = WalRcv;
 
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
 
+	/* Sanity check for the syncReplayLease time. */
+	if (syncReplayLease != NULL && *syncReplayLease != 0)
+	{
+		/*
+		 * Deduce max_clock_skew from the syncReplayLease and sendTime since
+		 * we don't have access to the primary's GUC.  The primary already
+		 * subtracted 25% from synchronous_replay_lease_time to represent
+		 * max_clock_skew, so we have 75%.  A third of that will give us 25%.
+		 */
+		int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+		int64 max_clock_skew = diffMillis / 3;
+		if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+												   max_clock_skew))
+		{
+			/*
+			 * The primary's clock is more than max_clock_skew + network
+			 * latency ahead of the standby's clock.  (If the primary's clock
+			 * is more than max_clock_skew ahead of the standby's clock, but
+			 * by less than the network latency, then there isn't much we can
+			 * do to detect that; but it still seems useful to have this basic
+			 * sanity check for wildly misconfigured servers.)
+			 */
+			ereport(LOG,
+					(errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+					 errhint("Check your servers' NTP configuration or equivalent.")));
+
+			syncReplayLease = NULL;
+		}
+		/*
+		 * We could also try to detect cases where sendTime is more than
+		 * max_clock_skew in the past according to the standby's clock, but
+		 * that is indistinguishable from network latency/buffering, so we
+		 * could produce misleading error messages; if we do nothing, the
+		 * consequence is 'standby is not available for synchronous replay'
+		 * errors which should cause the user to investigate.
+		 */
+	}
+
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
@@ -1305,6 +1357,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+	if (syncReplayLease != NULL)
+		walrcv->syncReplayLease = *syncReplayLease;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (log_min_messages <= DEBUG2)
@@ -1342,7 +1396,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
  */
 void
 WalRcvForceReply(void)
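Note on the walreceiver.c changes above: the sanity check in
ProcessWalSndrMessage() rests on a small piece of arithmetic that is
easy to try in isolation.  The sketch below assumes, as the patch
comment states, that the lease is sendTime plus 75% of
synchronous_replay_lease_time.  The function names are hypothetical and
timestamps are plain microsecond integers rather than TimestampTz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Deduce the primary's clock skew allowance in milliseconds.  The lease
 * interval is 75% of the lease time, so a third of it is the 25% that was
 * reserved for clock skew.
 */
static int64_t
model_deduce_max_clock_skew_ms(int64_t lease, int64_t send_time)
{
	int64_t		diff_ms;

	assert(lease != 0);			/* 0 means "no lease", nothing to deduce */
	diff_ms = (lease - send_time) / 1000;
	return diff_ms / 3;
}

/*
 * True if the primary's send time is further ahead of our receipt time
 * than the deduced skew allowance permits.
 */
static bool
model_primary_clock_too_far_ahead(int64_t lease, int64_t send_time,
								  int64_t receipt_time)
{
	int64_t		skew_ms = model_deduce_max_clock_skew_ms(lease, send_time);

	return send_time > receipt_time + skew_ms * 1000;
}
```

For instance, a lease that ends 3 s after sendTime yields an allowance
of 1 s, so a message stamped 0.5 s ahead of the standby's clock passes
while one stamped 1.1 s ahead is rejected.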
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 67b1a074cce..600f974668c 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
 #include "replication/walreceiver.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/guc.h"
 #include "utils/timestamp.h"
 
 WalRcvData *WalRcv = NULL;
@@ -376,3 +377,21 @@ GetReplicationTransferLatency(void)
 
 	return ms;
 }
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+	WalRcvData *walrcv = WalRcv;
+	TimestampTz now = GetCurrentTimestamp();
+	bool result;
+
+	SpinLockAcquire(&walrcv->mutex);
+	result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+	SpinLockRelease(&walrcv->mutex);
+
+	return result;
+}
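Note on the walreceiverfuncs.c change above: the standby-side lease
handling amounts to "store the newest lease, then compare it with the
local clock".  A minimal sketch follows, assuming microsecond integer
timestamps and a hypothetical ModelWalRcv struct in place of WalRcvData;
the spinlock is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the lease field kept in WalRcvData. */
typedef struct
{
	int64_t		lease;			/* 0 means no lease is held */
} ModelWalRcv;

/* A NULL new_lease means "no change"; a pointer to 0 revokes the lease. */
static void
model_update_lease(ModelWalRcv *rcv, const int64_t *new_lease)
{
	assert(rcv != NULL);
	if (new_lease != NULL)
		rcv->lease = *new_lease;
}

/* The standby may serve synchronous replay reads only while the lease holds. */
static bool
model_sync_replay_available(const ModelWalRcv *rcv, int64_t now)
{
	return rcv->lease != 0 && now <= rcv->lease;
}
```

The lease is inclusive of its end time in this model, matching the
"now <= lease" comparison in WalRcvSyncReplayAvailable().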
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2683385ca6e..376d1a0a93e 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -173,6 +173,18 @@ static TimestampTz last_reply_timestamp = 0;
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
 
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
 /*
  * While streaming WAL in Copy mode, streamingDoneSending is set to true
  * after we have sent CopyDone. We should not send any more CopyData messages
@@ -244,7 +256,7 @@ static void ProcessStandbyMessage(void);
 static void ProcessStandbyReplyMessage(void);
 static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(void);
 static void WalSndCheckTimeOut(void);
 static long WalSndComputeSleeptime(TimestampTz now);
@@ -287,6 +299,61 @@ InitWalSender(void)
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
 }
 
+/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * We've lost contact with the standby, but it may still be alive.  We
+		 * can't let any committing synchronous_replay transactions return
+		 * control until we've stalled for long enough for a zombie standby to
+		 * start raising errors because its lease has expired.  Because our
+		 * WalSnd slot is going away, we need to use the shared
+		 * WalSndCtl->revokingUntil variable.
+		 */
+		elog(LOG,
+			 "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+			 application_name);
+
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+									   synchronous_replay_last_lease);
+		LWLockRelease(SyncRepLock);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * The standby is shutting down, so it won't be running any more
+		 * transactions.  It is therefore safe to stop waiting for it without
+		 * any kind of lease revocation protocol.
+		 */
+		elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
 /*
  * Clean up after an error.
  *
@@ -315,7 +382,10 @@ WalSndErrorCleanup(void)
 	replication_active = false;
 
 	if (got_STOPPING || got_SIGUSR2)
+	{
+		PrepareUncleanExit();
 		proc_exit(0);
+	}
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
@@ -327,6 +397,8 @@ WalSndErrorCleanup(void)
 static void
 WalSndShutdown(void)
 {
+	PrepareUncleanExit();
+
 	/*
 	 * Reset whereToSendOutput to prevent ereport from attempting to send any
 	 * more messages to the standby.
@@ -1614,6 +1686,7 @@ ProcessRepliesIfAny(void)
 		if (r < 0)
 		{
 			/* unexpected error or EOF */
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1630,6 +1703,7 @@ ProcessRepliesIfAny(void)
 		resetStringInfo(&reply_message);
 		if (pq_getmessage(&reply_message, 0))
 		{
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1679,6 +1753,7 @@ ProcessRepliesIfAny(void)
 				 * 'X' means that the standby is closing down the socket.
 				 */
 			case 'X':
+				PrepareCleanExit();
 				proc_exit(0);
 
 			default:
@@ -1776,9 +1851,11 @@ ProcessStandbyReplyMessage(void)
 				flushLag,
 				applyLag;
 	bool		clearLagTimes;
+	int64		replyTo;
 	TimestampTz now;
 
 	static bool fullyAppliedLastTime = false;
+	static TimestampTz fullyAppliedSince = 0;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1786,6 +1863,7 @@ ProcessStandbyReplyMessage(void)
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);	/* sendTime; not used ATM */
 	replyRequested = pq_getmsgbyte(&reply_message);
+	replyTo = pq_getmsgint64(&reply_message);
 
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1800,17 +1878,17 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL for at least
+	 * wal_receiver_status_interval, then let's clear the lag times that were
+	 * measured when it last wrote/flushed/applied a WAL record.  This way we
+	 * avoid displaying stale lag data until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
 	if (applyPtr == sentPtr)
 	{
-		if (fullyAppliedLastTime)
+		if (!fullyAppliedLastTime)
+			fullyAppliedSince = now;
+		else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
 			clearLagTimes = true;
 		fullyAppliedLastTime = true;
 	}
@@ -1826,8 +1904,53 @@ ProcessStandbyReplyMessage(void)
 	 * standby.
 	 */
 	{
+		int			next_sr_state = -1;
 		WalSnd	   *walsnd = MyWalSnd;
 
+		/* Handle synchronous replay state machine. */
+		if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+		{
+			bool replay_lag_acceptable;
+
+			/* Check if the lag is acceptable (includes -1 for caught up). */
+			if (applyLag < synchronous_replay_max_lag * 1000)
+				replay_lag_acceptable = true;
+			else
+				replay_lag_acceptable = false;
+
+			/* Figure out if the state needs to change. */
+			switch (walsnd->syncReplayState)
+			{
+			case SYNC_REPLAY_UNAVAILABLE:
+				/* Can we join? */
+				if (replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_JOINING;
+				break;
+			case SYNC_REPLAY_JOINING:
+				/* Are we still applying fast enough? */
+				if (replay_lag_acceptable)
+				{
+					/* Have we reached the join point yet? */
+					if (applyPtr >= synchronous_replay_joining_until)
+						next_sr_state = SYNC_REPLAY_AVAILABLE;
+				}
+				else
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			case SYNC_REPLAY_AVAILABLE:
+				/* Are we still applying fast enough? */
+				if (!replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_REVOKING;
+				break;
+			case SYNC_REPLAY_REVOKING:
+				/* Has the revocation been acknowledged or timed out? */
+				if (replyTo == synchronous_replay_revoke_msgno ||
+					now >= walsnd->revokingUntil)
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			}
+		}
+
 		SpinLockAcquire(&walsnd->mutex);
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
@@ -1838,11 +1961,55 @@ ProcessStandbyReplyMessage(void)
 			walsnd->flushLag = flushLag;
 		if (applyLag != -1 || clearLagTimes)
 			walsnd->applyLag = applyLag;
+		if (next_sr_state != -1)
+			walsnd->syncReplayState = next_sr_state;
+		if (next_sr_state == SYNC_REPLAY_REVOKING)
+			walsnd->revokingUntil = synchronous_replay_last_lease;
 		SpinLockRelease(&walsnd->mutex);
+
+		/*
+		 * Post shmem-update actions for synchronous replay state transitions.
+		 */
+		switch (next_sr_state)
+		{
+		case SYNC_REPLAY_JOINING:
+			/*
+			 * Now that we've started waiting for this standby, we need to
+			 * make sure that everything flushed before now has been applied
+			 * before we move to available and issue a lease.
+			 */
+			synchronous_replay_joining_until = GetFlushRecPtr();
+			ereport(LOG,
+					(errmsg("standby \"%s\" joining synchronous replay set...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_AVAILABLE:
+			/* Issue a new lease to the standby. */
+			WalSndKeepalive(false);
+			ereport(LOG,
+					(errmsg("standby \"%s\" is available for synchronous replay",
+							application_name)));
+			break;
+		case SYNC_REPLAY_REVOKING:
+			/* Revoke the standby's lease, and note the message number. */
+			synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+			ereport(LOG,
+					(errmsg("revoking synchronous replay lease for standby \"%s\"...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_UNAVAILABLE:
+			ereport(LOG,
+					(errmsg("standby \"%s\" is no longer available for synchronous replay",
+							application_name)));
+			break;
+		default:
+			/* No change. */
+			break;
+		}
 	}
 
 	if (!am_cascading_walsender)
-		SyncRepReleaseWaiters();
+		SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
 
 	/*
 	 * Advance our local xmin horizon when the client confirmed a flush.
@@ -2032,33 +2199,52 @@ ProcessStandbyHSFeedbackMessage(void)
  * If wal_sender_timeout is enabled we want to wake up in time to send
  * keepalives and to abort the connection if wal_sender_timeout has been
  * reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
  */
 static long
 WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;	/* 10 s */
 
-	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+	if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+		am_potential_synchronous_replay_standby)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
 		int			microsec_to_timeout;
 
-		/*
-		 * At the latest stop sleeping once wal_sender_timeout has been
-		 * reached.
-		 */
-		wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-												  wal_sender_timeout);
-
-		/*
-		 * If no ping has been sent yet, wakeup when it's time to do so.
-		 * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
-		 * the timeout passed without a response.
-		 */
-		if (!waiting_for_ping_response)
+		if (am_potential_synchronous_replay_standby)
+		{
+			/*
+			 * We need to keep replacing leases before they expire.  We'll do
+			 * that halfway through the lease time according to our clock, to
+			 * allow for the standby's clock to be ahead of the primary's by
+			 * 25% of synchronous_replay_lease_time.
+			 */
+			wakeup_time =
+				TimestampTzPlusMilliseconds(last_reply_timestamp,
+											synchronous_replay_lease_time / 2);
+		}
+		else
+		{
+			/*
+			 * At the latest stop sleeping once wal_sender_timeout has been
+			 * reached.
+			 */
 			wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-													  wal_sender_timeout / 2);
+													  wal_sender_timeout);
+
+			/*
+			 * If no ping has been sent yet, wakeup when it's time to do so.
+			 * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+			 * half of the timeout passed without a response.
+			 */
+			if (!waiting_for_ping_response)
+				wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+														  wal_sender_timeout / 2);
+		}
 
 		/* Compute relative time until wakeup. */
 		TimestampDifference(now, wakeup_time,
@@ -2082,20 +2268,33 @@ WalSndComputeSleeptime(TimestampTz now)
  * message every standby_message_timeout = wal_sender_timeout/6 = 10s.  We
  * could eliminate that problem by recognizing timeout expiration at
  * wal_sender_timeout/2 after the keepalive.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
  */
 static void
 WalSndCheckTimeOut(void)
 {
 	TimestampTz timeout;
+	int allowed_time;
 
 	/* don't bail out if we're doing something that doesn't require timeouts */
 	if (last_reply_timestamp <= 0)
 		return;
 
-	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
-										  wal_sender_timeout);
+	/*
+	 * If synchronous replay support is configured, we use
+	 * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+	 * the time before an unresponsive synchronous replay standby is dropped.
+	 */
+	if (am_potential_synchronous_replay_standby)
+		allowed_time = synchronous_replay_lease_time;
+	else
+		allowed_time = wal_sender_timeout;
 
-	if (wal_sender_timeout > 0 && last_processing >= timeout)
+	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+										  allowed_time);
+	if (allowed_time > 0 && last_processing >= timeout)
 	{
 		/*
 		 * Since typically expiration of replication timeout means
@@ -2120,6 +2319,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
+	/* Check if we are managing a potential synchronous replay standby. */
+	am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
@@ -2285,6 +2487,7 @@ InitWalSenderSlot(void)
 			walsnd->flushLag = -1;
 			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
+			walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
@@ -3185,6 +3388,27 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+	switch (state)
+	{
+	case SYNC_REPLAY_UNAVAILABLE:
+		return "unavailable";
+	case SYNC_REPLAY_JOINING:
+		return "joining";
+	case SYNC_REPLAY_AVAILABLE:
+		return "available";
+	case SYNC_REPLAY_REVOKING:
+		return "revoking";
+	}
+	return "UNKNOWN";
+}
+
 static Interval *
 offset_to_interval(TimeOffset offset)
 {
@@ -3204,7 +3428,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	11
+#define PG_STAT_GET_WAL_SENDERS_COLS	12
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3258,6 +3482,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			priority;
 		int			pid;
 		WalSndState state;
+		SyncReplayState syncReplayState;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3270,6 +3495,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		pid = walsnd->pid;
 		sentPtr = walsnd->sentPtr;
 		state = walsnd->state;
+		syncReplayState = walsnd->syncReplayState;
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
@@ -3353,6 +3579,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[10] = CStringGetTextDatum("potential");
+
+			values[11] =
+				CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3368,21 +3597,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
   * This function is used to send a keepalive message to standby.
   * If requestReply is set, sets a flag in the message requesting the standby
   * to send a message back to us, for heartbeat purposes.
+  * Return the serial number of the message that was sent.
   */
-static void
+static int64
 WalSndKeepalive(bool requestReply)
 {
+	TimestampTz synchronous_replay_lease;
+	TimestampTz now;
+
+	static int64 message_number = 0;
+
 	elog(DEBUG2, "sending replication keepalive");
 
+	/* Grant a synchronous replay lease if appropriate. */
+	now = GetCurrentTimestamp();
+	if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+	{
+		/* No lease granted, and any earlier lease is revoked. */
+		synchronous_replay_lease = 0;
+	}
+	else
+	{
+		/*
+		 * Since this timestamp is being sent to the standby where it will be
+		 * compared against a time generated by the standby's system clock, we
+		 * must consider clock skew.  We use 25% of the lease time as max
+		 * clock skew, and we subtract that from the time we send with the
+		 * following reasoning:
+		 *
+		 * 1.  If the standby's clock is slow (ie behind the primary's) by up
+		 * to that much, then subtracting this amount will make sure the
+		 * lease doesn't survive past that time according to the primary's
+		 * clock.
+		 *
+		 * 2.  If the standby's clock is fast (ie ahead of the primary's) by
+		 * up to that much, then by subtracting this amount there won't be any
+		 * gaps between leases, since leases are reissued every time 50% of
+		 * the lease time elapses (see WalSndKeepaliveIfNecessary and
+		 * WalSndComputeSleeptime).
+		 */
+		int max_clock_skew = synchronous_replay_lease_time / 4;
+
+		/* Compute and remember the expiry time of the lease we're granting. */
+		synchronous_replay_last_lease =
+			TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+		/* Adjust the version we send for clock skew. */
+		synchronous_replay_lease =
+			TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+										-max_clock_skew);
+	}
+
 	/* construct the message... */
 	resetStringInfo(&output_message);
 	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, ++message_number);
 	pq_sendint64(&output_message, sentPtr);
-	pq_sendint64(&output_message, GetCurrentTimestamp());
+	pq_sendint64(&output_message, now);
 	pq_sendbyte(&output_message, requestReply ? 1 : 0);
+	pq_sendint64(&output_message, synchronous_replay_lease);
 
 	/* ... and send it wrapped in CopyData */
 	pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+	return message_number;
 }
 
 /*
@@ -3397,19 +3674,30 @@ WalSndKeepaliveIfNecessary(void)
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
-	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
-		return;
-
-	if (waiting_for_ping_response)
-		return;
+	if (!am_potential_synchronous_replay_standby)
+	{
+		if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+			return;
+		if (waiting_for_ping_response)
+			return;
+	}
 
 	/*
 	 * If half of wal_sender_timeout has lapsed without receiving any reply
 	 * from the standby, send a keep-alive message to the standby requesting
 	 * an immediate reply.
+	 *
+	 * If synchronous replay has been configured, use
+	 * synchronous_replay_lease_time to control keepalive intervals rather
+	 * than wal_sender_timeout, so that we can keep replacing leases at the
+	 * right frequency.
 	 */
-	ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-											wal_sender_timeout / 2);
+	if (am_potential_synchronous_replay_standby)
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												synchronous_replay_lease_time / 2);
+	else
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												wal_sender_timeout / 2);
 	if (last_processing >= ping_time)
 	{
 		WalSndKeepalive(true);
@@ -3453,7 +3741,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 */
 	new_write_head = (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
 	buffer_full = false;
-	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
 	{
 		if (new_write_head == lag_tracker->read_heads[i])
 			buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 788f88129bd..bf96ebc825c 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -308,6 +308,7 @@ Section: Class 40 - Transaction Rollback
 40001    E    ERRCODE_T_R_SERIALIZATION_FAILURE                              serialization_failure
 40003    E    ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN                       statement_completion_unknown
 40P01    E    ERRCODE_T_R_DEADLOCK_DETECTED                                  deadlock_detected
+40P02    E    ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE                   synchronous_replay_not_available
 
 Section: Class 42 - Syntax Error or Access Rule Violation
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2317e8be6be..a544f98eee3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1724,6 +1724,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+		 gettext_noop("Enables synchronous replay."),
+		 NULL
+		},
+		&synchronous_replay,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
@@ -3065,6 +3075,28 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_max_lag,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_lease_time,
+		5000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3790,6 +3822,17 @@ static struct config_string ConfigureNamesString[] =
 		check_synchronous_standby_names, assign_synchronous_standby_names, NULL
 	},
 
+	{
+		{"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("List of names of potential synchronous replay standbys."),
+			NULL,
+			GUC_LIST_INPUT
+		},
+		&synchronous_replay_standby_names,
+		"*",
+		check_synchronous_replay_standby_names, NULL, NULL
+	},
+
 	{
 		{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
 			gettext_noop("Sets default text search configuration."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521f..21c25b55ed2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -255,6 +255,17 @@
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
+#synchronous_replay_max_lag = 0s	# maximum replication delay to tolerate from
+					# standbys before dropping them from the synchronous
+					# replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s		# how long individual leases granted to
+					# synchronous replay standbys should last; should be 4 times
+					# the max possible clock skew
+
+#synchronous_replay_standby_names = '*'	# standby servers that can join the
+					# synchronous replay set; '*' = all
+
 # - Standby Servers -
 
 # These settings are ignored on a master server.
@@ -285,6 +296,14 @@
 					# (change requires restart)
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
 
+# - All Servers -
+
+#synchronous_replay = off			# "on" in any pair of consecutive
+					# transactions guarantees that the second
+					# can see the first (even if the second
+					# is run on a standby), or will raise an
+					# error to report that the standby is
+					# unavailable for synchronous replay
 
 #------------------------------------------------------------------------------
 # QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29d..944cc7d4949 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
 #include "catalog/catalog.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -331,6 +333,17 @@ GetTransactionSnapshot(void)
 			elog(ERROR,
 				 "cannot take query snapshot during a parallel operation");
 
+		/*
+		 * In synchronous_replay mode on a standby, check if we have definitely
+		 * applied WAL for any COMMIT that returned successfully on the
+		 * primary.
+		 */
+		if (synchronous_replay && RecoveryInProgress() &&
+			!WalRcvSyncReplayAvailable())
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+					 errmsg("standby is not available for synchronous replay")));
+
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
 		 * end of xact regardless of what the caller does with it, so we must
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index a242e0be88b..2101243d155 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -118,7 +118,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	/*
@@ -151,6 +151,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	startpos = output_written_lsn;
 	last_written_lsn = output_written_lsn;
@@ -470,6 +472,8 @@ StreamLogicalLog(void)
 			 * rest.
 			 */
 			pos = 1;			/* skip msgtype 'k' */
+			pos += 8;			/* skip messageNumber */
+
 			walEnd = fe_recvint64(&copybuf[pos]);
 			output_written_lsn = Max(walEnd, output_written_lsn);
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 10768786301..a801224ad94 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -328,7 +328,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -346,6 +346,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
 	{
@@ -1016,6 +1018,7 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 	 * check if the server requested a reply, and ignore the rest.
 	 */
 	pos = 1;					/* skip msgtype 'k' */
+	pos += 8;					/* skip messageNumber */
 	pos += 8;					/* skip walEnd */
 	pos += 8;					/* skip sendTime */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index cff58ed2d89..35114907b1e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5014,9 +5014,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,text}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d59c24ae238..fefd074f434 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_SYNC_REPLAY
 } WaitEventIPC;
 
 /* ----------
@@ -845,7 +846,8 @@ typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
 	WAIT_EVENT_PG_SLEEP,
-	WAIT_EVENT_RECOVERY_APPLY_DELAY
+	WAIT_EVENT_RECOVERY_APPLY_DELAY,
+	WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index bc43b4e1090..6a5bfcbb9ce 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
 
 #include "access/xlogdefs.h"
 #include "utils/guc.h"
+#include "utils/timestamp.h"
 
 #define SyncRepRequested() \
 	(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
 #define SYNC_REP_WAIT_WRITE		0
 #define SYNC_REP_WAIT_FLUSH		1
 #define SYNC_REP_WAIT_APPLY		2
+#define SYNC_REP_WAIT_SYNC_REPLAY	3
 
-#define NUM_SYNC_REP_WAIT_MODE	3
+#define NUM_SYNC_REP_WAIT_MODE	4
 
 /* syncRepState */
 #define SYNC_REP_NOT_WAITING		0
@@ -36,6 +38,12 @@
 #define SYNC_REP_PRIORITY		0
 #define SYNC_REP_QUORUM		1
 
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
 
 /* called by wal sender */
 extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
 /* GUC infrastructure */
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
 extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 5913b580c2b..58709e2e9be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * syncReplayLease is the time until which the primary has authorized this
+	 * standby to consider itself available for synchronous_replay mode, or 0
+	 * for not authorized.
+	 */
+	TimestampTz syncReplayLease;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -313,4 +320,6 @@ extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
 
+extern bool WalRcvSyncReplayAvailable(void);
+
 #endif							/* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 4b904779361..0909a64bdad 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
 	WALSNDSTATE_STOPPING
 } WalSndState;
 
+typedef enum SyncReplayState
+{
+	SYNC_REPLAY_UNAVAILABLE = 0,
+	SYNC_REPLAY_JOINING,
+	SYNC_REPLAY_AVAILABLE,
+	SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
 /*
  * Each walsender has a WalSnd struct in shared memory.
  *
@@ -60,6 +68,10 @@ typedef struct WalSnd
 	TimeOffset	flushLag;
 	TimeOffset	applyLag;
 
+	/* Synchronous replay state for this walsender. */
+	SyncReplayState syncReplayState;
+	TimestampTz revokingUntil;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
@@ -101,6 +113,14 @@ typedef struct
 	 */
 	bool		sync_standbys_defined;
 
+	/*
+	 * Until when must commits in synchronous replay stall?  This is used to
+	 * wait for synchronous replay leases to expire when a walsender exits
+	 * uncleanly, and we must stall synchronous replay commits until we're
+	 * sure that the remote server's lease has expired.
+	 */
+	TimestampTz	revokingUntil;
+
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 735dd37acff..bf32925b67f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1861,9 +1861,10 @@ pg_stat_replication| SELECT s.pid,
     w.flush_lag,
     w.replay_lag,
     w.sync_priority,
-    w.sync_state
+    w.sync_state,
+    w.sync_replay
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.19.1

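To summarise the walsender-side logic in the patch above, the state transitions in ProcessStandbyReplyMessage() can be condensed into a small standalone function. This is only a sketch for reading along: sync_replay_next_state(), SYNC_REPLAY_NO_CHANGE and the three boolean flags are hypothetical names standing in for conditions the patch evaluates inline (replay lag below synchronous_replay_max_lag, applyPtr reaching synchronous_replay_joining_until, and the revocation being acknowledged or timed out).

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the SyncReplayState enum added by the patch. */
typedef enum SyncReplayState
{
	SYNC_REPLAY_UNAVAILABLE = 0,
	SYNC_REPLAY_JOINING,
	SYNC_REPLAY_AVAILABLE,
	SYNC_REPLAY_REVOKING
} SyncReplayState;

/* Hypothetical name for the "-1 means no change" convention in the patch. */
#define SYNC_REPLAY_NO_CHANGE (-1)

/*
 * Hypothetical helper: return the next state, or SYNC_REPLAY_NO_CHANGE if
 * the walsender should stay where it is.
 */
int
sync_replay_next_state(SyncReplayState cur,
					   bool lag_acceptable,
					   bool reached_join_point,
					   bool revoke_acked_or_expired)
{
	assert((int) cur >= 0 && (int) cur <= (int) SYNC_REPLAY_REVOKING);

	switch (cur)
	{
		case SYNC_REPLAY_UNAVAILABLE:
			/* Can we join? */
			if (lag_acceptable)
				return SYNC_REPLAY_JOINING;
			break;
		case SYNC_REPLAY_JOINING:
			/* Falling behind while joining sends us back to the start. */
			if (!lag_acceptable)
				return SYNC_REPLAY_UNAVAILABLE;
			/* A lease is only issued once the join point is applied. */
			if (reached_join_point)
				return SYNC_REPLAY_AVAILABLE;
			break;
		case SYNC_REPLAY_AVAILABLE:
			/* Too slow: start revoking the lease. */
			if (!lag_acceptable)
				return SYNC_REPLAY_REVOKING;
			break;
		case SYNC_REPLAY_REVOKING:
			/* Wait for the standby's ack or for the lease to expire. */
			if (revoke_acked_or_expired)
				return SYNC_REPLAY_UNAVAILABLE;
			break;
	}
	return SYNC_REPLAY_NO_CHANGE;
}
```

Note that in this reading an available standby never drops straight to unavailable on lag alone; it always passes through the revoking state first.
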
#8Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Thomas Munro (#1)
Re: Synchronous replay take III

On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Hi hackers,

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Thank you for working on this. The overview and your summary were
helpful for understanding this feature. I've started to
review this patch for PostgreSQL 12. I've tested this patch and found
some issues, but let me ask you questions about the high-level design
first. Sorry if these have already been discussed.

In previous threads[1][2][3] I called this feature proposal "causal
reads". That was a terrible name, borrowed from MySQL. While it is
probably a useful term of art, for one thing people kept reading it as
"casual", which it ain't, and more importantly this patch is only one
way to achieve read-follows-write causal consistency. Several others
are proposed or exist in forks (user managed wait-for-LSN, global
transaction manager, ...).

OVERVIEW

For writers, it works a bit like RAID mirroring: when you commit a
write transaction, it waits until the data has become visible on all
elements of the array, and if an array element is not responding fast
enough it is kicked out of the array. For readers, it's a little
different because you're connected directly to the array elements
(rather than going through a central controller), so it uses a system
of leases allowing read transactions to know instantly whether
they are running on an element that is currently in the array and are
therefore able to service synchronous_replay transactions, or should
raise an error telling you to go and ask some other element.
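
The lease arithmetic behind this can be sketched as follows, based on the comments in WalSndKeepalive(): the primary remembers an expiry of now + synchronous_replay_lease_time, sends the standby a value reduced by a quarter of the lease time, and replaces leases every half lease time. The helper names and the millisecond clock type are assumptions made for illustration, and the standby-side comparison is a guess at what WalRcvSyncReplayAvailable() does rather than code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for TimestampTz, counted in milliseconds. */
typedef int64_t TimestampMs;

/* Expiry the primary remembers locally (synchronous_replay_last_lease). */
TimestampMs
lease_expiry_on_primary(TimestampMs now, int lease_time_ms)
{
	assert(lease_time_ms >= 0);
	return now + lease_time_ms;
}

/* Expiry sent to the standby: reduced by max clock skew = lease time / 4. */
TimestampMs
lease_expiry_sent(TimestampMs now, int lease_time_ms)
{
	int			max_clock_skew = lease_time_ms / 4;

	return lease_expiry_on_primary(now, lease_time_ms) - max_clock_skew;
}

/*
 * Assumed standby-side check: a lease of 0 means "not authorized",
 * otherwise it holds until the standby's own clock reaches the expiry.
 */
bool
lease_valid(TimestampMs lease, TimestampMs standby_now)
{
	return lease != 0 && standby_now < lease;
}
```

With a 5000 ms lease granted at primary time 100000, the standby is sent 103750. A standby whose clock is 1250 ms fast stops trusting the lease at primary time 102500, which is when the replacement is due; one that is 1250 ms slow stops at 105000, the expiry the primary remembered. Network latency is ignored here.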

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison for existing is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

To be clear, what did you mean by read-mostly workloads?

I think there are two kinds of reads on standbys: a read that happens
after writes, and a direct read (e.g. reporting). The former usually
requires causal reads, as you mentioned, in order to read its own
writes, but the latter might be different: it often wants to read the
latest data on the master at that time. IIUC even if we send a
read-only query directly to a synchronous replay server we could get a
stale result if the standby is delayed by less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case (i.e. a few writes and many reads that
want to see them), whereas for the latter case perhaps keeping the
reads waiting on the standby seems a reasonable solution.

Also I think it's worth to consider the cost both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring it) on the current design. IIUC the
current design seems like that we create something like
consistent-reads group by specifying servers. For example, if a
transaction doesn't want to causality read it can send query any
server with synchronous_replay = off but if it wants, it should select
a synchronous replay server. It also means that client applications or
routing middlewares such as pgpool is required to be aware of
available synchronous replay standbys. That is, this design would cost
the read-only transactions requiring causal reads. On the other hand,
in token-based causal reads we can send read-only query any standbys
if we can wait for the change to be replayed. Of course if we don't
wait forever we can timeout and switch to either another standby or
the master to execute query but we don't need to choose a server of
standby servers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#9Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Masahiko Sawada (#8)
Re: Synchronous replay take III

On Thu, Nov 15, 2018 at 6:34 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:

In previous threads[1][2][3] I called this feature proposal "causal
reads". That was a terrible name, borrowed from MySQL. While it is
probably a useful term of art, for one thing people kept reading it as
"casual"

Yeah, it was rather annoying that I couldn't get rid of this while playing
with the "take II" version :)

Unfortunately, cfbot says that the patch can't be applied without conflicts.
Could you please post a rebased version and address the comments from Masahiko?

#10Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Dmitry Dolgov (#9)
1 attachment(s)
Re: Synchronous replay take III

On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Unfortunately, cfbot says that the patch can't be applied without conflicts.
Could you please post a rebased version and address the comments from Masahiko?

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Thank you for working on this. The overview and your summary were
helpful for me to understand this feature, thank you. I've started to
review this patch for PostgreSQL 12. I've tested this patch and found
some issues but let me ask you questions about the high-level design
first. Sorry if these have already been discussed.

Thanks for your interest in this work!

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison for existing is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

To be clear, what did you mean by read-mostly workloads?

I mean workloads where only a small percentage of transactions perform
a write. If you need write-scalability, then hot_standby is not the
solution for you (with or without this patch).

The kind of user who would be interested in this feature is someone
who already uses some kind of heuristics to move some queries to
read-only standbys. For example, some people send transactions for
logged-in users to the primary database (because only logged-in users
generate write queries), and all the rest to standby servers (for
example "public" users who can only read content). Another technique
I have seen is to keep user sessions "pinned" on the primary server
for N minutes after they perform a write transaction. These types of
load balancing policies are primitive ways of achieving
read-your-writes consistency, but they are conservative and
pessimistic: they probably send too many queries to the primary node.

This proposal is much more precise, allowing you to run the minimum
number of transactions on the primary node (ie transactions that
actually need to perform a write), and the maximum number of
transactions on the hot standbys.

As discussed, making reads wait for a token would be a useful
alternative (and I am willing to help make that work too), but:

1. For users that do many more reads than writes, would you
rather make (say) 80% of transactions slower or 20%? (Or 99% vs 1% as
the case may be, depending on your application.)

2. If you are also using synchronous_commit = on for increased
durability, then you are already making writers wait, and you might be
able to tolerate a small increase.
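
To put rough numbers on point 1, here is a deliberately crude sketch in
Python. It assumes every penalised transaction pays the same wait, which
neither design guarantees (a token-based reader only waits if the standby
has not caught up yet), and the 80/20 split and the 2 ms figure are
illustrative assumptions, not measurements.

```python
def mean_added_latency(fraction_penalised, wait_ms):
    """Average extra latency per transaction when only a fraction of them wait."""
    return fraction_penalised * wait_ms


# Illustrative assumptions only: 80% reads, 20% writes, 2 ms wait either way.
reads, writes, wait_ms = 0.8, 0.2, 2.0

# Token-based causal reads: the readers wait for replay.
token_based = mean_added_latency(reads, wait_ms)

# Synchronous replay: the writers wait for replay.
synchronous_replay = mean_added_latency(writes, wait_ms)

if __name__ == "__main__":
    print("token-based:        %.1f ms per transaction on average" % token_based)
    print("synchronous replay: %.1f ms per transaction on average" % synchronous_replay)
```

Under these assumptions the average cost scales with whichever class of
transaction is made to wait, which is why the read/write ratio matters.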

Peter Eisentraut expressed an interesting point of view against this
general line of thinking:

/messages/by-id/5643933F.4010701@gmx.net

My questions are: Why do we have hot_standby mode? Is load balancing
a style of usage we want to support? Do we want a technology that
lets you do more of it?

I think there are two kinds of reads on standbys: a read that happens
after writes, and a direct read (e.g. reporting). The former usually
requires causal reads, as you mentioned, in order to read its own
writes, but the latter might be different: it often wants to read the
latest data on the master at the time. IIUC even if we send a
read-only query directly to a synchronous replay server we could get a
stale result if the standby is delayed by less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case (i.e. a few writes and many reads that
want to see them), whereas for the latter case keeping the reads
waiting on the standby seems a reasonable solution.

I agree 100% that this is not a solution for all users. But I also
suspect a token system would be quite complicated, and can't be done
in a way that is transparent to applications without giving up
performance advantages. I wrote about my understanding of the
trade-offs here:

/messages/by-id/CAEepm=0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ@mail.gmail.com

Also I think it's worth considering the cost of both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring them) on the current design. IIUC the
current design creates something like a consistent-reads group by
specifying servers. For example, if a transaction doesn't need causal
reads it can send its query to any server with synchronous_replay = off,
but if it does, it should select a synchronous replay server. It also
means that client applications or routing middleware such as pgpool
are required to be aware of the available synchronous replay standbys.
That is, this design imposes a cost on the read-only transactions
requiring causal reads. On the other hand, with token-based causal
reads we can send a read-only query to any standby if we can wait for
the change to be replayed. Of course, if we don't want to wait forever
we can time out and switch to either another standby or the master to
execute the query, but we don't need to choose a particular server
among the standby servers.

Yeah. I think tools like pgpool that already know how to connect to
the primary and look at pg_stat_replication could use the new column
to learn which servers support synchronous replay, for routing
purposes. I also think that existing read/write load balancing tools
for Python (eg "django-balancer"), Ruby (eg "makara") and Java could be
adjusted to work with this quite easily.
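
A rough sketch of that discovery step, in Python for illustration. It
assumes the sync_replay column and its 'available' / 'joining' /
'unavailable' values as described in this patch's monitoring.sgml changes
(they are not part of released PostgreSQL). The function name and the
dict-per-row format are hypothetical, chosen so the sketch doesn't depend
on any particular driver.

```python
# Query a router could run on the primary.  The sync_replay column exists
# only with this patch applied.
DISCOVERY_SQL = """
    SELECT application_name, sync_replay
      FROM pg_stat_replication
"""


def available_standbys(rows):
    """Return the application names of standbys in 'available' state.

    'rows' is an iterable of dicts keyed by column name (a hypothetical
    format; adapt it to whatever your driver returns).  A NULL sync_replay
    (None here) means synchronous_replay_max_lag is not set on the primary.
    """
    return [row["application_name"]
            for row in rows
            if row.get("sync_replay") == "available"]


if __name__ == "__main__":
    sample = [
        {"application_name": "standby1", "sync_replay": "available"},
        {"application_name": "standby2", "sync_replay": "joining"},
        {"application_name": "standby3", "sync_replay": "unavailable"},
    ]
    print(available_standbys(sample))
```

Since the view is only a snapshot, a router using this would still need to
handle the 40P02 error from a standby it believed to be available.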

In response to a general question from Simon Riggs at a conference
about how anyone is supposed to use this thing in real life, I wrote a
proof-of-concept Java Spring application that shows the techniques
that I think are required to make good use of it:

https://github.com/macdice/syncreplay-spring-demo

1. Use a transaction management library (this includes Python Django
transaction management, Ruby ActiveRecord IIUC, Java Spring
declarative transactions, ...), so that whole transactions can be
retried automatically. This is generally a good idea anyway because
it lets you retry automatically on serialisation failures and deadlock
errors. The new error 40P02
ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE is just another reason to
retry, in SQL error code class "40" (or perhaps it should be "72"... I
have joked that the new error could be called "snapshot too young"!)

2. Classify transactions (= blocks of code that run a transaction) as
read-write or read-only. This can be done adaptively by remembering
ERRCODE_READ_ONLY_SQL_TRANSACTION errors from previous attempts, or
explicitly using something like Java's @Transactional(readOnly=true)
annotations, so that the transaction management library can
automatically route transactions through the right connection.

3. Automatically avoid standby servers that have recently failed with
40P02 errors.

4. Somehow know which server is the primary (my Java POC doesn't
tackle that problem, but there are various techniques, such as trying
all of them if you start seeing ERRCODE_READ_ONLY_SQL_TRANSACTION from
the server that you expected to be a primary).

The basic idea is that with a little bit of help from your
language-specific transaction management infrastructure, your
application can be 100% unaware, and benefit from load balancing. The
point is that KeyValueController.java knows nothing about any of that
stuff, and all the rest is Spring configuration that allows
transactions to be routed to N database servers. It never shows you
stale data.
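
Techniques 1 and 3 above can be sketched without any framework. The
following Python is only an illustration of the idea, not the code from the
Spring demo: DatabaseError, ReadRouter and the penalty_seconds parameter are
hypothetical names, server handles are plain strings, and the only thing
taken from the patch is the proposed SQLSTATE 40P02.

```python
import time

SYNC_REPLAY_NOT_AVAILABLE = "40P02"  # new SQLSTATE proposed by this patch


class DatabaseError(Exception):
    """Stand-in for a driver exception that carries a SQLSTATE (hypothetical)."""

    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate


class ReadRouter:
    """Retry read-only transactions, skipping standbys that recently raised 40P02."""

    def __init__(self, standbys, primary, penalty_seconds=5.0, clock=time.monotonic):
        self.standbys = list(standbys)
        self.primary = primary
        self.penalty_seconds = penalty_seconds
        self.clock = clock
        self.avoid_until = {}  # server -> time before which we skip it

    def candidates(self):
        """Standbys not currently being avoided, then the primary as a last resort."""
        now = self.clock()
        usable = [s for s in self.standbys
                  if self.avoid_until.get(s, float("-inf")) <= now]
        return usable + [self.primary]

    def run_read_only(self, transaction):
        """Call transaction(server) on each candidate until one succeeds."""
        for server in self.candidates():
            try:
                return transaction(server)
            except DatabaseError as e:
                if e.sqlstate != SYNC_REPLAY_NOT_AVAILABLE or server == self.primary:
                    raise
                # Avoid this standby for a while and retry the whole
                # transaction on the next candidate.
                self.avoid_until[server] = self.clock() + self.penalty_seconds
        # Not reached: the primary is always the last candidate, and any
        # error it raises is propagated above.
        raise RuntimeError("no server could run the transaction")
```

A real integration would hook this into the transaction manager so that
application code never sees the retry, and would combine it with the
read-write/read-only classification from technique 2.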

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Synchronous-replay-mode-for-avoiding-stale-reads--v9.patch (application/octet-stream)
From 89e8bf243c902d71466df74d151ae42ec6f56faa Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Synchronous replay mode for avoiding stale reads on hot
 standbys.

While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability.  When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.

Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.

To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02.  It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.

Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.

Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
             Joel Jacobson, Heikki Linnakangas, Michael Paquier,
             Robert Haas, Ants Aasma, Masahiko Sawada
Discussion: https://postgr.es/m/CAEepm%3D0Q6kCKMYFBN%2BVv2frPc%3D3cS3T1MPOxnZ9do8%2BNHzoJTA%40mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  87 +++
 doc/src/sgml/high-availability.sgml           | 139 ++++-
 doc/src/sgml/monitoring.sgml                  |  12 +
 src/backend/access/transam/xact.c             |   2 +-
 src/backend/catalog/system_views.sql          |   3 +-
 src/backend/postmaster/pgstat.c               |   6 +
 src/backend/replication/logical/worker.c      |   2 +
 src/backend/replication/syncrep.c             | 500 ++++++++++++++----
 src/backend/replication/walreceiver.c         |  82 ++-
 src/backend/replication/walreceiverfuncs.c    |  19 +
 src/backend/replication/walsender.c           | 364 +++++++++++--
 src/backend/utils/errcodes.txt                |   1 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/backend/utils/misc/postgresql.conf.sample |  19 +
 src/backend/utils/time/snapmgr.c              |  13 +
 src/bin/pg_basebackup/pg_recvlogical.c        |   6 +-
 src/bin/pg_basebackup/receivelog.c            |   5 +-
 src/include/catalog/pg_proc.dat               |   6 +-
 src/include/pgstat.h                          |   6 +-
 src/include/replication/syncrep.h             |  16 +-
 src/include/replication/walreceiver.h         |   9 +
 src/include/replication/walsender_private.h   |  20 +
 src/test/regress/expected/rules.out           |   5 +-
 23 files changed, 1205 insertions(+), 160 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2e5a5cd331b..79806cc4b1b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3433,6 +3433,36 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
      across the cluster without problems if that is required.
     </para>
 
+    <sect2 id="runtime-config-replication-all">
+     <title>All Servers</title>
+     <para>
+      These parameters can be set on the primary or any standby.
+     </para>
+     <variablelist>
+      <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+       <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables causal consistency between transactions run on different
+         servers.  A transaction that is run on a standby
+         with <varname>synchronous_replay</varname> set to <literal>on</literal> is
+         guaranteed either to see the effects of all completed transactions
+         run on the primary with the setting on, or to receive an error
+         "standby is not available for synchronous replay".  Note that both
+         transactions involved in a causal dependency (a write on the primary
+         followed by a read on any server which must see the write) must be
+         run with the setting on.  See <xref linkend="synchronous-replay"/> for
+         more details.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>     
+    </sect2>
+
     <sect2 id="runtime-config-replication-sender">
      <title>Sending Servers</title>
 
@@ -3751,6 +3781,63 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><varname>synchronous_replay_max_lag</varname>
+      (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_max_lag</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum replay lag the primary will tolerate from a
+        standby before dropping it from the synchronous replay set.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_lease_time</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the duration of 'leases' sent by the primary server to
+        standbys granting them the right to run synchronous replay queries for
+        a limited time.  This affects the rate at which replacement leases
+        must be sent and the wait time if contact is lost with a standby.
+        This must be set to a value which is at least 4 times the maximum
+        possible difference in system clocks between the primary and standby
+        servers, as described in <xref linkend="synchronous-replay"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous-replay-standby-names">
+      <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>synchronous_replay_standby_names</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies a comma-separated list of standby names that can support
+        <firstterm>synchronous replay</firstterm>, as described in
+        <xref linkend="synchronous-replay"/>.  Follows the same convention
+        as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</literal></link>.
+        The default is <literal>*</literal>, matching all standbys.
+       </para>
+       <para>
+        This setting has no effect if <varname>synchronous_replay_max_lag</varname>
+        is not set.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d8fd195da09..1bc4b35a028 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1154,11 +1154,12 @@ primary_slot_name = 'node_a_slot'
    </para>
 
    <para>
-    Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
-    cause each commit to wait until the current synchronous standbys report
-    that they have replayed the transaction, making it visible to user
-    queries.  In simple cases, this allows for load balancing with causal
-    consistency.
+    Setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> will cause each commit to wait until
+    the current synchronous standbys report that they have replayed the
+    transaction, making it visible to user queries.  In simple cases, this
+    allows for load balancing with causal consistency.  See also
+    <xref linkend="synchronous-replay"/>.
    </para>
 
    <para>
@@ -1356,6 +1357,122 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="synchronous-replay">
+   <title>Synchronous replay</title>
+   <indexterm>
+    <primary>synchronous replay</primary>
+    <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+    The synchronous replay feature allows read-only queries to run on hot
+    standby servers without exposing stale data to the client, providing a
+    form of causal consistency.  Transactions can run on any standby with the
+    following guarantee about the visibility of preceding transactions: If you
+    set <varname>synchronous_replay</varname> to <literal>on</literal> in any
+    pair of consecutive transactions tx1, tx2 where tx2 begins after tx1
+    successfully returns, then tx2 will either see tx1 or fail with a new
+    error "standby is not available for synchronous replay", no matter which
+    server it runs on.  Although the guarantee is expressed in terms of two
+    individual transactions, the GUC can also be set at session, role or
+    system level to make the guarantee generally, allowing for load balancing
+    of applications that were not designed with load balancing in mind.
+   </para>
+
+   <para>
+    In order to enable the
+    feature, <varname>synchronous_replay_max_lag</varname> must be set to a
+    non-zero value on the primary server.  The
+    GUC <varname>synchronous_replay_standby_names</varname> can be used to
+    limit the set of standbys that can join the dynamic set of synchronous
+    replay standbys by providing a comma-separated list of application names.
+    By default, all standbys are candidates, if the feature is enabled.
+   </para>
+
+   <para>
+    The current set of servers that the primary considers to be available for
+    synchronous replay can be seen in
+    the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</literal></link>
+    view.  Administrators, applications and load balancing middleware can use
+    this view to discover standbys that can currently handle synchronous
+    replay transactions without raising the error.  Since that information is
+    only an instantaneous snapshot, clients should still be prepared for
+    the error to be raised at any time, and consider redirecting transactions
+    to another standby.
+   </para>
+
+   <para>
+    The advantages of the synchronous replay feature over simply
+    setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> are:
+    <orderedlist>
+      <listitem>
+       <para>
+        It provides certainty about exactly which standbys can see a
+        transaction.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It places a configurable limit on how much replay lag (and therefore
+        delay at commit time) the primary tolerates from standbys before it
+        drops them from the dynamic set of standbys it waits for.
+       </para>   
+      </listitem>
+      <listitem>
+       <para>
+        It upholds the synchronous replay guarantee during the transitions that
+        occur when new standbys are added or removed from the set of standbys,
+        including scenarios where contact has been lost between the primary
+        and standbys but the standby is still alive and running client
+        queries.
+       </para>
+      </listitem>
+    </orderedlist>
+   </para>
+
+   <para>
+    The protocol used to uphold the guarantee even in the case of network
+    failure depends on the system clocks of the primary and standby servers
+    being synchronized, with an allowance for a difference up to one quarter
+    of <varname>synchronous_replay_lease_time</varname>.  For example,
+    if <varname>synchronous_replay_lease_time</varname> is set
+    to <literal>5s</literal>, then the clocks must not be more than 1.25
+    seconds apart for the guarantee to be upheld reliably during transitions.
+    The ubiquity of the Network Time Protocol (NTP) on modern operating
+    systems and availability of high quality time servers makes it possible to
+    choose a tolerance significantly higher than the maximum expected clock
+    difference.  An effort is nevertheless made to detect and report
+    misconfigured and faulty systems with clock differences greater than the
+    configured tolerance.
+   </para>
+
+   <note>
+    <para>
+     Current hardware clocks, NTP implementations and public time servers are
+     unlikely to allow the system clocks to differ more than tens or hundreds
+     of milliseconds, and systems synchronized with dedicated local time
+     servers may be considerably more accurate, but you should only consider
+     setting <varname>synchronous_replay_lease_time</varname> below the
+     default of 5 seconds (allowing up to 1.25 seconds of clock difference)
+     after researching your time synchronization infrastructure thoroughly.
+    </para>  
+   </note>
+
+   <note>
+    <para>
+      While similar to synchronous commit in the sense that both involve the
+      primary server waiting for responses from standby servers, the
+      synchronous replay feature is not concerned with avoiding data loss.  A
+      primary configured for synchronous replay will drop all standbys that
+      stop responding or replay too slowly from the dynamic set that it waits
+      for, so you should consider configuring both synchronous replication and
+      synchronous replay if you need data loss avoidance guarantees and causal
+      consistency guarantees for load balancing.
+    </para>
+   </note>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous archiving in standby</title>
 
@@ -1701,7 +1818,17 @@ if (!triggered)
     so there will be a measurable delay between primary and standby. Running the
     same query nearly simultaneously on both primary and standby might therefore
     return differing results. We say that data on the standby is
-    <firstterm>eventually consistent</firstterm> with the primary.  Once the
+    <firstterm>eventually consistent</firstterm> with the primary by default.
+    The data visible to a transaction running on a standby can be
+    made <firstterm>causally consistent</firstterm> with respect to a
+    transaction that has completed on the primary by
+    setting <varname>synchronous_replay</varname> to <literal>on</literal> in
+    both transactions.  For more details,
+    see <xref linkend="synchronous-replay"/>.
+   </para>
+
+   <para>
+    Once the    
     commit record for a transaction is replayed on the standby, the changes
     made by that transaction will be visible to any new snapshots taken on
     the standby.  Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 7aada144179..a9efa93813e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1920,6 +1920,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        </itemizedlist>
      </entry>
     </row>
+    <row>
+     <entry><structfield>sync_replay</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Synchronous replay state of this standby server.  This field will
+     be non-null only if <varname>synchronous_replay_max_lag</varname> is set.
+     If a standby is in <literal>available</literal> state, then it can
+     currently serve synchronous replay queries.  If it is not replaying fast
+     enough or not responding to keepalive messages, it will be
+     in <literal>unavailable</literal> state, and if it is currently
+     transitioning to availability it will be in <literal>joining</literal>
+     state for a short time.</entry>
+    </row>
    </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400384b..fa2b28634ba 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5297,7 +5297,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	 * Check if the caller would like to ask standbys for immediate feedback
 	 * once this commit is applied.
 	 */
-	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
 		xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
 
 	/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 715995dd883..df283513372 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -734,7 +734,8 @@ CREATE VIEW pg_stat_replication AS
             W.flush_lag,
             W.replay_lag,
             W.sync_priority,
-            W.sync_state
+            W.sync_state,
+            W.sync_replay
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8676088e57d..ea6664cb925 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3682,6 +3682,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY:
+			event_name = "SyncReplay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
@@ -3710,6 +3713,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_APPLY_DELAY:
 			event_name = "RecoveryApplyDelay";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+			event_name = "SyncReplayLeaseRevoke";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8d5e0946c4b..3f7fb52f7de 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1201,6 +1201,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						TimestampTz timestamp;
 						bool		reply_requested;
 
+						(void) pq_getmsgint64(&s); /* skip messageNumber */
 						end_lsn = pq_getmsgint64(&s);
 						timestamp = pq_getmsgint64(&s);
 						reply_requested = pq_getmsgbyte(&s);
@@ -1404,6 +1405,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, writepos);	/* apply */
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
+	pq_sendint64(reply_message, -1);		/* replyTo */
 
 	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 5b8a268fa16..f9744584338 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 					 XLogRecPtr *flushPtr,
@@ -128,6 +137,230 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Check if we can stop waiting for synchronous replay.  We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1.  All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2.  All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for.  The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+						  int *waitingFor,
+						  long *stallTimeMillis)
+{
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz stallTime = 0;
+	int i;
+
+	/* Count how many joining/available nodes we are waiting for. */
+	*waitingFor = 0;
+
+	for (i = 0; i < max_wal_senders; ++i)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		if (walsnd->pid != 0)
+		{
+			/*
+			 * We need to hold the spinlock to read LSNs, because we can't be
+			 * sure they can be read atomically.
+			 */
+			SpinLockAcquire(&walsnd->mutex);
+			if (walsnd->pid != 0)
+			{
+				switch (walsnd->syncReplayState)
+				{
+				case SYNC_REPLAY_UNAVAILABLE:
+					/* Nothing to wait for. */
+					break;
+				case SYNC_REPLAY_JOINING:
+				case SYNC_REPLAY_AVAILABLE:
+					/*
+					 * We have to wait until this standby tells us that it has
+					 * replayed the commit record.
+					 */
+					if (walsnd->apply < XactCommitLSN)
+						++*waitingFor;
+					break;
+				case SYNC_REPLAY_REVOKING:
+					/*
+					 * We have to hold up commits until this standby
+					 * acknowledges that its lease was revoked, or we know the
+					 * most recently sent lease has expired anyway, whichever
+					 * comes first.  One way or the other, we don't release
+					 * until this standby has started raising an error for
+					 * synchronous replay transactions.
+					 */
+					if (walsnd->revokingUntil > now)
+					{
+						++*waitingFor;
+						stallTime = Max(stallTime, walsnd->revokingUntil);
+					}
+					break;
+				}
+			}
+			SpinLockRelease(&walsnd->mutex);
+		}
+	}
+
+	/*
+	 * If a walsender has exited uncleanly, then it writes its revoking wait
+	 * time into a shared space before it gives up its WalSnd slot.  So we
+	 * have to wait for that too.
+	 */
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	if (WalSndCtl->revokingUntil > now)
+	{
+		long seconds;
+		int usecs;
+
+		/* Compute how long we have to wait, rounded up to nearest ms. */
+		TimestampDifference(now, WalSndCtl->revokingUntil,
+							&seconds, &usecs);
+		*stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+	}
+	else
+		*stallTimeMillis = 0;
+	LWLockRelease(SyncRepLock);
+
+	/* We are done if we are not waiting for any nodes or stalls. */
+	return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all "available" and "joining" standbys to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked.  By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+	long stallTimeMillis;
+	int waitingFor;
+	char *ps_display_buffer = NULL;
+
+	for (;;)
+	{
+		/* Reset latch before checking state. */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Join the queue to be woken up if any synchronous replay
+		 * joining/available standby applies XactCommitLSN or the set of
+		 * synchronous replay standbys changes (if we aren't already in the
+		 * queue).  We don't actually know if we need to wait for any peers to
+		 * reach the target LSN yet, but we have to register just in case
+		 * before checking the walsenders' state to avoid a race condition
+		 * that could occur if we did it after calling
+		 * SyncReplayCommitCanReturn.  (SyncRepWaitForLSN doesn't have
+		 * to do this because it can check the highest-seen LSN in
+		 * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+		 * lock as the queues.  We can't do that here, because there is no
+		 * single highest-seen LSN that is useful.  We must check
+		 * walsnd->apply for all relevant walsenders.  Therefore we must
+		 * register for notifications first, so that we can be notified via
+		 * our latch of any standby applying the LSN we're interested in after
+		 * we check but before we start waiting, or we could wait forever for
+		 * something that has already happened.)
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		if (MyProc->syncRepState != SYNC_REP_WAITING)
+		{
+			MyProc->waitLSN = XactCommitLSN;
+			MyProc->syncRepState = SYNC_REP_WAITING;
+			SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+			Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+		}
+		LWLockRelease(SyncRepLock);
+
+		/* Check if we're done. */
+		if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+									  &stallTimeMillis))
+		{
+			SyncRepCancelWait();
+			break;
+		}
+
+		Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+		/* If we aren't actually waiting for any standbys, leave the queue. */
+		if (waitingFor == 0)
+			SyncRepCancelWait();
+
+		/* Update the ps title. */
+		if (update_process_title)
+		{
+			char buffer[80];
+
+			/* Remember the old value if this is our first update. */
+			if (ps_display_buffer == NULL)
+			{
+				int len;
+				const char *ps_display = get_ps_display(&len);
+
+				ps_display_buffer = palloc(len + 1);
+				memcpy(ps_display_buffer, ps_display, len);
+				ps_display_buffer[len] = '\0';
+			}
+
+			snprintf(buffer, sizeof(buffer),
+					 "waiting for %d peer(s) to apply %X/%X%s",
+					 waitingFor,
+					 (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+					 stallTimeMillis > 0 ? " (revoking)" : "");
+			set_ps_display(buffer, false);
+		}
+
+		/* Check if we need to exit early due to termination or query cancel. */
+		if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+			break;
+
+		/*
+		 * If we are still waiting for peers, then we wait for any joining or
+		 * available peer to reach the LSN (or possibly stop being in one of
+		 * those states or go away).
+		 *
+		 * If not, there must be a non-zero stall time, so we wait for that to
+		 * elapse.
+		 */
+		if (waitingFor > 0)
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+							 WAIT_EVENT_SYNC_REPLAY);
+		else
+			(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH |
+							 WL_TIMEOUT,
+							 stallTimeMillis,
+							 WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+	}
+
+	/* There is no way out of the loop that could leave us in the queue. */
+	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	MyProc->waitLSN = 0;
+
+	/* Restore the ps display. */
+	if (ps_display_buffer != NULL)
+	{
+		set_ps_display(ps_display_buffer, false);
+		pfree(ps_display_buffer);
+	}
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -149,11 +382,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
-	/* Cap the level for anything other than commit to remote flush only. */
-	if (commit)
-		mode = SyncRepWaitMode;
-	else
-		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+	/* Wait for synchronous replay, if configured. */
+	if (synchronous_replay)
+		SyncReplayWaitForLSN(lsn);
 
 	/*
 	 * Fast exit if user has not requested sync replication.
@@ -167,6 +398,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
+	/* Cap the level for anything other than commit to remote flush only. */
+	if (commit)
+		mode = SyncRepWaitMode;
+	else
+		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 	 * set.  See SyncRepUpdateSyncStandbysDefined.
@@ -214,8 +451,6 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 */
 	for (;;)
 	{
-		int			rc;
-
 		/* Must reset the latch before testing state. */
 		ResetLatch(MyLatch);
 
@@ -229,64 +464,16 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
 			break;
 
-		/*
-		 * If a wait for synchronous replication is pending, we can neither
-		 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
-		 * lead the client to believe that the transaction aborted, which is
-		 * not true: it's already committed locally. The former is no good
-		 * either: the client has requested synchronous replication, and is
-		 * entitled to assume that an acknowledged commit is also replicated,
-		 * which might not be true. So in this case we issue a WARNING (which
-		 * some clients may be able to interpret) and shut off further output.
-		 * We do NOT reset ProcDiePending, so that the process will die after
-		 * the commit is cleaned up.
-		 */
-		if (ProcDiePending)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
+		/* Check if we need to break early due to cancel/shutdown. */
+		if (SyncRepCheckForEarlyExit())
 			break;
-		}
-
-		/*
-		 * It's unclear what to do if a query cancel interrupt arrives.  We
-		 * can't actually abort at this point, but ignoring the interrupt
-		 * altogether is not helpful, so we just terminate the wait with a
-		 * suitable warning.
-		 */
-		if (QueryCancelPending)
-		{
-			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			SyncRepCancelWait();
-			break;
-		}
 
 		/*
 		 * Wait on latch.  Any condition that should wake us up will set the
 		 * latch, so no need for timeout.
 		 */
-		rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
-					   WAIT_EVENT_SYNC_REP);
-
-		/*
-		 * If the postmaster dies, we'll probably never get an
-		 * acknowledgment, because all the wal sender processes will exit. So
-		 * just bail out.
-		 */
-		if (rc & WL_POSTMASTER_DEATH)
-		{
-			ProcDiePending = true;
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
-			break;
-		}
+		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
+						 WAIT_EVENT_SYNC_REP);
 	}
 
 	/*
@@ -401,15 +588,66 @@ SyncRepInitConfig(void)
 	}
 }
 
+/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+	char *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		found = false;
+
+	/* If the feature is disabled, then no. */
+	if (synchronous_replay_max_lag == 0)
+		return false;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(synchronous_replay_standby_names);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		list_free(elemlist);
+		/* GUC machinery will have already complained - no need to do again */
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *standby_name = (char *) lfirst(l);
+
+		if (pg_strcasecmp(standby_name, application_name) == 0 ||
+			pg_strcasecmp(standby_name, "*") == 0)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return found;
+}
+
 /*
  * Update the LSNs on each queue based upon our latest state. This
  * implements a simple policy of first-valid-sync-standby-releases-waiter.
  *
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
  * Other policies are possible, which would change what we do here and
  * perhaps also which information we store as well.
  */
 void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	XLogRecPtr	writePtr;
@@ -423,15 +661,17 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
-	 * potential sync standbys then we have nothing to do. If we are still
-	 * starting up, still running base backup or the current flush position is
-	 * still invalid, then leave quickly also.  Streaming or stopping WAL
-	 * senders are allowed to release waiters.
+	 * potential sync standbys and not in a state that synchronous_replay waits
+	 * for, then we have nothing to do. If we are still starting up, still
+	 * running base backup or the current flush position is still invalid,
+	 * then leave quickly also.
 	 */
-	if (MyWalSnd->sync_standby_priority == 0 ||
-		(MyWalSnd->state != WALSNDSTATE_STREAMING &&
-		 MyWalSnd->state != WALSNDSTATE_STOPPING) ||
-		XLogRecPtrIsInvalid(MyWalSnd->flush))
+	if (!am_syncreplay_blocker &&
+		(MyWalSnd->sync_standby_priority == 0 ||
+		 (MyWalSnd->state != WALSNDSTATE_STREAMING &&
+		  MyWalSnd->state != WALSNDSTATE_STOPPING) ||
+		 MyWalSnd->state < WALSNDSTATE_STREAMING ||
+		 XLogRecPtrIsInvalid(MyWalSnd->flush)))
 	{
 		announce_next_takeover = true;
 		return;
@@ -469,9 +709,10 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
+	 * managing a sync standby or a standby in synchronous replay state that
+	 * blocks then just leave.
 	 */
-	if (!got_recptr || !am_sync)
+	if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -480,24 +721,36 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
-	 * this location.
+	 * this location, for backends waiting for synchronous commit.
 	 */
-	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+	if (got_recptr && am_sync)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+			numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+			numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+			numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+		}
 	}
 
+	/*
+	 * Wake backends that are waiting for synchronous_replay, if this walsender
+	 * manages a standby that is in synchronous replay 'available' or 'joining'
+	 * state.
+	 */
+	if (am_syncreplay_blocker)
+		SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+						 MyWalSnd->apply);
+
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -997,9 +1250,8 @@ SyncRepGetStandbyPriority(void)
  * Must hold SyncRepLock.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
 {
-	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
 	PGPROC	   *thisproc = NULL;
 	int			numprocs = 0;
@@ -1016,7 +1268,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Assume the queue is ordered by LSN
 		 */
-		if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+		if (!all && lsn < proc->waitLSN)
 			return numprocs;
 
 		/*
@@ -1083,7 +1335,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
 		}
 
 		/*
@@ -1134,6 +1386,51 @@ SyncRepQueueIsOrderedByLSN(int mode)
 }
 #endif
 
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+	/*
+	 * If a wait for synchronous replication is pending, we can neither
+	 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
+	 * lead the client to believe that the transaction aborted, which is
+	 * not true: it's already committed locally. The former is no good
+	 * either: the client has requested synchronous replication, and is
+	 * entitled to assume that an acknowledged commit is also replicated,
+	 * which might not be true. So in this case we issue a WARNING (which
+	 * some clients may be able to interpret) and shut off further output.
+	 * We do NOT reset ProcDiePending, so that the process will die after
+	 * the commit is cleaned up.
+	 */
+	if (ProcDiePending)
+	{
+		ereport(WARNING,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * It's unclear what to do if a query cancel interrupt arrives.  We
+	 * can't actually abort at this point, but ignoring the interrupt
+	 * altogether is not helpful, so we just terminate the wait with a
+	 * suitable warning.
+	 */
+	if (QueryCancelPending)
+	{
+		QueryCancelPending = false;
+		ereport(WARNING,
+				(errmsg("canceling wait for synchronous replication due to user request"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		SyncRepCancelWait();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ===========================================================
  * Synchronous Replication functions executed by any process
@@ -1203,6 +1500,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
 	SyncRepConfig = (SyncRepConfigData *) extra;
 }
 
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return true;
+}
+
 void
 assign_synchronous_commit(int newval, void *extra)
 {
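
A note on the stall computation in SyncReplayCommitCanReturn() above. The patch rounds the remaining stall up to whole milliseconds. Truncating instead would turn a sub-millisecond remainder into 0, and since the function treats a zero stall as "done", a commit could be released just before the lease expires. A minimal sketch of that arithmetic follows; the helper name interval_to_millis_round_up is hypothetical and not part of the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the patch): convert an interval given as
 * whole seconds plus microseconds into milliseconds, rounding up, using
 * the same expression as SyncReplayCommitCanReturn().
 */
long
interval_to_millis_round_up(long seconds, int usecs)
{
	/* The microsecond part is assumed to be normalised to [0, 1000000). */
	assert(seconds >= 0 && usecs >= 0 && usecs < 1000000);

	/* Adding 999 before dividing rounds any non-zero remainder up. */
	return seconds * 1000 + (usecs + 999) / 1000;
}
```

For example, a remaining interval of 0 s and 1 us yields a 1 ms wait rather than 0.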
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9643c2ed7b3..8934aee543c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -58,6 +58,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -140,9 +141,10 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
 static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+								  TimestampTz *syncReplayLease);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -476,7 +478,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, -1);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -521,7 +523,7 @@ WalReceiverMain(void)
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, -1);
 					}
 				}
 				if (rc & WL_TIMEOUT)
@@ -570,7 +572,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, -1);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -862,6 +864,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 	XLogRecPtr	walEnd;
 	TimestampTz sendTime;
 	bool		replyRequested;
+	TimestampTz syncReplayLease;
+	int64		messageNumber;
 
 	resetStringInfo(&incoming_message);
 
@@ -881,7 +885,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				dataStart = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, NULL);
 
 				buf += hdrlen;
 				len -= hdrlen;
@@ -891,7 +895,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 		case 'k':				/* Keepalive */
 			{
 				/* copy message to StringInfo */
-				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+					sizeof(char) + sizeof(int64);
 				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -899,15 +904,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
 
 				/* read the fields */
+				messageNumber = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
 				replyRequested = pq_getmsgbyte(&incoming_message);
+				syncReplayLease = pq_getmsgint64(&incoming_message);
 
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, messageNumber);
 				break;
 			}
 		default:
@@ -1070,7 +1077,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, -1);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1088,9 +1095,12 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbearts, when approaching
  * wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should include the message number, otherwise -1.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
@@ -1137,6 +1147,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+	pq_sendint64(&reply_message, replyTo);
 
 	/* Send it */
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1269,15 +1280,56 @@ XLogWalRcvSendHSFeedback(bool immed)
  * Update shared memory status upon receiving a message from primary.
  *
  * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary.  'syncReplayLease' points to the time until
+ * which the primary promises this standby may claim to be causally
+ * consistent, or to 0 if it may not; a NULL pointer means no change.
  */
 static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+					  TimestampTz *syncReplayLease)
 {
 	WalRcvData *walrcv = WalRcv;
 
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
 
+	/* Sanity check for the syncReplayLease time. */
+	if (syncReplayLease != NULL && *syncReplayLease != 0)
+	{
+		/*
+		 * Deduce max_clock_skew from the syncReplayLease and sendTime since
+		 * we don't have access to the primary's GUC.  The primary already
+		 * subtracted 25% from synchronous_replay_lease_time to represent
+		 * max_clock_skew, so we have 75%.  A third of that will give us 25%.
+		 */
+		int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+		int64 max_clock_skew = diffMillis / 3;
+		if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+												   max_clock_skew))
+		{
+			/*
+			 * The primary's clock is more than max_clock_skew + network
+			 * latency ahead of the standby's clock.  (If the primary's clock
+			 * is more than max_clock_skew ahead of the standby's clock, but
+			 * by less than the network latency, then there isn't much we can
+			 * do to detect that; but it still seems useful to have this basic
+			 * sanity check for wildly misconfigured servers.)
+			 */
+			ereport(LOG,
+					(errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+					 errhint("Check your servers' NTP configuration or equivalent.")));
+
+			syncReplayLease = NULL;
+		}
+		/*
+		 * We could also try to detect cases where sendTime is more than
+		 * max_clock_skew in the past according to the standby's clock, but
+		 * that is indistinguishable from network latency/buffering, so we
+		 * could produce misleading error messages; if we do nothing, the
+		 * consequence is 'standby is not available for synchronous replay'
+		 * errors which should cause the user to investigate.
+		 */
+	}
+
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
@@ -1285,6 +1337,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+	if (syncReplayLease != NULL)
+		walrcv->syncReplayLease = *syncReplayLease;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (log_min_messages <= DEBUG2)
@@ -1322,7 +1376,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
  */
 void
 WalRcvForceReply(void)
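
A note on the clock sanity check in ProcessWalSndrMessage() above. According to the comment in the patch, the primary sends a lease that ends 75% of synchronous_replay_lease_time after sendTime and keeps the remaining 25% as max_clock_skew, so the standby can recover that 25% as one third of (lease - sendTime). The sketch below restates the arithmetic. The function names are hypothetical, and timestamps are assumed to be in microseconds, as the division by 1000 in the patch implies.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the standby-side deduction: the lease spans
 * 75% of the lease time, so a third of that span is the 25% skew allowance.
 * Inputs are microsecond timestamps; the result is in milliseconds.
 */
int64_t
deduce_max_clock_skew_ms(int64_t send_time_us, int64_t lease_us)
{
	int64_t diff_ms = (lease_us - send_time_us) / 1000;

	assert(lease_us >= send_time_us);
	return diff_ms / 3;
}

/*
 * Returns 1 if the primary's send time is further ahead of the standby's
 * receipt time than the skew allowance, mirroring the patch's comparison.
 */
int
primary_clock_too_far_ahead(int64_t send_time_us, int64_t receipt_time_us,
							int64_t max_clock_skew_ms)
{
	return send_time_us > receipt_time_us + max_clock_skew_ms * 1000;
}
```

With an illustrative lease time of 5000 ms (not necessarily the default), the lease ends 3750 ms after sendTime and the deduced max_clock_skew is 1250 ms. A message that appears to have been sent more than 1250 ms in the standby's future then causes the lease to be ignored.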
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 67b1a074cce..600f974668c 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
 #include "replication/walreceiver.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/guc.h"
 #include "utils/timestamp.h"
 
 WalRcvData *WalRcv = NULL;
@@ -376,3 +377,21 @@ GetReplicationTransferLatency(void)
 
 	return ms;
 }
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+	WalRcvData *walrcv = WalRcv;
+	TimestampTz now = GetCurrentTimestamp();
+	bool result;
+
+	SpinLockAcquire(&walrcv->mutex);
+	result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+	SpinLockRelease(&walrcv->mutex);
+
+	return result;
+}
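
A note on WalRcvSyncReplayAvailable() above. The check has two parts: a lease value of 0 means the standby holds no lease, and otherwise the lease is honoured up to and including its expiry time. A minimal restatement follows, with the hypothetical name lease_is_valid and microsecond timestamps assumed.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the standby-side lease test: 0 is the
 * "no lease" marker, and a lease remains valid while now <= expiry.
 */
int
lease_is_valid(int64_t now_us, int64_t lease_us)
{
	assert(now_us >= 0);
	return lease_us != 0 && now_us <= lease_us;
}
```

So a standby whose lease was replaced by 0, or whose lease time has passed, stops considering itself available for synchronous replay without needing further contact with the primary.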
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 46edb525e88..7f7520f6522 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -173,6 +173,18 @@ static TimestampTz last_reply_timestamp = 0;
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
 
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
 /*
  * While streaming WAL in Copy mode, streamingDoneSending is set to true
  * after we have sent CopyDone. We should not send any more CopyData messages
@@ -244,7 +256,7 @@ static void ProcessStandbyMessage(void);
 static void ProcessStandbyReplyMessage(void);
 static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(void);
 static void WalSndCheckTimeOut(void);
 static long WalSndComputeSleeptime(TimestampTz now);
@@ -287,6 +299,61 @@ InitWalSender(void)
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
 }
 
+/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * We've lost contact with the standby, but it may still be alive.  We
+		 * can't let any committing synchronous_replay transactions return
+		 * control until we've stalled for long enough for a zombie standby to
+		 * start raising errors because its lease has expired.  Because our
+		 * WalSnd slot is going away, we need to use the shared
+		 * WalSndCtl->revokingUntil variable.
+		 */
+		elog(LOG,
+			 "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+			 application_name);
+
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+									   synchronous_replay_last_lease);
+		LWLockRelease(SyncRepLock);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * The standby is shutting down, so it won't be running any more
+		 * transactions.  It is therefore safe to stop waiting for it without
+		 * any kind of lease revocation protocol.
+		 */
+		elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
 /*
  * Clean up after an error.
  *
@@ -315,7 +382,10 @@ WalSndErrorCleanup(void)
 	replication_active = false;
 
 	if (got_STOPPING || got_SIGUSR2)
+	{
+		PrepareUncleanExit();
 		proc_exit(0);
+	}
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
@@ -327,6 +397,8 @@ WalSndErrorCleanup(void)
 static void
 WalSndShutdown(void)
 {
+	PrepareUncleanExit();
+
 	/*
 	 * Reset whereToSendOutput to prevent ereport from attempting to send any
 	 * more messages to the standby.
@@ -1600,6 +1672,7 @@ ProcessRepliesIfAny(void)
 		if (r < 0)
 		{
 			/* unexpected error or EOF */
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1616,6 +1689,7 @@ ProcessRepliesIfAny(void)
 		resetStringInfo(&reply_message);
 		if (pq_getmessage(&reply_message, 0))
 		{
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1665,6 +1739,7 @@ ProcessRepliesIfAny(void)
 				 * 'X' means that the standby is closing down the socket.
 				 */
 			case 'X':
+				PrepareCleanExit();
 				proc_exit(0);
 
 			default:
@@ -1762,9 +1837,11 @@ ProcessStandbyReplyMessage(void)
 				flushLag,
 				applyLag;
 	bool		clearLagTimes;
+	int64		replyTo;
 	TimestampTz now;
 
 	static bool fullyAppliedLastTime = false;
+	static TimestampTz fullyAppliedSince = 0;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1772,6 +1849,7 @@ ProcessStandbyReplyMessage(void)
 	applyPtr = pq_getmsgint64(&reply_message);
 	(void) pq_getmsgint64(&reply_message);	/* sendTime; not used ATM */
 	replyRequested = pq_getmsgbyte(&reply_message);
+	replyTo = pq_getmsgint64(&reply_message);
 
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X%s",
 		 (uint32) (writePtr >> 32), (uint32) writePtr,
@@ -1786,17 +1864,17 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL for at least
+	 * wal_receiver_status_interval, then let's clear the lag times that were
+	 * measured when it last wrote/flushed/applied a WAL record.  This way we
+	 * avoid displaying stale lag data until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
 	if (applyPtr == sentPtr)
 	{
-		if (fullyAppliedLastTime)
+		if (!fullyAppliedLastTime)
+			fullyAppliedSince = now;
+		else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
 			clearLagTimes = true;
 		fullyAppliedLastTime = true;
 	}
@@ -1812,8 +1890,53 @@ ProcessStandbyReplyMessage(void)
 	 * standby.
 	 */
 	{
+		int			next_sr_state = -1;
 		WalSnd	   *walsnd = MyWalSnd;
 
+		/* Handle synchronous replay state machine. */
+		if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+		{
+			bool replay_lag_acceptable;
+
+			/* Check if the lag is acceptable (includes -1 for caught up). */
+			if (applyLag < (TimeOffset) synchronous_replay_max_lag * 1000)
+				replay_lag_acceptable = true;
+			else
+				replay_lag_acceptable = false;
+
+			/* Figure out whether the state needs to change. */
+			switch (walsnd->syncReplayState)
+			{
+			case SYNC_REPLAY_UNAVAILABLE:
+				/* Can we join? */
+				if (replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_JOINING;
+				break;
+			case SYNC_REPLAY_JOINING:
+				/* Are we still applying fast enough? */
+				if (replay_lag_acceptable)
+				{
+					/* Have we reached the join point yet? */
+					if (applyPtr >= synchronous_replay_joining_until)
+						next_sr_state = SYNC_REPLAY_AVAILABLE;
+				}
+				else
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			case SYNC_REPLAY_AVAILABLE:
+				/* Are we still applying fast enough? */
+				if (!replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_REVOKING;
+				break;
+			case SYNC_REPLAY_REVOKING:
+				/* Has the revocation been acknowledged or timed out? */
+				if (replyTo == synchronous_replay_revoke_msgno ||
+					now >= walsnd->revokingUntil)
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			}
+		}
+
 		SpinLockAcquire(&walsnd->mutex);
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
@@ -1824,11 +1947,55 @@ ProcessStandbyReplyMessage(void)
 			walsnd->flushLag = flushLag;
 		if (applyLag != -1 || clearLagTimes)
 			walsnd->applyLag = applyLag;
+		if (next_sr_state != -1)
+			walsnd->syncReplayState = next_sr_state;
+		if (next_sr_state == SYNC_REPLAY_REVOKING)
+			walsnd->revokingUntil = synchronous_replay_last_lease;
 		SpinLockRelease(&walsnd->mutex);
+
+		/*
+		 * Post shmem-update actions for synchronous replay state transitions.
+		 */
+		switch (next_sr_state)
+		{
+		case SYNC_REPLAY_JOINING:
+			/*
+			 * Now that we've started waiting for this standby, we need to
+			 * make sure that everything flushed before now has been applied
+			 * before we move to available and issue a lease.
+			 */
+			synchronous_replay_joining_until = GetFlushRecPtr();
+			ereport(LOG,
+					(errmsg("standby \"%s\" joining synchronous replay set...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_AVAILABLE:
+			/* Issue a new lease to the standby. */
+			WalSndKeepalive(false);
+			ereport(LOG,
+					(errmsg("standby \"%s\" is available for synchronous replay",
+							application_name)));
+			break;
+		case SYNC_REPLAY_REVOKING:
+			/* Revoke the standby's lease, and note the message number. */
+			synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+			ereport(LOG,
+					(errmsg("revoking synchronous replay lease for standby \"%s\"...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_UNAVAILABLE:
+			ereport(LOG,
+					(errmsg("standby \"%s\" is no longer available for synchronous replay",
+							application_name)));
+			break;
+		default:
+			/* No change. */
+			break;
+		}
 	}
 
 	if (!am_cascading_walsender)
-		SyncRepReleaseWaiters();
+		SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
 
 	/*
 	 * Advance our local xmin horizon when the client confirmed a flush.
@@ -2018,33 +2185,52 @@ ProcessStandbyHSFeedbackMessage(void)
  * If wal_sender_timeout is enabled we want to wake up in time to send
  * keepalives and to abort the connection if wal_sender_timeout has been
  * reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
  */
 static long
 WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;	/* 10 s */
 
-	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+	if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+		am_potential_synchronous_replay_standby)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
 		int			microsec_to_timeout;
 
-		/*
-		 * At the latest stop sleeping once wal_sender_timeout has been
-		 * reached.
-		 */
-		wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-												  wal_sender_timeout);
-
-		/*
-		 * If no ping has been sent yet, wakeup when it's time to do so.
-		 * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
-		 * the timeout passed without a response.
-		 */
-		if (!waiting_for_ping_response)
+		if (am_potential_synchronous_replay_standby)
+		{
+			/*
+			 * We need to keep replacing leases before they expire.  We'll do
+			 * that halfway through the lease time according to our clock, to
+			 * allow for the standby's clock to be ahead of the primary's by
+			 * 25% of synchronous_replay_lease_time.
+			 */
+			wakeup_time =
+				TimestampTzPlusMilliseconds(last_reply_timestamp,
+											synchronous_replay_lease_time / 2);
+		}
+		else
+		{
+			/*
+			 * At the latest stop sleeping once wal_sender_timeout has been
+			 * reached.
+			 */
 			wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-													  wal_sender_timeout / 2);
+													  wal_sender_timeout);
+
+			/*
+			 * If no ping has been sent yet, wakeup when it's time to do so.
+			 * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+			 * half of the timeout passed without a response.
+			 */
+			if (!waiting_for_ping_response)
+				wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+														  wal_sender_timeout / 2);
+		}
 
 		/* Compute relative time until wakeup. */
 		TimestampDifference(now, wakeup_time,
@@ -2068,20 +2254,33 @@ WalSndComputeSleeptime(TimestampTz now)
  * message every standby_message_timeout = wal_sender_timeout/6 = 10s.  We
  * could eliminate that problem by recognizing timeout expiration at
  * wal_sender_timeout/2 after the keepalive.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
  */
 static void
 WalSndCheckTimeOut(void)
 {
 	TimestampTz timeout;
+	int allowed_time;
 
 	/* don't bail out if we're doing something that doesn't require timeouts */
 	if (last_reply_timestamp <= 0)
 		return;
 
-	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
-										  wal_sender_timeout);
+	/*
+	 * If synchronous replay is configured, we use
+	 * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+	 * the time before an unresponsive synchronous replay standby is dropped.
+	 */
+	if (am_potential_synchronous_replay_standby)
+		allowed_time = synchronous_replay_lease_time;
+	else
+		allowed_time = wal_sender_timeout;
 
-	if (wal_sender_timeout > 0 && last_processing >= timeout)
+	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+										  allowed_time);
+	if (allowed_time > 0 && last_processing >= timeout)
 	{
 		/*
 		 * Since typically expiration of replication timeout means
@@ -2106,6 +2305,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
+	/* Check if we are managing a potential synchronous replay standby. */
+	am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
@@ -2264,6 +2466,7 @@ InitWalSenderSlot(void)
 			walsnd->flushLag = -1;
 			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
+			walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
 			walsnd->latch = &MyProc->procLatch;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
@@ -3160,6 +3363,27 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+	switch (state)
+	{
+	case SYNC_REPLAY_UNAVAILABLE:
+		return "unavailable";
+	case SYNC_REPLAY_JOINING:
+		return "joining";
+	case SYNC_REPLAY_AVAILABLE:
+		return "available";
+	case SYNC_REPLAY_REVOKING:
+		return "revoking";
+	}
+	return "UNKNOWN";
+}
+
 static Interval *
 offset_to_interval(TimeOffset offset)
 {
@@ -3179,7 +3403,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	11
+#define PG_STAT_GET_WAL_SENDERS_COLS	12
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3233,6 +3457,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			priority;
 		int			pid;
 		WalSndState state;
+		SyncReplayState syncReplayState;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3245,6 +3470,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		pid = walsnd->pid;
 		sentPtr = walsnd->sentPtr;
 		state = walsnd->state;
+		syncReplayState = walsnd->syncReplayState;
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
@@ -3328,6 +3554,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[10] = CStringGetTextDatum("potential");
+
+			values[11] =
+				CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3343,21 +3572,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
   * This function is used to send a keepalive message to standby.
   * If requestReply is set, sets a flag in the message requesting the standby
   * to send a message back to us, for heartbeat purposes.
+  * Return the serial number of the message that was sent.
   */
-static void
+static int64
 WalSndKeepalive(bool requestReply)
 {
+	TimestampTz synchronous_replay_lease;
+	TimestampTz now;
+
+	static int64 message_number = 0;
+
 	elog(DEBUG2, "sending replication keepalive");
 
+	/* Grant a synchronous replay lease if appropriate. */
+	now = GetCurrentTimestamp();
+	if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+	{
+		/* No lease granted, and any earlier lease is revoked. */
+		synchronous_replay_lease = 0;
+	}
+	else
+	{
+		/*
+		 * Since this timestamp is being sent to the standby where it will be
+		 * compared against a time generated by the standby's system clock, we
+		 * must consider clock skew.  We use 25% of the lease time as max
+		 * clock skew, and we subtract that from the time we send with the
+		 * following reasoning:
+		 *
+		 * 1.  If the standby's clock is slow (ie behind the primary's) by up
+		 * to that much, then subtracting this amount will make sure the
+		 * lease doesn't survive past that time according to the primary's
+		 * clock.
+		 *
+		 * 2.  If the standby's clock is fast (ie ahead of the primary's) by
+		 * up to that much, then by subtracting this amount there won't be any
+		 * gaps between leases, since leases are reissued every time 50% of
+		 * the lease time elapses (see WalSndKeepaliveIfNecessary and
+		 * WalSndComputeSleeptime).
+		 */
+		int max_clock_skew = synchronous_replay_lease_time / 4;
+
+		/* Compute and remember the expiry time of the lease we're granting. */
+		synchronous_replay_last_lease =
+			TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+		/* Adjust the version we send for clock skew. */
+		synchronous_replay_lease =
+			TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+										-max_clock_skew);
+	}
+
 	/* construct the message... */
 	resetStringInfo(&output_message);
 	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, ++message_number);
 	pq_sendint64(&output_message, sentPtr);
-	pq_sendint64(&output_message, GetCurrentTimestamp());
+	pq_sendint64(&output_message, now);
 	pq_sendbyte(&output_message, requestReply ? 1 : 0);
+	pq_sendint64(&output_message, synchronous_replay_lease);
 
 	/* ... and send it wrapped in CopyData */
 	pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+	return message_number;
 }
 
 /*
@@ -3372,19 +3649,30 @@ WalSndKeepaliveIfNecessary(void)
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
-	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
-		return;
-
-	if (waiting_for_ping_response)
-		return;
+	if (!am_potential_synchronous_replay_standby)
+	{
+		if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+			return;
+		if (waiting_for_ping_response)
+			return;
+	}
 
 	/*
 	 * If half of wal_sender_timeout has lapsed without receiving any reply
 	 * from the standby, send a keep-alive message to the standby requesting
 	 * an immediate reply.
+	 *
+	 * If synchronous replay has been configured, use
+	 * synchronous_replay_lease_time to control keepalive intervals rather
+	 * than wal_sender_timeout, so that we can keep replacing leases at the
+	 * right frequency.
 	 */
-	ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-											wal_sender_timeout / 2);
+	if (am_potential_synchronous_replay_standby)
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												synchronous_replay_lease_time / 2);
+	else
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												wal_sender_timeout / 2);
 	if (last_processing >= ping_time)
 	{
 		WalSndKeepalive(true);
@@ -3428,7 +3716,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 */
 	new_write_head = (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
 	buffer_full = false;
-	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
 	{
 		if (new_write_head == lag_tracker->read_heads[i])
 			buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 788f88129bd..bf96ebc825c 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -308,6 +308,7 @@ Section: Class 40 - Transaction Rollback
 40001    E    ERRCODE_T_R_SERIALIZATION_FAILURE                              serialization_failure
 40003    E    ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN                       statement_completion_unknown
 40P01    E    ERRCODE_T_R_DEADLOCK_DETECTED                                  deadlock_detected
+40P02    E    ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE                   synchronous_replay_not_available
 
 Section: Class 42 - Syntax Error or Access Rule Violation
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 03594e77fee..9ffdb1fd35e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1759,6 +1759,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+		 gettext_noop("Enables synchronous replay."),
+		 NULL
+		},
+		&synchronous_replay,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
@@ -3120,6 +3130,28 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_max_lag,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_lease_time,
+		5000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3973,6 +4005,17 @@ static struct config_string ConfigureNamesString[] =
 		check_synchronous_standby_names, assign_synchronous_standby_names, NULL
 	},
 
+	{
+		{"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("List of names of potential synchronous replay standbys."),
+			NULL,
+			GUC_LIST_INPUT
+		},
+		&synchronous_replay_standby_names,
+		"*",
+		check_synchronous_replay_standby_names, NULL, NULL
+	},
+
 	{
 		{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
 			gettext_noop("Sets default text search configuration."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1fa02d2c938..a9ca77d87de 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -296,6 +296,17 @@
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
+#synchronous_replay_max_lag = 0s	# maximum replication delay to tolerate from
+					# standbys before dropping them from the synchronous
+					# replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s		# how long individual leases granted to
+					# synchronous replay standbys should last; should be 4 times
+					# the max possible clock skew
+
+#synchronous_replay_standby_names = '*'	# standby servers that can join the
+					# synchronous replay set; '*' = all
+
 # - Standby Servers -
 
 # These settings are ignored on a master server.
@@ -334,6 +345,14 @@
 					# (change requires restart)
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
 
+# - All Servers -
+
+#synchronous_replay = off			# "on" in any pair of consecutive
+					# transactions guarantees that the second
+					# can see the first (even if the second
+					# is run on a standby), or will raise an
+					# error to report that the standby is
+					# unavailable for synchronous replay
 
 #------------------------------------------------------------------------------
 # QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29d..944cc7d4949 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
 #include "catalog/catalog.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -331,6 +333,17 @@ GetTransactionSnapshot(void)
 			elog(ERROR,
 				 "cannot take query snapshot during a parallel operation");
 
+		/*
+		 * In synchronous_replay mode on a standby, check if we have definitely
+		 * applied WAL for any COMMIT that returned successfully on the
+		 * primary.
+		 */
+		if (synchronous_replay && RecoveryInProgress() &&
+			!WalRcvSyncReplayAvailable())
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+					 errmsg("standby is not available for synchronous replay")));
+
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
 		 * end of xact regardless of what the caller does with it, so we must
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index a242e0be88b..2101243d155 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -118,7 +118,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	/*
@@ -151,6 +151,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	startpos = output_written_lsn;
 	last_written_lsn = output_written_lsn;
@@ -470,6 +472,8 @@ StreamLogicalLog(void)
 			 * rest.
 			 */
 			pos = 1;			/* skip msgtype 'k' */
+			pos += 8;			/* skip messageNumber */
+
 			walEnd = fe_recvint64(&copybuf[pos]);
 			output_written_lsn = Max(walEnd, output_written_lsn);
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 10768786301..a801224ad94 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -328,7 +328,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -346,6 +346,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
 	{
@@ -1016,6 +1018,7 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 	 * check if the server requested a reply, and ignore the rest.
 	 */
 	pos = 1;					/* skip msgtype 'k' */
+	pos += 8;					/* skip messageNumber */
 	pos += 8;					/* skip walEnd */
 	pos += 8;					/* skip sendTime */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 034a41eb556..ca260d158f5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5023,9 +5023,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,text}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d16b8b..70ed2ca9ef3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -833,7 +833,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_SYNC_REPLAY
 } WaitEventIPC;
 
 /* ----------
@@ -846,7 +847,8 @@ typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
 	WAIT_EVENT_PG_SLEEP,
-	WAIT_EVENT_RECOVERY_APPLY_DELAY
+	WAIT_EVENT_RECOVERY_APPLY_DELAY,
+	WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index bc43b4e1090..6a5bfcbb9ce 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
 
 #include "access/xlogdefs.h"
 #include "utils/guc.h"
+#include "utils/timestamp.h"
 
 #define SyncRepRequested() \
 	(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
 #define SYNC_REP_WAIT_WRITE		0
 #define SYNC_REP_WAIT_FLUSH		1
 #define SYNC_REP_WAIT_APPLY		2
+#define SYNC_REP_WAIT_SYNC_REPLAY	3
 
-#define NUM_SYNC_REP_WAIT_MODE	3
+#define NUM_SYNC_REP_WAIT_MODE	4
 
 /* syncRepState */
 #define SYNC_REP_NOT_WAITING		0
@@ -36,6 +38,12 @@
 #define SYNC_REP_PRIORITY		0
 #define SYNC_REP_QUORUM		1
 
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
 
 /* called by wal sender */
 extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
 /* GUC infrastructure */
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
 extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 5913b580c2b..58709e2e9be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * syncReplayLease is the time until which the primary has authorized this
+	 * standby to consider itself available for synchronous_replay mode, or 0
+	 * for not authorized.
+	 */
+	TimestampTz syncReplayLease;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -313,4 +320,6 @@ extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
 
+extern bool WalRcvSyncReplayAvailable(void);
+
 #endif							/* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 4b904779361..0909a64bdad 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
 	WALSNDSTATE_STOPPING
 } WalSndState;
 
+typedef enum SyncReplayState
+{
+	SYNC_REPLAY_UNAVAILABLE = 0,
+	SYNC_REPLAY_JOINING,
+	SYNC_REPLAY_AVAILABLE,
+	SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
 /*
  * Each walsender has a WalSnd struct in shared memory.
  *
@@ -60,6 +68,10 @@ typedef struct WalSnd
 	TimeOffset	flushLag;
 	TimeOffset	applyLag;
 
+	/* Synchronous replay state for this walsender. */
+	SyncReplayState syncReplayState;
+	TimestampTz revokingUntil;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
@@ -101,6 +113,14 @@ typedef struct
 	 */
 	bool		sync_standbys_defined;
 
+	/*
+	 * Until when must commits in synchronous replay stall?  This is used to
+	 * wait for synchronous replay leases to expire when a walsender exits
+	 * uncleanly, and we must stall synchronous replay commits until we're
+	 * sure that the remote server's lease has expired.
+	 */
+	TimestampTz	revokingUntil;
+
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 735dd37acff..bf32925b67f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1861,9 +1861,10 @@ pg_stat_replication| SELECT s.pid,
     w.flush_lag,
     w.replay_lag,
     w.sync_priority,
-    w.sync_state
+    w.sync_state,
+    w.sync_replay
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.19.1

#11Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#10)
Re: Synchronous replay take III

On Sat, Dec 01, 2018 at 02:48:29PM +1300, Thomas Munro wrote:

On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Unfortunately, cfbot says that the patch can't be applied without conflicts.
Could you please post a rebased version and address the comments from Masahiko?

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

Hm. This stuff runs under many assumptions, so I think that we should
be careful here with any changes as the very recent history has proved
(4c70336). If we were to switch WAL senders on postmaster death, I
think that this could be a change independent of what is proposed here.
--
Michael

#12Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Michael Paquier (#11)
Re: Synchronous replay take III

Hello.

It is a really nice feature. I am working on a project which reads heavily
from replicas (6 of them).

In our case we have implemented a kind of "replication barrier"
functionality based on a table with counters (one counter per application
backend in the simple case).
Each application backend has a dedicated connection to each replica, and it
selects its counter value a few times (2-100) per second from each replica in
a background process (depending on how often the replication barrier is used).

Once the application has committed a transaction, it may want to join the
replication barrier before returning new data to a user. So it increments the
counter in the table and waits until all replicas have replayed that value
according to the background monitoring process. Of course, timeouts, replica
health checks, a few optimizations and circuit breakers are used.
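
The barrier wait described above could be sketched roughly as follows. This
is only an illustration and not the actual implementation from that project:
the type ReplicaView, its fields and the function barrier_reached are
hypothetical names, the background sampling of each replica is assumed to
happen elsewhere, and skipping unhealthy replicas is an assumption about how
the circuit breaker behaves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-replica state maintained by the background sampler. */
typedef struct ReplicaView
{
	int64_t		seen_counter;	/* last counter value read from this replica */
	bool		healthy;		/* false once a health check or breaker trips */
} ReplicaView;

/*
 * Return true once every healthy replica has replayed at least 'target',
 * the counter value the writer stored in its committed transaction.
 * Replicas marked unhealthy are skipped (an assumption), so readers would
 * have to avoid them by some other means.
 */
static bool
barrier_reached(const ReplicaView *replicas, size_t nreplicas, int64_t target)
{
	for (size_t i = 0; i < nreplicas; i++)
	{
		if (replicas[i].healthy && replicas[i].seen_counter < target)
			return false;
	}
	return true;
}
```

A writer would call barrier_reached() in a loop with a timeout after
incrementing its counter, and return data to the user only once it is true.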

The nice thing here is that a constant number of connections is involved,
even if a lot of threads are joining the replication barrier at the same
moment, and even if some replicas are lagging.

With the implementation described in this thread, a 2-5 second lag on some
replica would lead to an out-of-connections issue within a few milliseconds.
I think that may be the weak part of the patch, at least for our case. But it
could possibly be used to eliminate the odd table with counters in my case (if
it is possible to change the setting per transaction).

Thanks a lot,
Michail.

#13Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Michail Nikolaev (#12)
2 attachment(s)
Re: Synchronous replay take III

Hello,

Here is a rebased patch, and separate replies to Michael and Michail.

On Sat, Dec 1, 2018 at 4:57 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sat, Dec 01, 2018 at 02:48:29PM +1300, Thomas Munro wrote:

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

Hm. This stuff runs under many assumptions, so I think that we should
be careful here with any changes as the very recent history has proved
(4c70336). If we were to switch WAL senders on postmaster death, I
think that this could be a change independent of what is proposed here.

Fair point. I think the effect should be the same with less code:
either way you see the server hang up without sending a COMMIT tag,
but maybe I'm missing something. Change reverted; let's discuss that
another time.

On Mon, Dec 3, 2018 at 9:01 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:

It is a really nice feature. I am working on a project which reads heavily from replicas (6 of them).

Thanks for your feedback.

In our case we have implemented a kind of "replication barrier" functionality based on a table with counters (one counter per application backend in the simple case).
Each application backend has a dedicated connection to each replica, and it selects its counter value a few times (2-100) per second from each replica in a background process (depending on how often the replication barrier is used).

Interesting approach. Why don't you sample pg_last_wal_replay_lsn()
on all the standbys instead, so you don't have to generate extra write
traffic?
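
A minimal sketch of that idea in client code: the writer remembers the LSN of
its commit, and a sampler compares it with the text returned by
pg_last_wal_replay_lsn() on each standby. The helper names parse_lsn and
standby_has_replayed are made up for illustration; the only fact relied on is
that an LSN is printed as two hexadecimal numbers separated by a slash (the
high and low 32 bits). A NULL result is treated as "not replayed".

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse the textual form "XXXXXXXX/XXXXXXXX" into a 64-bit LSN. */
static bool
parse_lsn(const char *text, uint64_t *lsn)
{
	unsigned int hi;
	unsigned int lo;

	if (text == NULL || sscanf(text, "%X/%X", &hi, &lo) != 2)
		return false;
	*lsn = ((uint64_t) hi << 32) | lo;
	return true;
}

/* True if the sampled replay position has reached the writer's commit LSN. */
static bool
standby_has_replayed(const char *replay_lsn_text, uint64_t commit_lsn)
{
	uint64_t	replayed;

	return parse_lsn(replay_lsn_text, &replayed) && replayed >= commit_lsn;
}
```

This avoids the counter table, but it still leaves open what a reader that is
connected directly to a lagging standby should be told.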

Once the application has committed a transaction, it may want to join the replication barrier before returning new data to a user. So it increments the counter in the table and waits until all replicas have replayed that value according to the background monitoring process. Of course, timeouts, replica health checks, a few optimizations and circuit breakers are used.

I'm interested in how you handle failure (taking too long to respond
or to see the new counter value, connectivity failure etc).
Specifically, if the writer decides to give up on a certain standby
(timeout, circuit breaker etc), how should a client that is connected
directly to that standby now or soon afterwards know that this standby
has been 'dropped' from the replication barrier and it's now at risk
of seeing stale data? My patch handles this by cancelling standbys'
leases explicitly and waiting for a response (if possible), but
otherwise waiting for them to expire (say if connectivity is lost or
standby has gone crazy or stopped responding), so that there is no
scenario where someone can successfully execute queries on a standby
that hasn't applied a transaction that you know to be committed on the
primary.
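
To make the two sides of that rule concrete, here is a toy model. It is a
sketch under simplifying assumptions, not code from the patch: timestamp_us
stands in for TimestampTz, the function names are hypothetical, and clock
skew between the servers is ignored entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Microseconds since some epoch; a stand-in for TimestampTz. */
typedef int64_t timestamp_us;

/*
 * Standby side: a lease of 0 means "not authorized"; otherwise the standby
 * may serve synchronous_replay snapshots until the lease time is reached.
 */
static bool
standby_lease_valid(timestamp_us lease, timestamp_us now)
{
	return lease != 0 && now < lease;
}

/*
 * Primary side: a commit waiting on a standby whose lease is being revoked
 * may return once the standby has acknowledged the revocation, or once the
 * most recently sent lease must have expired, whichever comes first.
 */
static bool
primary_can_release_commit(bool revoke_acked,
						   timestamp_us revoking_until,
						   timestamp_us now)
{
	return revoke_acked || now >= revoking_until;
}
```

Either way, by the time the commit returns the standby is already rejecting
synchronous_replay transactions, which is what closes the stale-read window.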

The nice thing here is that a constant number of connections is involved, even if a lot of threads are joining the replication barrier at the same moment, and even if some replicas are lagging.

With the implementation described in this thread, a 2-5 second lag on some replica would lead to an out-of-connections issue within a few milliseconds.

Right, if a standby is lagging more than the allowed amount, in my
patch the lease is cancelled and it will refuse to handle requests if
the GUC is on, with a special new error code, and then it's up to the
client to decide what to do. Probably find another node.
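
One possible client-side policy, sketched under assumptions: the SQLSTATE
40P02 is the new error code added by the patch, but the helper names and the
simple round-robin "avoid list" below are hypothetical and only illustrate
what a driver or middleware might do.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* SQLSTATE raised by a standby that holds no valid lease. */
#define SQLSTATE_SYNC_REPLAY_UNAVAILABLE "40P02"

/* True if the failed query should be retried on a different node. */
static bool
is_sync_replay_unavailable(const char *sqlstate)
{
	return sqlstate != NULL &&
		strcmp(sqlstate, SQLSTATE_SYNC_REPLAY_UNAVAILABLE) == 0;
}

/*
 * Pick the next node after 'last' (round-robin) that is not currently on the
 * avoid list.  Returns -1 if every node is being avoided, in which case the
 * caller could fall back to the primary.
 */
static int
next_candidate(const bool *avoid, int nnodes, int last)
{
	for (int step = 1; step <= nnodes; step++)
	{
		int			idx = (last + step) % nnodes;

		if (!avoid[idx])
			return idx;
	}
	return -1;
}
```

A node would be put on the avoid list when is_sync_replay_unavailable()
returns true for it, and taken off again after some back-off period.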

I think that may be the weak part of the patch, at least for our case.

Could you please elaborate? What could you do that would be better?
If the answer is that you just want to know that you might be seeing
stale data but for some reason you don't want to have to find a new
node, the reader is welcome to turn synchronous_replay off and try
again (giving up data freshness guarantees). Not sure when that would
be useful though.

But it could possibly be used to eliminate the odd table with counters in my case (if it is possible to change the setting per transaction).

Yes, the behaviour can be activated per transaction, using the usual
GUC scoping rules. The setting synchronous_replay must be on in both
the write transaction and the following read transaction for the logic
to work (ie for the writer to wait, and for the reader to make sure
that it has a valid lease or raise an error).

It sounds like my synchronous_replay GUC is quite similar to your
replication barrier system, except that it has a way to handle node
failure and excessive lag without abandoning the guarantee.

I've attached a small shell script that starts up a primary and N
replicas with synchronous_replay configured, in the hope of
encouraging you to try it out.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Synchronous-replay-mode-for-avoiding-stale-reads-v10.patch (application/octet-stream)
From 7e2e4befddc191b0186d7dc3257752ab21973d89 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Synchronous replay mode for avoiding stale reads on hot
 standbys.

While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability.  When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.

Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.

To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02.  It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.

Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.

Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
             Joel Jacobson, Heikki Linnakangas, Michael Paquier,
             Robert Haas, Ants Aasma, Masahiko Sawada
Discussion: https://postgr.es/m/CAEepm%3D0Q6kCKMYFBN%2BVv2frPc%3D3cS3T1MPOxnZ9do8%2BNHzoJTA%40mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  87 ++++
 doc/src/sgml/high-availability.sgml           | 139 ++++-
 doc/src/sgml/monitoring.sgml                  |  12 +
 src/backend/access/transam/xact.c             |   2 +-
 src/backend/catalog/system_views.sql          |   1 +
 src/backend/postmaster/pgstat.c               |   6 +
 src/backend/replication/logical/worker.c      |   2 +
 src/backend/replication/syncrep.c             | 491 +++++++++++++++---
 src/backend/replication/walreceiver.c         |  82 ++-
 src/backend/replication/walreceiverfuncs.c    |  19 +
 src/backend/replication/walsender.c           | 368 +++++++++++--
 src/backend/utils/errcodes.txt                |   1 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/backend/utils/misc/postgresql.conf.sample |  19 +
 src/backend/utils/time/snapmgr.c              |  13 +
 src/bin/pg_basebackup/pg_recvlogical.c        |   6 +-
 src/bin/pg_basebackup/receivelog.c            |   5 +-
 src/include/catalog/pg_proc.dat               |   6 +-
 src/include/pgstat.h                          |   6 +-
 src/include/replication/syncrep.h             |  16 +-
 src/include/replication/walreceiver.h         |   9 +
 src/include/replication/walsender_private.h   |  20 +
 src/test/regress/expected/rules.out           |   3 +-
 23 files changed, 1213 insertions(+), 143 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e94b305add0..43c54d85d3d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3432,6 +3432,36 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
      across the cluster without problems if that is required.
     </para>
 
+    <sect2 id="runtime-config-replication-all">
+     <title>All Servers</title>
+     <para>
+      These parameters can be set on the primary or any standby.
+     </para>
+     <variablelist>
+      <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+       <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables causal consistency between transactions run on different
+         servers.  A transaction that is run on a standby
+         with <varname>synchronous_replay</varname> set to <literal>on</literal> is
+         guaranteed either to see the effects of all completed transactions
+         run on the primary with the setting on, or to receive an error
+         "standby is not available for synchronous replay".  Note that both
+         transactions involved in a causal dependency (a write on the primary
+         followed by a read on any server which must see the write) must be
+         run with the setting on.  See <xref linkend="synchronous-replay"/> for
+         more details.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>     
+    </sect2>
+
     <sect2 id="runtime-config-replication-sender">
      <title>Sending Servers</title>
 
@@ -3750,6 +3780,63 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><varname>synchronous_replay_max_lag</varname>
+      (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_max_lag</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum replay lag the primary will tolerate from a
+        standby before dropping it from the synchronous replay set.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_lease_time</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the duration of 'leases' sent by the primary server to
+        standbys granting them the right to run synchronous replay queries for
+        a limited time.  This affects the rate at which replacement leases
+        must be sent and the wait time if contact is lost with a standby.
+        This must be set to a value which is at least 4 times the maximum
+        possible difference in system clocks between the primary and standby
+        servers, as described in <xref linkend="synchronous-replay"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous-replay-standby-names">
+      <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>synchronous_replay_standby_names</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies a comma-separated list of standby names that can support
+        <firstterm>synchronous replay</firstterm>, as described in
+        <xref linkend="synchronous-replay"/>.  Follows the same convention
+        as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</literal></link>.
+        The default is <literal>*</literal>, matching all standbys.
+       </para>
+       <para>
+        This setting has no effect if <varname>synchronous_replay_max_lag</varname>
+        is not set.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d8fd195da09..1bc4b35a028 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1154,11 +1154,12 @@ primary_slot_name = 'node_a_slot'
    </para>
 
    <para>
-    Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
-    cause each commit to wait until the current synchronous standbys report
-    that they have replayed the transaction, making it visible to user
-    queries.  In simple cases, this allows for load balancing with causal
-    consistency.
+    Setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> will cause each commit to wait until
+    the current synchronous standbys report that they have replayed the
+    transaction, making it visible to user queries.  In simple cases, this
+    allows for load balancing with causal consistency.  See also
+    <xref linkend="synchronous-replay"/>.
    </para>
 
    <para>
@@ -1356,6 +1357,122 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="synchronous-replay">
+   <title>Synchronous replay</title>
+   <indexterm>
+    <primary>synchronous replay</primary>
+    <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+    The synchronous replay feature allows read-only queries to run on hot
+    standby servers without exposing stale data to the client, providing a
+    form of causal consistency.  Transactions can run on any standby with the
+    following guarantee about the visibility of preceding transactions: If you
+    set <varname>synchronous_replay</varname> to <literal>on</literal> in any
+    pair of consecutive transactions tx1, tx2 where tx2 begins after tx1
+    successfully returns, then tx2 will either see tx1 or fail with a new
+    error "standby is not available for synchronous replay", no matter which
+    server it runs on.  Although the guarantee is expressed in terms of two
+    individual transactions, the GUC can also be set at session, role or
+    system level to make the guarantee generally, allowing for load balancing
+    of applications that were not designed with load balancing in mind.
+   </para>
+
+   <para>
+    In order to enable the
+    feature, <varname>synchronous_replay_max_lag</varname> must be set to a
+    non-zero value on the primary server.  The
+    GUC <varname>synchronous_replay_standby_names</varname> can be used to
+    limit the set of standbys that can join the dynamic set of synchronous
+    replay standbys by providing a comma-separated list of application names.
+    By default, all standbys are candidates, if the feature is enabled.
+   </para>
+
+   <para>
+    The current set of servers that the primary considers to be available for
+    synchronous replay can be seen in
+    the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</literal></link>
+    view.  Administrators, applications and load balancing middleware can use
+    this view to discover standbys that can currently handle synchronous
+    replay transactions without raising the error.  Since that information is
+    only an instantaneous snapshot, clients should still be prepared for
+    the error to be raised at any time, and consider redirecting transactions
+    to another standby.
+   </para>
+
+   <para>
+    The advantages of the synchronous replay feature over simply
+    setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> are:
+    <orderedlist>
+      <listitem>
+       <para>
+        It provides certainty about exactly which standbys can see a
+        transaction.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It places a configurable limit on how much replay lag (and therefore
+        delay at commit time) the primary tolerates from standbys before it
+        drops them from the dynamic set of standbys it waits for.
+       </para>   
+      </listitem>
+      <listitem>
+       <para>
+        It upholds the synchronous replay guarantee during the transitions that
+        occur when standbys are added to or removed from the set of standbys,
+        including scenarios where contact has been lost between the primary
+        and a standby but the standby is still alive and running client
+        queries.
+       </para>
+      </listitem>
+    </orderedlist>
+   </para>
+
+   <para>
+    The protocol used to uphold the guarantee even in the case of network
+    failure depends on the system clocks of the primary and standby servers
+    being synchronized, with an allowance for a difference up to one quarter
+    of <varname>synchronous_replay_lease_time</varname>.  For example,
+    if <varname>synchronous_replay_lease_time</varname> is set
+    to <literal>5s</literal>, then the clocks must not be more than 1.25
+    seconds apart for the guarantee to be upheld reliably during transitions.
+    The ubiquity of the Network Time Protocol (NTP) on modern operating
+    systems and availability of high quality time servers makes it possible to
+    choose a tolerance significantly higher than the maximum expected clock
+    difference.  An effort is nevertheless made to detect and report
+    misconfigured and faulty systems with clock differences greater than the
+    configured tolerance.
+   </para>
+
+   <note>
+    <para>
+     Current hardware clocks, NTP implementations and public time servers are
+     unlikely to allow the system clocks to differ more than tens or hundreds
+     of milliseconds, and systems synchronized with dedicated local time
+     servers may be considerably more accurate, but you should only consider
+     setting <varname>synchronous_replay_lease_time</varname> below the
+     default of 5 seconds (allowing up to 1.25 seconds of clock difference)
+     after researching your time synchronization infrastructure thoroughly.
+    </para>  
+   </note>
+
+   <note>
+    <para>
+      While similar to synchronous commit in the sense that both involve the
+      primary server waiting for responses from standby servers, the
+      synchronous replay feature is not concerned with avoiding data loss.  A
+      primary configured for synchronous replay will drop all standbys that
+      stop responding or replay too slowly from the dynamic set that it waits
+      for, so you should consider configuring both synchronous replication and
+      synchronous replay if you need data loss avoidance guarantees and causal
+      consistency guarantees for load balancing.
+    </para>
+   </note>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous archiving in standby</title>
 
@@ -1701,7 +1818,17 @@ if (!triggered)
     so there will be a measurable delay between primary and standby. Running the
     same query nearly simultaneously on both primary and standby might therefore
     return differing results. We say that data on the standby is
-    <firstterm>eventually consistent</firstterm> with the primary.  Once the
+    <firstterm>eventually consistent</firstterm> with the primary by default.
+    The data visible to a transaction running on a standby can be
+    made <firstterm>causally consistent</firstterm> with respect to a
+    transaction that has completed on the primary by
+    setting <varname>synchronous_replay</varname> to <literal>on</literal> in
+    both transactions.  For more details,
+    see <xref linkend="synchronous-replay"/>.
+   </para>
+
+   <para>
+    Once the    
     commit record for a transaction is replayed on the standby, the changes
     made by that transaction will be visible to any new snapshots taken on
     the standby.  Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 96bcc3a63be..3cbb8f559fa 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1920,6 +1920,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        </itemizedlist>
      </entry>
     </row>
+    <row>
+     <entry><structfield>sync_replay</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Synchronous replay state of this standby server.  This field will
+     be non-null only if <varname>synchronous_replay_max_lag</varname> is set.
+     If a standby is in <literal>available</literal> state, then it can
+     currently serve synchronous replay queries.  If it is not replaying fast
+     enough or not responding to keepalive messages, it will be
+     in <literal>unavailable</literal> state, and if it is currently
+     transitioning to availability it will be in <literal>joining</literal>
+     state for a short time.</entry>
+    </row>
     <row>
      <entry><structfield>reply_time</structfield></entry>
      <entry><type>timestamp with time zone</type></entry>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400384b..fa2b28634ba 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5297,7 +5297,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	 * Check if the caller would like to ask standbys for immediate feedback
 	 * once this commit is applied.
 	 */
-	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
 		xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
 
 	/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5253837b544..64e88ea7d5b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -735,6 +735,7 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
+            W.sync_replay,
             W.reply_time
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8676088e57d..ea6664cb925 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3682,6 +3682,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY:
+			event_name = "SyncReplay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
@@ -3710,6 +3713,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_APPLY_DELAY:
 			event_name = "RecoveryApplyDelay";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+			event_name = "SyncReplayLeaseRevoke";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8d5e0946c4b..3f7fb52f7de 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1201,6 +1201,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						TimestampTz timestamp;
 						bool		reply_requested;
 
+						(void) pq_getmsgint64(&s); /* skip messageNumber */
 						end_lsn = pq_getmsgint64(&s);
 						timestamp = pq_getmsgint64(&s);
 						reply_requested = pq_getmsgbyte(&s);
@@ -1404,6 +1405,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, writepos);	/* apply */
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
+	pq_sendint64(reply_message, -1);		/* replyTo */
 
 	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 5b8a268fa16..2392edf5eb5 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 					 XLogRecPtr *flushPtr,
@@ -128,6 +137,240 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Check if we can stop waiting for synchronous replay.  We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1.  All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2.  All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for.  The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+						  int *waitingFor,
+						  long *stallTimeMillis)
+{
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz stallTime = 0;
+	int i;
+
+	/* Count how many joining/available nodes we are waiting for. */
+	*waitingFor = 0;
+
+	for (i = 0; i < max_wal_senders; ++i)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		if (walsnd->pid != 0)
+		{
+			/*
+			 * We need to hold the spinlock to read LSNs, because we can't be
+			 * sure they can be read atomically.
+			 */
+			SpinLockAcquire(&walsnd->mutex);
+			if (walsnd->pid != 0)
+			{
+				switch (walsnd->syncReplayState)
+				{
+				case SYNC_REPLAY_UNAVAILABLE:
+					/* Nothing to wait for. */
+					break;
+				case SYNC_REPLAY_JOINING:
+				case SYNC_REPLAY_AVAILABLE:
+					/*
+					 * We have to wait until this standby tells us that it has
+					 * replayed the commit record.
+					 */
+					if (walsnd->apply < XactCommitLSN)
+						++*waitingFor;
+					break;
+				case SYNC_REPLAY_REVOKING:
+					/*
+					 * We have to hold up commits until this standby
+					 * acknowledges that its lease was revoked, or we know the
+					 * most recently sent lease has expired anyway, whichever
+					 * comes first.  One way or the other, we don't release
+					 * until this standby has started raising an error for
+					 * synchronous replay transactions.
+					 */
+					if (walsnd->revokingUntil > now)
+					{
+						++*waitingFor;
+						stallTime = Max(stallTime, walsnd->revokingUntil);
+					}
+					break;
+				}
+			}
+			SpinLockRelease(&walsnd->mutex);
+		}
+	}
+
+	/*
+	 * If a walsender has exited uncleanly, then it writes its revoking wait
+	 * time into a shared space before it gives up its WalSnd slot.  So we
+	 * have to wait for that too.
+	 */
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	if (WalSndCtl->revokingUntil > now)
+	{
+		long seconds;
+		int usecs;
+
+		/* Compute how long we have to wait, rounded up to nearest ms. */
+		TimestampDifference(now, WalSndCtl->revokingUntil,
+							&seconds, &usecs);
+		*stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+	}
+	else
+		*stallTimeMillis = 0;
+	LWLockRelease(SyncRepLock);
+
+	/* We are done if we are not waiting for any nodes or stalls. */
+	return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all "available" and "joining" standbys to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked.  By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+	long stallTimeMillis;
+	int waitingFor;
+	char *ps_display_buffer = NULL;
+
+	for (;;)
+	{
+		int			rc;
+
+		/* Reset latch before checking state. */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Join the queue to be woken up if any synchronous replay
+		 * joining/available standby applies XactCommitLSN or the set of
+		 * synchronous replay standbys changes (if we aren't already in the
+		 * queue).  We don't actually know if we need to wait for any peers to
+		 * reach the target LSN yet, but we have to register just in case
+		 * before checking the walsenders' state to avoid a race condition
+		 * that could occur if we did it after calling
+		 * SyncReplayCommitCanReturn.  (SyncRepWaitForLSN doesn't have
+		 * to do this because it can check the highest-seen LSN in
+		 * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+		 * lock as the queues.  We can't do that here, because there is no
+		 * single highest-seen LSN that is useful.  We must check
+		 * walsnd->apply for all relevant walsenders.  Therefore we must
+		 * register for notifications first, so that we can be notified via
+		 * our latch of any standby applying the LSN we're interested in after
+		 * we check but before we start waiting, or we could wait forever for
+		 * something that has already happened.)
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		if (MyProc->syncRepState != SYNC_REP_WAITING)
+		{
+			MyProc->waitLSN = XactCommitLSN;
+			MyProc->syncRepState = SYNC_REP_WAITING;
+			SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+			Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+		}
+		LWLockRelease(SyncRepLock);
+
+		/* Check if we're done. */
+		if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+									  &stallTimeMillis))
+		{
+			SyncRepCancelWait();
+			break;
+		}
+
+		Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+		/* If we aren't actually waiting for any standbys, leave the queue. */
+		if (waitingFor == 0)
+			SyncRepCancelWait();
+
+		/* Update the ps title. */
+		if (update_process_title)
+		{
+			char buffer[80];
+
+			/* Remember the old value if this is our first update. */
+			if (ps_display_buffer == NULL)
+			{
+				int len;
+				const char *ps_display = get_ps_display(&len);
+
+				ps_display_buffer = palloc(len + 1);
+				memcpy(ps_display_buffer, ps_display, len);
+				ps_display_buffer[len] = '\0';
+			}
+
+			snprintf(buffer, sizeof(buffer),
+					 "waiting for %d peer(s) to apply %X/%X%s",
+					 waitingFor,
+					 (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+					 stallTimeMillis > 0 ? " (revoking)" : "");
+			set_ps_display(buffer, false);
+		}
+
+		/* Check if we need to exit early due to postmaster death etc. */
+		if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+			break;
+
+		/*
+		 * If we are still waiting for peers, then we wait for any joining or
+		 * available peer to reach the LSN (or possibly stop being in one of
+		 * those states or go away).
+		 *
+		 * If not, there must be a non-zero stall time, so we wait for that to
+		 * elapse.
+		 */
+		if (waitingFor > 0)
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+						   WAIT_EVENT_SYNC_REPLAY);
+		else
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH |
+						   WL_TIMEOUT,
+						   stallTimeMillis,
+						   WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+
+		if (rc & WL_POSTMASTER_DEATH)
+		{
+			ProcDiePending = true;
+			whereToSendOutput = DestNone;
+			SyncRepCancelWait();
+			break;
+		}
+	}
+
+	/* There is no way out of the loop that could leave us in the queue. */
+	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	MyProc->waitLSN = 0;
+
+	/* Restore the ps display. */
+	if (ps_display_buffer != NULL)
+	{
+		set_ps_display(ps_display_buffer, false);
+		pfree(ps_display_buffer);
+	}
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -149,11 +392,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
-	/* Cap the level for anything other than commit to remote flush only. */
-	if (commit)
-		mode = SyncRepWaitMode;
-	else
-		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+	/* Wait for synchronous replay, if configured. */
+	if (synchronous_replay)
+		SyncReplayWaitForLSN(lsn);
 
 	/*
 	 * Fast exit if user has not requested sync replication.
@@ -167,6 +408,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
+	/* Cap the level for anything other than commit to remote flush only. */
+	if (commit)
+		mode = SyncRepWaitMode;
+	else
+		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 	 * set.  See SyncRepUpdateSyncStandbysDefined.
@@ -229,44 +476,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
 			break;
 
-		/*
-		 * If a wait for synchronous replication is pending, we can neither
-		 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
-		 * lead the client to believe that the transaction aborted, which is
-		 * not true: it's already committed locally. The former is no good
-		 * either: the client has requested synchronous replication, and is
-		 * entitled to assume that an acknowledged commit is also replicated,
-		 * which might not be true. So in this case we issue a WARNING (which
-		 * some clients may be able to interpret) and shut off further output.
-		 * We do NOT reset ProcDiePending, so that the process will die after
-		 * the commit is cleaned up.
-		 */
-		if (ProcDiePending)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
+		/* Check if we need to break early due to cancel/shutdown. */
+		if (SyncRepCheckForEarlyExit())
 			break;
-		}
-
-		/*
-		 * It's unclear what to do if a query cancel interrupt arrives.  We
-		 * can't actually abort at this point, but ignoring the interrupt
-		 * altogether is not helpful, so we just terminate the wait with a
-		 * suitable warning.
-		 */
-		if (QueryCancelPending)
-		{
-			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			SyncRepCancelWait();
-			break;
-		}
 
 		/*
 		 * Wait on latch.  Any condition that should wake us up will set the
@@ -401,15 +613,66 @@ SyncRepInitConfig(void)
 	}
 }
 
+/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+	char *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		found = false;
+
+	/* If the feature is disabled, then no. */
+	if (synchronous_replay_max_lag == 0)
+		return false;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(synchronous_replay_standby_names);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		list_free(elemlist);
+		/* GUC machinery will have already complained - no need to do again */
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *standby_name = (char *) lfirst(l);
+
+		if (pg_strcasecmp(standby_name, application_name) == 0 ||
+			pg_strcasecmp(standby_name, "*") == 0)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return found;
+}
+
 /*
  * Update the LSNs on each queue based upon our latest state. This
  * implements a simple policy of first-valid-sync-standby-releases-waiter.
  *
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
  * Other policies are possible, which would change what we do here and
  * perhaps also which information we store as well.
  */
 void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	XLogRecPtr	writePtr;
@@ -423,15 +686,17 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
-	 * potential sync standbys then we have nothing to do. If we are still
-	 * starting up, still running base backup or the current flush position is
-	 * still invalid, then leave quickly also.  Streaming or stopping WAL
-	 * senders are allowed to release waiters.
+	 * potential sync standbys and not in a state that synchronous_replay waits
+	 * for, then we have nothing to do. If we are still starting up, still
+	 * running base backup or the current flush position is still invalid,
+	 * then leave quickly also.
 	 */
-	if (MyWalSnd->sync_standby_priority == 0 ||
-		(MyWalSnd->state != WALSNDSTATE_STREAMING &&
-		 MyWalSnd->state != WALSNDSTATE_STOPPING) ||
-		XLogRecPtrIsInvalid(MyWalSnd->flush))
+	if (!am_syncreplay_blocker &&
+		(MyWalSnd->sync_standby_priority == 0 ||
+		 (MyWalSnd->state != WALSNDSTATE_STREAMING &&
+		  MyWalSnd->state != WALSNDSTATE_STOPPING) ||
+		 MyWalSnd->state < WALSNDSTATE_STREAMING ||
+		 XLogRecPtrIsInvalid(MyWalSnd->flush)))
 	{
 		announce_next_takeover = true;
 		return;
@@ -469,9 +734,10 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
+	 * managing a sync standby or a standby in synchronous replay state that
+	 * blocks then just leave.
 	 */
-	if (!got_recptr || !am_sync)
+	if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -480,24 +746,36 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
-	 * this location.
+	 * this location, for backends waiting for synchronous commit.
 	 */
-	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+	if (got_recptr && am_sync)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+			numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+			numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+			numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+		}
 	}
 
+	/*
+	 * Wake backends that are waiting for synchronous_replay, if this walsender
+	 * manages a standby that is in synchronous replay 'available' or 'joining'
+	 * state.
+	 */
+	if (am_syncreplay_blocker)
+		SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+						 MyWalSnd->apply);
+
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -997,9 +1275,8 @@ SyncRepGetStandbyPriority(void)
  * Must hold SyncRepLock.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
 {
-	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
 	PGPROC	   *thisproc = NULL;
 	int			numprocs = 0;
@@ -1016,7 +1293,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Assume the queue is ordered by LSN
 		 */
-		if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+		if (!all && lsn < proc->waitLSN)
 			return numprocs;
 
 		/*
@@ -1083,7 +1360,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
 		}
 
 		/*
@@ -1134,6 +1411,51 @@ SyncRepQueueIsOrderedByLSN(int mode)
 }
 #endif
 
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+	/*
+	 * If a wait for synchronous replication is pending, we can neither
+	 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
+	 * lead the client to believe that the transaction aborted, which is
+	 * not true: it's already committed locally. The former is no good
+	 * either: the client has requested synchronous replication, and is
+	 * entitled to assume that an acknowledged commit is also replicated,
+	 * which might not be true. So in this case we issue a WARNING (which
+	 * some clients may be able to interpret) and shut off further output.
+	 * We do NOT reset ProcDiePending, so that the process will die after
+	 * the commit is cleaned up.
+	 */
+	if (ProcDiePending)
+	{
+		ereport(WARNING,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * It's unclear what to do if a query cancel interrupt arrives.  We
+	 * can't actually abort at this point, but ignoring the interrupt
+	 * altogether is not helpful, so we just terminate the wait with a
+	 * suitable warning.
+	 */
+	if (QueryCancelPending)
+	{
+		QueryCancelPending = false;
+		ereport(WARNING,
+				(errmsg("canceling wait for synchronous replication due to user request"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		SyncRepCancelWait();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ===========================================================
  * Synchronous Replication functions executed by any process
@@ -1203,6 +1525,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
 	SyncRepConfig = (SyncRepConfigData *) extra;
 }
 
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return true;
+}
+
 void
 assign_synchronous_commit(int newval, void *extra)
 {
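
As a reading aid for the syncrep.c changes above, the per-standby release
rule in SyncReplayCommitCanReturn can be modelled outside the server.  This
is only a sketch under stated assumptions, not the patch's code: the
Standby struct, the ReplayState enum, count_blockers and commit_can_return
are hypothetical stand-ins, LSNs and timestamps are plain integers, locking
is left out, and the extra stall taken from the shared
WalSndCtl->revokingUntil is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's per-walsender replay state. */
typedef enum
{
	REPLAY_UNAVAILABLE,
	REPLAY_JOINING,
	REPLAY_AVAILABLE,
	REPLAY_REVOKING
} ReplayState;

/* Hypothetical, simplified view of one walsender slot. */
typedef struct
{
	ReplayState state;
	uint64_t	apply_lsn;		/* last LSN the standby reported as applied */
	int64_t		revoking_until; /* expiry time of the last lease sent */
} Standby;

/* Count the standbys that still hold up a commit at commit_lsn. */
int
count_blockers(const Standby *standbys, size_t n,
			   uint64_t commit_lsn, int64_t now)
{
	int			blockers = 0;

	assert(standbys != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		switch (standbys[i].state)
		{
			case REPLAY_UNAVAILABLE:
				/* Has no lease, so there is nothing to wait for. */
				break;
			case REPLAY_JOINING:
			case REPLAY_AVAILABLE:
				/* Blocks until it has replayed the commit record. */
				if (standbys[i].apply_lsn < commit_lsn)
					blockers++;
				break;
			case REPLAY_REVOKING:
				/* Blocks until its last lease has expired. */
				if (standbys[i].revoking_until > now)
					blockers++;
				break;
		}
	}
	return blockers;
}

/* The commit may return control once nothing blocks it. */
bool
commit_can_return(const Standby *standbys, size_t n,
				  uint64_t commit_lsn, int64_t now)
{
	return count_blockers(standbys, n, commit_lsn, now) == 0;
}
```

In this model a standby in the revoking state stops blocking exactly when
the clock reaches revoking_until, which mirrors the "revokingUntil > now"
test in the patch.  In the real function an acknowledged revocation moves
the walsender to the unavailable state, which this sketch treats as simply
another input state.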
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9643c2ed7b3..8934aee543c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -58,6 +58,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -140,9 +141,10 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
 static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+								  TimestampTz *syncReplayLease);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -476,7 +478,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, -1);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -521,7 +523,7 @@ WalReceiverMain(void)
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, -1);
 					}
 				}
 				if (rc & WL_TIMEOUT)
@@ -570,7 +572,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, -1);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -862,6 +864,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 	XLogRecPtr	walEnd;
 	TimestampTz sendTime;
 	bool		replyRequested;
+	TimestampTz syncReplayLease;
+	int64		messageNumber;
 
 	resetStringInfo(&incoming_message);
 
@@ -881,7 +885,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				dataStart = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, NULL);
 
 				buf += hdrlen;
 				len -= hdrlen;
@@ -891,7 +895,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 		case 'k':				/* Keepalive */
 			{
 				/* copy message to StringInfo */
-				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+					sizeof(char) + sizeof(int64);
 				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -899,15 +904,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
 
 				/* read the fields */
+				messageNumber = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
 				replyRequested = pq_getmsgbyte(&incoming_message);
+				syncReplayLease = pq_getmsgint64(&incoming_message);
 
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, messageNumber);
 				break;
 			}
 		default:
@@ -1070,7 +1077,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, -1);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1088,9 +1095,12 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbearts, when approaching
  * wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should include the message number, otherwise -1.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
@@ -1137,6 +1147,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+	pq_sendint64(&reply_message, replyTo);
 
 	/* Send it */
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1269,15 +1280,56 @@ XLogWalRcvSendHSFeedback(bool immed)
  * Update shared memory status upon receiving a message from primary.
  *
  * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary.  'syncReplayLease' is a pointer to the time
+ * until which the primary promises that this standby can safely claim to be
+ * causally consistent, to 0 if it cannot, or a NULL pointer for no change.
  */
 static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+					  TimestampTz *syncReplayLease)
 {
 	WalRcvData *walrcv = WalRcv;
 
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
 
+	/* Sanity check for the syncReplayLease time. */
+	if (syncReplayLease != NULL && *syncReplayLease != 0)
+	{
+		/*
+		 * Deduce max_clock_skew from the syncReplayLease and sendTime since
+		 * we don't have access to the primary's GUC.  The primary already
+		 * subtracted 25% from synchronous_replay_lease_time to represent
+		 * max_clock_skew, so we have 75%.  A third of that will give us 25%.
+		 */
+		int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+		int64 max_clock_skew = diffMillis / 3;
+		if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+												   max_clock_skew))
+		{
+			/*
+			 * The primary's clock is more than max_clock_skew + network
+			 * latency ahead of the standby's clock.  (If the primary's clock
+			 * is more than max_clock_skew ahead of the standby's clock, but
+			 * by less than the network latency, then there isn't much we can
+			 * do to detect that; but it still seems useful to have this basic
+			 * sanity check for wildly misconfigured servers.)
+			 */
+			ereport(LOG,
+					(errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+					 errhint("Check your servers' NTP configuration or equivalent.")));
+
+			syncReplayLease = NULL;
+		}
+		/*
+		 * We could also try to detect cases where sendTime is more than
+		 * max_clock_skew in the past according to the standby's clock, but
+		 * that is indistinguishable from network latency/buffering, so we
+		 * could produce misleading error messages; if we do nothing, the
+		 * consequence is 'standby is not available for synchronous replay'
+		 * errors which should cause the user to investigate.
+		 */
+	}
+
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
@@ -1285,6 +1337,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+	if (syncReplayLease != NULL)
+		walrcv->syncReplayLease = *syncReplayLease;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (log_min_messages <= DEBUG2)
@@ -1322,7 +1376,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
  */
 void
 WalRcvForceReply(void)
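
The clock-skew sanity check added to ProcessWalSndrMessage above relies on
a small piece of arithmetic that may be easier to follow in isolation.  The
sketch below is an illustration only: timestamp_us,
deduce_max_clock_skew_ms and primary_clock_too_far_ahead are hypothetical
names, and timestamps are plain microsecond counts.  It takes from the
patch's comment that the lease sent by the primary lies 75% of
synchronous_replay_lease_time after sendTime, so a third of that interval
recovers the 25% reserved for clock skew.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Microseconds, standing in for PostgreSQL's TimestampTz. */
typedef int64_t timestamp_us;

/*
 * The interval between send time and lease expiry is 75% of the lease
 * time; a third of it is the 25% reserved as maximum clock skew.
 */
int64_t
deduce_max_clock_skew_ms(timestamp_us lease, timestamp_us send_time)
{
	assert(lease >= send_time);
	return ((lease - send_time) / 1000) / 3;
}

/*
 * True if the primary's send time is further ahead of the standby's
 * receipt time than the deduced maximum clock skew allows.
 */
bool
primary_clock_too_far_ahead(timestamp_us lease, timestamp_us send_time,
							timestamp_us receipt_time)
{
	int64_t		skew_ms = deduce_max_clock_skew_ms(lease, send_time);

	return send_time > receipt_time + skew_ms * 1000;
}
```

For example, assuming a lease time of 4000 ms, the lease would arrive
3000 ms ahead of sendTime and the standby would deduce a maximum skew of
1000 ms.  A message stamped 5.0 s by the primary and received at 4.5 s on
the standby's clock passes the check, while one received at 3.9 s does not.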
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 67b1a074cce..600f974668c 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
 #include "replication/walreceiver.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/guc.h"
 #include "utils/timestamp.h"
 
 WalRcvData *WalRcv = NULL;
@@ -376,3 +377,21 @@ GetReplicationTransferLatency(void)
 
 	return ms;
 }
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+	WalRcvData *walrcv = WalRcv;
+	TimestampTz now = GetCurrentTimestamp();
+	bool result;
+
+	SpinLockAcquire(&walrcv->mutex);
+	result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+	SpinLockRelease(&walrcv->mutex);
+
+	return result;
+}
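
The lease test in WalRcvSyncReplayAvailable above has two edge cases worth
spelling out: a stored lease of 0 means "no lease", and a lease is still
honoured at the exact instant it expires.  A minimal sketch, with
hypothetical names (timestamp_us, lease_is_valid) and without the spinlock
that protects the shared value in the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Microseconds, standing in for PostgreSQL's TimestampTz. */
typedef int64_t timestamp_us;

/*
 * A lease of 0 is the "no lease" sentinel.  Any other value is an expiry
 * time, valid up to and including that instant.
 */
bool
lease_is_valid(timestamp_us lease, timestamp_us now)
{
	assert(lease >= 0);
	return lease != 0 && now <= lease;
}
```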
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d1a8113cb66..a10e320a5a2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -173,6 +173,18 @@ static TimestampTz last_reply_timestamp = 0;
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
 
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
 /*
  * While streaming WAL in Copy mode, streamingDoneSending is set to true
  * after we have sent CopyDone. We should not send any more CopyData messages
@@ -244,7 +256,7 @@ static void ProcessStandbyMessage(void);
 static void ProcessStandbyReplyMessage(void);
 static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(void);
 static void WalSndCheckTimeOut(void);
 static long WalSndComputeSleeptime(TimestampTz now);
@@ -287,6 +299,61 @@ InitWalSender(void)
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
 }
 
+/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * We've lost contact with the standby, but it may still be alive.  We
+		 * can't let any committing synchronous_replay transactions return
+		 * control until we've stalled for long enough for a zombie standby to
+		 * start raising errors because its lease has expired.  Because our
+		 * WalSnd slot is going away, we need to use the shared
+		 * WalSndCtl->revokingUntil variable.
+		 */
+		elog(LOG,
+			 "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+			 application_name);
+
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+									   synchronous_replay_last_lease);
+		LWLockRelease(SyncRepLock);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * The standby is shutting down, so it won't be running any more
+		 * transactions.  It is therefore safe to stop waiting for it without
+		 * any kind of lease revocation protocol.
+		 */
+		elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
 /*
  * Clean up after an error.
  *
@@ -315,7 +382,10 @@ WalSndErrorCleanup(void)
 	replication_active = false;
 
 	if (got_STOPPING || got_SIGUSR2)
+	{
+		PrepareUncleanExit();
 		proc_exit(0);
+	}
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
@@ -327,6 +397,8 @@ WalSndErrorCleanup(void)
 static void
 WalSndShutdown(void)
 {
+	PrepareUncleanExit();
+
 	/*
 	 * Reset whereToSendOutput to prevent ereport from attempting to send any
 	 * more messages to the standby.
@@ -1600,6 +1672,7 @@ ProcessRepliesIfAny(void)
 		if (r < 0)
 		{
 			/* unexpected error or EOF */
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1616,6 +1689,7 @@ ProcessRepliesIfAny(void)
 		resetStringInfo(&reply_message);
 		if (pq_getmessage(&reply_message, 0))
 		{
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1665,6 +1739,7 @@ ProcessRepliesIfAny(void)
 				 * 'X' means that the standby is closing down the socket.
 				 */
 			case 'X':
+				PrepareCleanExit();
 				proc_exit(0);
 
 			default:
@@ -1762,10 +1837,12 @@ ProcessStandbyReplyMessage(void)
 				flushLag,
 				applyLag;
 	bool		clearLagTimes;
+	int64		replyTo;
 	TimestampTz now;
 	TimestampTz replyTime;
 
 	static bool fullyAppliedLastTime = false;
+	static TimestampTz fullyAppliedSince = 0;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1773,6 +1850,7 @@ ProcessStandbyReplyMessage(void)
 	applyPtr = pq_getmsgint64(&reply_message);
 	replyTime = pq_getmsgint64(&reply_message);
 	replyRequested = pq_getmsgbyte(&reply_message);
+	replyTo = pq_getmsgint64(&reply_message);
 
 	if (log_min_messages <= DEBUG2)
 	{
@@ -1798,17 +1876,17 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL for at least
+	 * wal_receiver_status_interval, then let's clear the lag times that were
+	 * measured when it last wrote/flushed/applied a WAL record.  This way we
+	 * avoid displaying stale lag data until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
 	if (applyPtr == sentPtr)
 	{
-		if (fullyAppliedLastTime)
+		if (!fullyAppliedLastTime)
+			fullyAppliedSince = now;
+		else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
 			clearLagTimes = true;
 		fullyAppliedLastTime = true;
 	}
@@ -1824,8 +1902,53 @@ ProcessStandbyReplyMessage(void)
 	 * standby.
 	 */
 	{
+		int			next_sr_state = -1;
 		WalSnd	   *walsnd = MyWalSnd;
 
+		/* Handle synchronous replay state machine. */
+		if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+		{
+			bool replay_lag_acceptable;
+
+			/* Check if the lag is acceptable (includes -1 for caught up). */
+			if (applyLag < synchronous_replay_max_lag * 1000)
+				replay_lag_acceptable = true;
+			else
+				replay_lag_acceptable = false;
+
+			/* Figure out whether the state needs to change. */
+			switch (walsnd->syncReplayState)
+			{
+			case SYNC_REPLAY_UNAVAILABLE:
+				/* Can we join? */
+				if (replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_JOINING;
+				break;
+			case SYNC_REPLAY_JOINING:
+				/* Are we still applying fast enough? */
+				if (replay_lag_acceptable)
+				{
+					/* Have we reached the join point yet? */
+					if (applyPtr >= synchronous_replay_joining_until)
+						next_sr_state = SYNC_REPLAY_AVAILABLE;
+				}
+				else
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			case SYNC_REPLAY_AVAILABLE:
+				/* Are we still applying fast enough? */
+				if (!replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_REVOKING;
+				break;
+			case SYNC_REPLAY_REVOKING:
+				/* Has the revocation been acknowledged or timed out? */
+				if (replyTo == synchronous_replay_revoke_msgno ||
+					now >= walsnd->revokingUntil)
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			}
+		}
+
 		SpinLockAcquire(&walsnd->mutex);
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
@@ -1837,11 +1960,55 @@ ProcessStandbyReplyMessage(void)
 		if (applyLag != -1 || clearLagTimes)
 			walsnd->applyLag = applyLag;
 		walsnd->replyTime = replyTime;
+		if (next_sr_state != -1)
+			walsnd->syncReplayState = next_sr_state;
+		if (next_sr_state == SYNC_REPLAY_REVOKING)
+			walsnd->revokingUntil = synchronous_replay_last_lease;
 		SpinLockRelease(&walsnd->mutex);
+
+		/*
+		 * Post shmem-update actions for synchronous replay state transitions.
+		 */
+		switch (next_sr_state)
+		{
+		case SYNC_REPLAY_JOINING:
+			/*
+			 * Now that we've started waiting for this standby, we need to
+			 * make sure that everything flushed before now has been applied
+			 * before we move to available and issue a lease.
+			 */
+			synchronous_replay_joining_until = GetFlushRecPtr();
+			ereport(LOG,
+					(errmsg("standby \"%s\" joining synchronous replay set...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_AVAILABLE:
+			/* Issue a new lease to the standby. */
+			WalSndKeepalive(false);
+			ereport(LOG,
+					(errmsg("standby \"%s\" is available for synchronous replay",
+							application_name)));
+			break;
+		case SYNC_REPLAY_REVOKING:
+			/* Revoke the standby's lease, and note the message number. */
+			synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+			ereport(LOG,
+					(errmsg("revoking synchronous replay lease for standby \"%s\"...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_UNAVAILABLE:
+			ereport(LOG,
+					(errmsg("standby \"%s\" is no longer available for synchronous replay",
+							application_name)));
+			break;
+		default:
+			/* No change. */
+			break;
+		}
 	}
 
 	if (!am_cascading_walsender)
-		SyncRepReleaseWaiters();
+		SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
 
 	/*
 	 * Advance our local xmin horizon when the client confirmed a flush.
@@ -2055,33 +2222,52 @@ ProcessStandbyHSFeedbackMessage(void)
  * If wal_sender_timeout is enabled we want to wake up in time to send
  * keepalives and to abort the connection if wal_sender_timeout has been
  * reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
  */
 static long
 WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;	/* 10 s */
 
-	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+	if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+		am_potential_synchronous_replay_standby)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
 		int			microsec_to_timeout;
 
-		/*
-		 * At the latest stop sleeping once wal_sender_timeout has been
-		 * reached.
-		 */
-		wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-												  wal_sender_timeout);
-
-		/*
-		 * If no ping has been sent yet, wakeup when it's time to do so.
-		 * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
-		 * the timeout passed without a response.
-		 */
-		if (!waiting_for_ping_response)
+		if (am_potential_synchronous_replay_standby)
+		{
+			/*
+			 * We need to keep replacing leases before they expire.  We'll do
+			 * that halfway through the lease time according to our clock, to
+			 * allow for the standby's clock to be ahead of the primary's by
+			 * 25% of synchronous_replay_lease_time.
+			 */
+			wakeup_time =
+				TimestampTzPlusMilliseconds(last_reply_timestamp,
+											synchronous_replay_lease_time / 2);
+		}
+		else
+		{
+			/*
+			 * At the latest stop sleeping once wal_sender_timeout has been
+			 * reached.
+			 */
 			wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-													  wal_sender_timeout / 2);
+													  wal_sender_timeout);
+
+			/*
+			 * If no ping has been sent yet, wakeup when it's time to do so.
+			 * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+			 * half of the timeout passed without a response.
+			 */
+			if (!waiting_for_ping_response)
+				wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+														  wal_sender_timeout / 2);
+		}
 
 		/* Compute relative time until wakeup. */
 		TimestampDifference(now, wakeup_time,
@@ -2105,20 +2291,33 @@ WalSndComputeSleeptime(TimestampTz now)
  * message every standby_message_timeout = wal_sender_timeout/6 = 10s.  We
  * could eliminate that problem by recognizing timeout expiration at
  * wal_sender_timeout/2 after the keepalive.
+ *
+ * If synchronous replay is configured we override that so that unresponsive
+ * standbys are detected sooner.
  */
 static void
 WalSndCheckTimeOut(void)
 {
 	TimestampTz timeout;
+	int allowed_time;
 
 	/* don't bail out if we're doing something that doesn't require timeouts */
 	if (last_reply_timestamp <= 0)
 		return;
 
-	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
-										  wal_sender_timeout);
+	/*
+	 * If synchronous replay support is configured, we use
+	 * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+	 * the time before an unresponsive synchronous replay standby is dropped.
+	 */
+	if (am_potential_synchronous_replay_standby)
+		allowed_time = synchronous_replay_lease_time;
+	else
+		allowed_time = wal_sender_timeout;
 
-	if (wal_sender_timeout > 0 && last_processing >= timeout)
+	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+										  allowed_time);
+	if (allowed_time > 0 && last_processing >= timeout)
 	{
 		/*
 		 * Since typically expiration of replication timeout means
@@ -2143,6 +2342,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
+	/* Check if we are managing a potential synchronous replay standby. */
+	am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
@@ -2301,6 +2503,7 @@ InitWalSenderSlot(void)
 			walsnd->flushLag = -1;
 			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
+			walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
 			SpinLockRelease(&walsnd->mutex);
@@ -3198,6 +3401,27 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+	switch (state)
+	{
+	case SYNC_REPLAY_UNAVAILABLE:
+		return "unavailable";
+	case SYNC_REPLAY_JOINING:
+		return "joining";
+	case SYNC_REPLAY_AVAILABLE:
+		return "available";
+	case SYNC_REPLAY_REVOKING:
+		return "revoking";
+	}
+	return "UNKNOWN";
+}
+
 static Interval *
 offset_to_interval(TimeOffset offset)
 {
@@ -3217,7 +3441,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	13
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3272,6 +3496,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			pid;
 		WalSndState state;
 		TimestampTz replyTime;
+		SyncReplayState syncReplayState;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3284,6 +3509,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		pid = walsnd->pid;
 		sentPtr = walsnd->sentPtr;
 		state = walsnd->state;
+		syncReplayState = walsnd->syncReplayState;
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
@@ -3369,10 +3595,13 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			else
 				values[10] = CStringGetTextDatum("potential");
 
+			values[11] =
+				CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
+
 			if (replyTime == 0)
-				nulls[11] = true;
+				nulls[12] = true;
 			else
-				values[11] = TimestampTzGetDatum(replyTime);
+				values[12] = TimestampTzGetDatum(replyTime);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3388,21 +3617,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
   * This function is used to send a keepalive message to standby.
   * If requestReply is set, sets a flag in the message requesting the standby
   * to send a message back to us, for heartbeat purposes.
+  * Return the serial number of the message that was sent.
   */
-static void
+static int64
 WalSndKeepalive(bool requestReply)
 {
+	TimestampTz synchronous_replay_lease;
+	TimestampTz now;
+
+	static int64 message_number = 0;
+
 	elog(DEBUG2, "sending replication keepalive");
 
+	/* Grant a synchronous replay lease if appropriate. */
+	now = GetCurrentTimestamp();
+	if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+	{
+		/* No lease granted, and any earlier lease is revoked. */
+		synchronous_replay_lease = 0;
+	}
+	else
+	{
+		/*
+		 * Since this timestamp is being sent to the standby where it will be
+		 * compared against a time generated by the standby's system clock, we
+		 * must consider clock skew.  We use 25% of the lease time as max
+		 * clock skew, and we subtract that from the time we send with the
+		 * following reasoning:
+		 *
+		 * 1.  If the standby's clock is slow (ie behind the primary's) by up
+		 * to that much, then subtracting this amount makes sure that the
+		 * lease doesn't survive past that time according to the primary's
+		 * clock.
+		 *
+		 * 2.  If the standby's clock is fast (ie ahead of the primary's) by
+		 * up to that much, then by subtracting this amount there won't be any
+		 * gaps between leases, since leases are reissued every time 50% of
+		 * the lease time elapses (see WalSndKeepaliveIfNecessary and
+		 * WalSndComputeSleeptime).
+		 */
+		int max_clock_skew = synchronous_replay_lease_time / 4;
+
+		/* Compute and remember the expiry time of the lease we're granting. */
+		synchronous_replay_last_lease =
+			TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+		/* Adjust the version we send for clock skew. */
+		synchronous_replay_lease =
+			TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+										-max_clock_skew);
+	}
+
 	/* construct the message... */
 	resetStringInfo(&output_message);
 	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, ++message_number);
 	pq_sendint64(&output_message, sentPtr);
-	pq_sendint64(&output_message, GetCurrentTimestamp());
+	pq_sendint64(&output_message, now);
 	pq_sendbyte(&output_message, requestReply ? 1 : 0);
+	pq_sendint64(&output_message, synchronous_replay_lease);
 
 	/* ... and send it wrapped in CopyData */
 	pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+	return message_number;
 }
 
 /*
@@ -3417,19 +3694,30 @@ WalSndKeepaliveIfNecessary(void)
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
-	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
-		return;
-
-	if (waiting_for_ping_response)
-		return;
+	if (!am_potential_synchronous_replay_standby)
+	{
+		if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+			return;
+		if (waiting_for_ping_response)
+			return;
+	}
 
 	/*
 	 * If half of wal_sender_timeout has lapsed without receiving any reply
 	 * from the standby, send a keep-alive message to the standby requesting
 	 * an immediate reply.
+	 *
+	 * If synchronous replay has been configured, use
+	 * synchronous_replay_lease_time to control keepalive intervals rather
+	 * than wal_sender_timeout, so that we can keep replacing leases at the
+	 * right frequency.
 	 */
-	ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-											wal_sender_timeout / 2);
+	if (am_potential_synchronous_replay_standby)
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												synchronous_replay_lease_time / 2);
+	else
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												wal_sender_timeout / 2);
 	if (last_processing >= ping_time)
 	{
 		WalSndKeepalive(true);
@@ -3473,7 +3761,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 */
 	new_write_head = (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
 	buffer_full = false;
-	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
 	{
 		if (new_write_head == lag_tracker->read_heads[i])
 			buffer_full = true;
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 788f88129bd..bf96ebc825c 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -308,6 +308,7 @@ Section: Class 40 - Transaction Rollback
 40001    E    ERRCODE_T_R_SERIALIZATION_FAILURE                              serialization_failure
 40003    E    ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN                       statement_completion_unknown
 40P01    E    ERRCODE_T_R_DEADLOCK_DETECTED                                  deadlock_detected
+40P02    E    ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE                   synchronous_replay_not_available
 
 Section: Class 42 - Syntax Error or Access Rule Violation
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6fe19398812..11173f33594 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1759,6 +1759,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+		 gettext_noop("Enables synchronous replay."),
+		 NULL
+		},
+		&synchronous_replay,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
@@ -3120,6 +3130,28 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_max_lag,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_lease_time,
+		5000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -3973,6 +4005,17 @@ static struct config_string ConfigureNamesString[] =
 		check_synchronous_standby_names, assign_synchronous_standby_names, NULL
 	},
 
+	{
+		{"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("List of names of potential synchronous replay standbys."),
+			NULL,
+			GUC_LIST_INPUT
+		},
+		&synchronous_replay_standby_names,
+		"*",
+		check_synchronous_replay_standby_names, NULL, NULL
+	},
+
 	{
 		{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
 			gettext_noop("Sets default text search configuration."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1fa02d2c938..a9ca77d87de 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -296,6 +296,17 @@
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
+#synchronous_replay_max_lag = 0s	# maximum replication delay to tolerate from
+					# standbys before dropping them from the synchronous
+					# replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s		# how long individual leases granted to
+					# synchronous replay standbys should last; should be 4 times
+					# the max possible clock skew
+
+#synchronous_replay_standby_names = '*'	# standby servers that can join the
+					# synchronous replay set; '*' = all
+
 # - Standby Servers -
 
 # These settings are ignored on a master server.
@@ -334,6 +345,14 @@
 					# (change requires restart)
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
 
+# - All Servers -
+
+#synchronous_replay = off			# "on" in any pair of consecutive
+					# transactions guarantees that the second
+					# can see the first (even if the second
+					# is run on a standby), or will raise an
+					# error to report that the standby is
+					# unavailable for synchronous replay
 
 #------------------------------------------------------------------------------
 # QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29d..944cc7d4949 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -54,6 +54,8 @@
 #include "catalog/catalog.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -331,6 +333,17 @@ GetTransactionSnapshot(void)
 			elog(ERROR,
 				 "cannot take query snapshot during a parallel operation");
 
+		/*
+		 * In synchronous_replay mode on a standby, check if we have definitely
+		 * applied WAL for any COMMIT that returned successfully on the
+		 * primary.
+		 */
+		if (synchronous_replay && RecoveryInProgress() &&
+			!WalRcvSyncReplayAvailable())
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+					 errmsg("standby is not available for synchronous replay")));
+
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
 		 * end of xact regardless of what the caller does with it, so we must
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index a242e0be88b..2101243d155 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -118,7 +118,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	/*
@@ -151,6 +151,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	startpos = output_written_lsn;
 	last_written_lsn = output_written_lsn;
@@ -470,6 +472,8 @@ StreamLogicalLog(void)
 			 * rest.
 			 */
 			pos = 1;			/* skip msgtype 'k' */
+			pos += 8;			/* skip messageNumber */
+
 			walEnd = fe_recvint64(&copybuf[pos]);
 			output_written_lsn = Max(walEnd, output_written_lsn);
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 10768786301..a801224ad94 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -328,7 +328,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -346,6 +346,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
 	{
@@ -1016,6 +1018,7 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 	 * check if the server requested a reply, and ignore the rest.
 	 */
 	pos = 1;					/* skip msgtype 'k' */
+	pos += 8;					/* skip messageNumber */
 	pos += 8;					/* skip walEnd */
 	pos += 8;					/* skip sendTime */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index acb0154048a..fb4b16848bb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5070,9 +5070,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,text,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay,reply_time}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d16b8b..70ed2ca9ef3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -833,7 +833,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_SYNC_REPLAY
 } WaitEventIPC;
 
 /* ----------
@@ -846,7 +847,8 @@ typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
 	WAIT_EVENT_PG_SLEEP,
-	WAIT_EVENT_RECOVERY_APPLY_DELAY
+	WAIT_EVENT_RECOVERY_APPLY_DELAY,
+	WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index bc43b4e1090..6a5bfcbb9ce 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
 
 #include "access/xlogdefs.h"
 #include "utils/guc.h"
+#include "utils/timestamp.h"
 
 #define SyncRepRequested() \
 	(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
 #define SYNC_REP_WAIT_WRITE		0
 #define SYNC_REP_WAIT_FLUSH		1
 #define SYNC_REP_WAIT_APPLY		2
+#define SYNC_REP_WAIT_SYNC_REPLAY	3
 
-#define NUM_SYNC_REP_WAIT_MODE	3
+#define NUM_SYNC_REP_WAIT_MODE	4
 
 /* syncRepState */
 #define SYNC_REP_NOT_WAITING		0
@@ -36,6 +38,12 @@
 #define SYNC_REP_PRIORITY		0
 #define SYNC_REP_QUORUM		1
 
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
 
 /* called by wal sender */
 extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
 /* GUC infrastructure */
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
 extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 5913b580c2b..58709e2e9be 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * syncReplayLease is the time until which the primary has authorized this
+	 * standby to consider itself available for synchronous_replay mode, or 0
+	 * for not authorized.
+	 */
+	TimestampTz syncReplayLease;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -313,4 +320,6 @@ extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
 
+extern bool WalRcvSyncReplayAvailable(void);
+
 #endif							/* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 53314b1fae5..5875f288316 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
 	WALSNDSTATE_STOPPING
 } WalSndState;
 
+typedef enum SyncReplayState
+{
+	SYNC_REPLAY_UNAVAILABLE = 0,
+	SYNC_REPLAY_JOINING,
+	SYNC_REPLAY_AVAILABLE,
+	SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
 /*
  * Each walsender has a WalSnd struct in shared memory.
  *
@@ -60,6 +68,10 @@ typedef struct WalSnd
 	TimeOffset	flushLag;
 	TimeOffset	applyLag;
 
+	/* Synchronous replay state for this walsender. */
+	SyncReplayState syncReplayState;
+	TimestampTz revokingUntil;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
@@ -106,6 +118,14 @@ typedef struct
 	 */
 	bool		sync_standbys_defined;
 
+	/*
+	 * Until when must commits in synchronous replay stall?  This is used to
+	 * wait for synchronous replay leases to expire when a walsender exits
+	 * uncleanly, and we must stall synchronous replay commits until we're
+	 * sure that the remote server's lease has expired.
+	 */
+	TimestampTz	revokingUntil;
+
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e384cd22798..842047fdaa2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1862,9 +1862,10 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
+    w.sync_replay,
     w.reply_time
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, sslclientdn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay, reply_time) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.19.1
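
The lease arithmetic described in the WalSndKeepalive() and WalSndComputeSleeptime() comments in the patch above can be worked through with concrete numbers. What follows is a minimal standalone sketch, not code from the patch: the LeaseTimes struct and compute_lease() are hypothetical names, times are plain milliseconds rather than TimestampTz, and renewal is measured from the moment the lease is issued (the patch measures it from last_reply_timestamp). It assumes only what the patch comments state: the maximum tolerated clock skew is 25% of the lease time, and leases are replaced after 50% of it.

```c
#include <stdint.h>

/* Hypothetical illustration only; these names do not appear in the patch. */
typedef struct LeaseTimes
{
    int64_t expiry_on_primary;  /* when the primary treats the lease as over */
    int64_t expiry_sent;        /* skew-adjusted expiry sent to the standby */
    int64_t next_renewal;       /* when the primary sends a replacement */
} LeaseTimes;

static LeaseTimes
compute_lease(int64_t now_ms, int lease_time_ms)
{
    LeaseTimes t;
    int max_clock_skew = lease_time_ms / 4;     /* 25% of the lease time */

    t.expiry_on_primary = now_ms + lease_time_ms;
    t.expiry_sent = t.expiry_on_primary - max_clock_skew;
    t.next_renewal = now_ms + lease_time_ms / 2;
    return t;
}
```

With the default synchronous_replay_lease_time of 5000 ms and a lease issued at time 0, the primary remembers an expiry of 5000, sends 3750 to the standby, and renews at 2500. A standby whose clock runs 1250 ms fast sees the lease end at primary time 2500, just as the replacement is sent. One whose clock runs 1250 ms slow sees it end at primary time 5000, which is no later than the primary's own expiry.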

Attachment: test-synchronous-replay.sh (application/x-sh)
#14Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Thomas Munro (#10)
Re: Synchronous replay take III

On Sat, Dec 1, 2018 at 10:49 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Unfortunately, cfbot says that the patch can't be applied without conflicts. Could
you please post a rebased version and address the comments from Masahiko?

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

Thank you for working on this. The overview and your summary were
helpful for me to understand this feature, thank you. I've started to
review this patch for PostgreSQL 12. I've tested this patch and found
some issues, but let me ask you questions about the high-level design
first. Sorry if these have already been discussed.

Thanks for your interest in this work!

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison for existing is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

To be clear, what did you mean by read-mostly workloads?

I mean workloads where only a small percentage of transactions perform
a write. If you need write-scalability, then hot_standby is not the
solution for you (with or without this patch).

The kind of user who would be interested in this feature is someone
who already uses some kind of heuristics to move some queries to
read-only standbys. For example, some people send transactions for
logged-in users to the primary database (because only logged-in users
generate write queries), and all the rest to standby servers (for
example "public" users who can only read content). Another technique
I have seen is to keep user sessions "pinned" on the primary server
for N minutes after they perform a write transaction. These types of
load balancing policies are primitive ways of achieving
read-your-writes consistency, but they are conservative and
pessimistic: they probably send too many queries to the primary node.

This proposal is much more precise, allowing you to run the minimum
number of transactions on the primary node (ie transactions that
actually need to perform a write), and the maximum number of
transactions on the hot standbys.

As discussed, making reads wait for a token would be a useful
alternative (and I am willing to help make that work too), but:

1. For users that do many more reads than writes, would you
rather make (say) 80% of transactions slower or 20%? (Or 99% vs 1% as
the case may be, depending on your application.)

2. If you are also using synchronous_commit = on for increased
durability, then you are already making writers wait, and you might be
able to tolerate a small increase.
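
To put rough numbers on point 1, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that either scheme adds the same fixed wait to every transaction it affects; the function name and the 2 ms figure are made up, not measurements.

```python
def mean_added_latency_ms(fraction_affected: float, wait_ms: float) -> float:
    """Average extra latency per transaction when only a fraction of them wait."""
    return fraction_affected * wait_ms


WAIT_MS = 2.0  # hypothetical replay wait, assumed identical in both schemes

# Token-style scheme: the readers wait (80% of transactions in this example).
readers_wait = mean_added_latency_ms(0.80, WAIT_MS)
# Synchronous replay: the writers wait (the remaining 20%).
writers_wait = mean_added_latency_ms(0.20, WAIT_MS)

print(f"readers wait: {readers_wait:.2f} ms average per transaction")
print(f"writers wait: {writers_wait:.2f} ms average per transaction")
```

With a 99%/1% split the gap widens accordingly. Real waits will of course differ between the two schemes, so treat this only as a way of framing the question.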

Peter Eisentraut expressed an interesting point of view against this
general line of thinking:

/messages/by-id/5643933F.4010701@gmx.net

My questions are: Why do we have hot_standby mode? Is load balancing
a style of usage we want to support? Do we want a technology that
lets you do more of it?

I think there are two kinds of reads on standbys: a read that happens after
a write, and a direct read (e.g. reporting). The former usually
requires causal reads, as you mentioned, in order to read its own
writes, but the latter might be different: it often wants to read the
latest data on the master at that time. IIUC, even if we send a
read-only query directly to a synchronous replay server, we could get a
stale result if the standby is delayed by less than
synchronous_replay_max_lag. So this synchronous replay feature would
be helpful for the former case (i.e. a few writes and many reads that want
to see them), whereas for the latter case perhaps keeping the reads
waiting on the standby seems a reasonable solution.

I agree 100% that this is not a solution for all users. But I also
suspect a token system would be quite complicated, and can't be done
in a way that is transparent to applications without giving up
performance advantages. I wrote about my understanding of the
trade-offs here:

/messages/by-id/CAEepm=0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ@mail.gmail.com

Thank you for explaining. I understand the use cases of this feature
and of token-based causal reads.

Also, I think it's worth considering the cost of both causal reads *and*
non-causal reads.

I've considered a mixed workload (transactions requiring causal reads
and transactions not requiring them) on the current design. IIUC, the
current design creates something like a
consistent-reads group by specifying servers. For example, if a
transaction doesn't need causal reads it can send its query to any
server with synchronous_replay = off, but if it does, it should select
a synchronous replay server. It also means that client applications or
routing middleware such as pgpool are required to be aware of the
available synchronous replay standbys. That is, this design imposes a cost on
the read-only transactions requiring causal reads. On the other hand,
with token-based causal reads we can send a read-only query to any standby
if we can wait for the change to be replayed. Of course, if we don't want to
wait forever we can time out and switch to either another standby or
the master to execute the query, but we don't need to choose a particular
standby server.

Yeah. I think tools like pgpool that already know how to connect to
the primary and look at pg_stat_replication could use the new column
to learn which servers support synchronous replay, for routing
purposes. I also think that existing read/write load balancing tools
for Python (eg "django-balancer"), Ruby (eg "makara"), Java could be
adjusted to work with this quite easily.

Agreed.

In response to a general question from Simon Riggs at a conference
about how anyone is supposed to use this thing in real life, I wrote a
proof-of-concept Java Spring application that shows the techniques
that I think are required to make good use of it:

https://github.com/macdice/syncreplay-spring-demo

1. Use a transaction management library (this includes Python Django
transaction management, Ruby ActiveRecord IIUC, Java Spring
declarative transactions, ...), so that whole transactions can be
retried automatically. This is generally a good idea anyway because
it lets you retry automatically on serialisation failures and deadlock
errors. The new error 40P02
ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE is just another reason to
retry, in SQL error code class "40" (or perhaps it should be "72"... I
have joked that the new error could be called "snapshot too young"!)

2. Classify transactions (= blocks of code that run a transaction) as
read-write or read-only. This can be done adaptively by remembering
ERRCODE_READ_ONLY_SQL_TRANSACTION errors from previous attempts, or
explicitly using something like Java's @Transactional(readOnly=true)
annotations, so that the transaction management library can
automatically route transactions through the right connection.

3. Automatically avoid standby servers that have recently failed with
40P02 errors.

4. Somehow know which server is the primary (my Java POC doesn't
tackle that problem, but there are various techniques, such as trying
all of them if you start seeing ERRCODE_READ_ONLY_SQL_TRANSACTION from
the server that you expected to be a primary).

The basic idea is that with a little bit of help from your
language-specific transaction management infrastructure, your
application can be 100% unaware, and benefit from load balancing. The
point is that KeyValueController.java knows nothing about any of that
stuff, and all the rest is Spring configuration that allows
transactions to be routed to N database servers. It never shows you
stale data.
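
For anyone who would rather not read the Java, points 1 to 3 can be sketched in a few lines of Python. This is not code from the demo application above: `Router`, `DatabaseError` and the 5 second penalty are names and values made up for illustration, and the "servers" are just names passed to a callable rather than real connections.

```python
import time

SYNC_REPLAY_NOT_AVAILABLE = "40P02"  # the new SQLSTATE proposed by the patch


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver exception that carries an SQLSTATE."""

    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate


class Router:
    """Sends read-only transactions to standbys, skipping recent 40P02 offenders."""

    def __init__(self, primary, standbys, penalty_seconds=5.0, clock=time.monotonic):
        self.primary = primary
        self.standbys = list(standbys)
        self.penalty_seconds = penalty_seconds
        self.clock = clock
        self.avoid_until = {}  # standby name -> time before which we skip it
        self.next_index = 0

    def _candidates(self):
        now = self.clock()
        n = len(self.standbys)
        # Round-robin order, starting from a different standby on each call.
        ordered = [self.standbys[(self.next_index + i) % n] for i in range(n)]
        self.next_index = (self.next_index + 1) % n if n else 0
        return [s for s in ordered if self.avoid_until.get(s, float("-inf")) <= now]

    def run(self, transaction, read_only):
        """Run transaction(server), retrying the whole transaction on 40P02."""
        if not read_only:
            return transaction(self.primary)
        for standby in self._candidates():
            try:
                return transaction(standby)
            except DatabaseError as e:
                if e.sqlstate != SYNC_REPLAY_NOT_AVAILABLE:
                    raise
                # Point 3: avoid this standby for a while, then try the next one.
                self.avoid_until[standby] = self.clock() + self.penalty_seconds
        # Every standby refused; the primary can always serve the read.
        return transaction(self.primary)


def fake_transaction(server):
    """Pretend standby1 has lost its lease and raises the new error."""
    if server == "standby1":
        raise DatabaseError(SYNC_REPLAY_NOT_AVAILABLE)
    return f"ran on {server}"


router = Router("primary", ["standby1", "standby2"])
print(router.run(fake_transaction, read_only=True))
```

The retry loop lives in `Router.run`, so the transaction body itself stays unaware of routing, which is the property the Spring demo is meant to show. Point 4 (finding the primary) is left out here too.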

Thank you! I'll try it.

Regarding the current (v10 patch) design I have some questions and
comments.

The patch introduces a new GUC parameter, synchronous_replay. We can set
synchronous_commit = off while setting synchronous_replay = on. With
this setting, the backend will synchronously wait for standbys to
replay. I'm concerned that having two separate GUC parameters
controlling the transaction commit behaviour would confuse users. It's
just an idea, but maybe we could use 'remote_apply' for the synchronous replay
purpose and introduce a new parameter for standby servers, something like
allow_stale_read.

If, while a transaction is waiting for all standbys to replay, they
become unavailable, should the waiter be released? The patch
does not seem to release the waiter. Similarly, WAL senders are not aware
of postgresql.conf changes while waiting for synchronous replay. I think we
should call SyncReplayPotentialStandby() in SyncRepInitConfig().

With the setting synchronous_standby_names = '' and
synchronous_replay_standby_names = '*' we would get the standby's
status in pg_stat_replication, sync_state = 'async' and sync_replay =
'available'. It looks odd to me. Yes, this status is correct in
principle. But considering the architecture of PostgreSQL replication
this status is impossible.

The synchronous_replay_standby_names = '*' setting means that the
backend waits for all standbys connected to the master server to
replay, is that right? In my test, even when some of the synchronous
replay standby servers got stuck and therefore had their
lease revoked, the backend could proceed with transactions.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#15Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Masahiko Sawada (#14)
Re: Synchronous replay take III

On Tue, Jan 15, 2019 at 11:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Regarding the current (v10 patch) design I have some questions and
comments.

Hi Sawada-san,

Thanks for your testing and feedback.

The patch introduces a new GUC parameter, synchronous_replay. We can set
synchronous_commit = off while setting synchronous_replay = on. With
this setting, the backend will synchronously wait for standbys to
replay. I'm concerned that having two separate GUC parameters
controlling the transaction commit behaviour would confuse users. It's
just an idea, but maybe we could use 'remote_apply' for the synchronous replay
purpose and introduce a new parameter for standby servers, something like
allow_stale_read.

That is an interesting idea. That choice means that the new mode
always implies synchronous_commit = on (since remote_apply is a
"higher" level). I wanted them to be independent, so you could
express your durability requirement separately from your visibility
requirement.

Concretely, if none of your potential sync replay standbys are keeping
up and they are all dropped to "unavailable", then you'd be able to
see a difference: with your proposal we'd still have a synchronous
commit wait, but with mine that could independently be on or off.

Generally, I think we are too restrictive in our durability levels,
and there was some discussion about whether it's OK to have a strict
linear knob (which your idea extends):

/messages/by-id/CAEepm=3FFaanSS4sugG+Apzq2tCVjEYCO2wOQBod2d7GWb=DvA@mail.gmail.com

Hmm, perhaps your way would be better for now anyway, just because
it's simpler to understand and explain. Perhaps you wouldn't need a
separate "allow_stale_read" GUC, you could just set synchronous_commit
to a lower level when talking to the standby. (That is, give
synchronous_commit a meaning on standbys, whereas currently it has no
effect there.)

If, while a transaction is waiting for all standbys to replay, they
become unavailable, should the waiter be released? The patch
does not seem to release the waiter. Similarly, WAL senders are not aware
of postgresql.conf changes while waiting for synchronous replay. I think we
should call SyncReplayPotentialStandby() in SyncRepInitConfig().

Good point about the postgresql.conf change.

If all the standbys go to unavailable state, then a waiter should be
released once they have all either acknowledged that they are
unavailable (ie acknowledged that their lease has been revoked, via a
reply message with a serial number matching the revocation message),
or if that doesn't happen (due to lost network connection, crashed
process etc), once any leases that have been issued have expired
(ie a few seconds). Is that not what you see?

With the setting synchronous_standby_names = '' and
synchronous_replay_standby_names = '*' we would get the standby's
status in pg_stat_replication, sync_state = 'async' and sync_replay =
'available'. It looks odd to me. Yes, this status is correct in
principle. But considering the architecture of PostgreSQL replication
this status is impossible.

Yes, this is essentially the same thing that you were arguing against
above. Perhaps you are right, and there are no people who would want
synchronous replay, but not synchronous commit.

The synchronous_replay_standby_names = '*' setting means that the
backend waits for all standbys connected to the master server to
replay, is that right? In my test, even when some of the synchronous
replay standby servers got stuck and therefore had their
lease revoked, the backend could proceed with transactions.

It means that it waits for all standbys that are "available" to
replay. It doesn't wait for the "unavailable" ones. Most of the
patch deals with the transitions between those states. During an
available->revoking->unavailable transition, we also wait for the
standby to know that it is unavailable (so that it begins to raise
errors), and during an unavailable->joining->available transition we
also wait for the standby to replay the transition LSN (so that it
stops raising errors). That way clients on the standby can rely on
the error (or lack of error) to tell them whether their snapshot
definitely contains every commit that has returned control on the
primary.
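
As a rough illustration of those states, here is a simplified Python model. It is not taken from the patch: the `Standby` fields, the integer LSNs and the function names are invented for this sketch, and only the transitions named above are modelled (the patch may allow others).

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    UNAVAILABLE = "unavailable"
    JOINING = "joining"
    AVAILABLE = "available"
    REVOKING = "revoking"


# Only the transitions described above.
TRANSITIONS = {
    State.UNAVAILABLE: {State.JOINING},
    State.JOINING: {State.AVAILABLE},
    State.AVAILABLE: {State.REVOKING},
    State.REVOKING: {State.UNAVAILABLE},
}


@dataclass
class Standby:
    name: str
    state: State
    replay_lsn: int = 0             # how far this standby has replayed
    join_lsn: int = 0               # JOINING: transition LSN it must replay
    revocation_acked: bool = False  # REVOKING: it confirmed the revocation
    lease_expired: bool = False     # REVOKING: any lease it held has run out


def waited_for(standbys):
    """Names of the standbys whose replay a commit waits for."""
    return [s.name for s in standbys if s.state is State.AVAILABLE]


def next_state(standby):
    """Advance a transitional state once its completion condition is met."""
    if standby.state is State.REVOKING:
        # Done once the standby knows it is unavailable, or its lease expired.
        if standby.revocation_acked or standby.lease_expired:
            return State.UNAVAILABLE
    elif standby.state is State.JOINING:
        # Done once the standby has replayed the transition LSN.
        if standby.replay_lsn >= standby.join_lsn:
            return State.AVAILABLE
    return standby.state


standbys = [
    Standby("s1", State.AVAILABLE, replay_lsn=100),
    Standby("s2", State.REVOKING, replay_lsn=40),
    Standby("s3", State.JOINING, replay_lsn=90, join_lsn=95),
]
print(waited_for(standbys))

standbys[1].revocation_acked = True  # s2 acknowledges the revocation
standbys[2].replay_lsn = 95          # s3 replays the transition LSN
print([next_state(s).value for s in standbys])
```

The point of the two transitional states is that the primary never changes its mind about a standby without the standby either agreeing or being unable to hold a valid lease any more.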

--
Thomas Munro
http://www.enterprisedb.com

#16Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Thomas Munro (#15)
Re: Synchronous replay take III

Hello,

Sorry, I missed this email.

In our case we have implemented some kind of "replication barrier"
functionality based on a table with counters (one counter per application
backend in the simple case).

Each application backend has a dedicated connection to each replica, and
it selects its counter value a few times (2-100) per second from each replica
in a background process (depending on how often the replication barrier is used).

Interesting approach. Why don't you sample pg_last_wal_replay_lsn()
on all the standbys instead, so you don't have to generate extra write
traffic?

Replay LSN was the first approach I tried. I was sampling 'select
replay_lsn from pg_stat_replication' on the master to get info about the replay
position on replicas.
However, for some unknown reason I was not able to get it to work: even
after replay_lsn was reached, the standby was unable to see the data.
I know it should not happen. I spent a few days debugging... and since I
was required to ping replicas anyway (to check whether one is a master already,
monitor ping, locks, connections, etc.), I decided to introduce the table
for now.

Once an application has committed a transaction it may want to join the replication
barrier before returning new data to a user. So, it increments the counter in the
table and waits until all replicas have replayed that value according to the
background monitoring process. Of course, timeouts, replica health checks
and a few optimizations and circuit breakers are used.

I'm interested in how you handle failure (taking too long to respond
or to see the new counter value, connectivity failure etc).
Specifically, if the writer decides to give up on a certain standby
(timeout, circuit breaker etc), how should a client that is connected
directly to that standby now or soon afterwards know that this standby
has been 'dropped' from the replication barrier and it's now at risk
of seeing stale data?

Each standby has some health flags attached to it. Health is "red" when:
* can't connect to replica, or all connections are in use
* replica lag according to pg_last_xact_replay_timestamp is more than 3000ms
* replica lag according to pg_last_xact_replay_timestamp was more than
3000ms some time ago (10000ms)
* replica is new master now
* etc.

In the case of the replication barrier, we wait only for "green" replicas,
and for at most 5000ms. If we are still not able to see the new counter value on some
replicas, it is up to the client to decide how to handle it. In our case, it
means the replica is lagging by more than 3000ms, so it is "red" now and the next
client request will be dispatched to another "green" replica. This is done by a
special connection pool with a balancer inside.
I'm not sure it is all 100% correct, but we could just proceed in our case.
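
The rules above can be sketched roughly as follows. This is only one reading of the description, not the production code: the `Replica` fields, the function names and the fake clock are invented for illustration, and "some time ago (10000ms)" is taken to mean "within the last 10000ms".

```python
from dataclasses import dataclass
from typing import Optional

LAG_LIMIT_MS = 3000        # lag above this marks a replica red
RED_MEMORY_MS = 10000      # stay red this long after the last excessive lag
BARRIER_TIMEOUT_MS = 5000  # longest a request waits on the barrier


@dataclass
class Replica:
    name: str
    connectable: bool = True
    is_master: bool = False
    lag_ms: float = 0.0  # e.g. derived from pg_last_xact_replay_timestamp
    last_excess_lag_at_ms: Optional[float] = None
    seen_counter: int = 0  # last counter value read back from this replica


def is_green(replica, now_ms):
    """A replica is green only if none of the 'red' conditions hold."""
    if not replica.connectable or replica.is_master:
        return False
    if replica.lag_ms > LAG_LIMIT_MS:
        return False
    recent = replica.last_excess_lag_at_ms
    if recent is not None and now_ms - recent < RED_MEMORY_MS:
        return False
    return True


def barrier_passed(replicas, counter, now_ms):
    """True once every green replica has replayed the given counter value."""
    return all(r.seen_counter >= counter for r in replicas if is_green(r, now_ms))


def wait_for_barrier(replicas, counter, clock_ms, sleep_ms, poll_interval_ms=10):
    """Poll until the barrier passes or the timeout elapses; return success."""
    deadline = clock_ms() + BARRIER_TIMEOUT_MS
    while True:
        if barrier_passed(replicas, counter, clock_ms()):
            return True
        if clock_ms() >= deadline:
            return False
        sleep_ms(poll_interval_ms)


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, ms):
        self.now += ms


clock = FakeClock()
replicas = [
    Replica("r1", seen_counter=7),
    Replica("r2", lag_ms=4000, seen_counter=6),  # red, so not waited for
]
print(wait_for_barrier(replicas, 7, clock, clock.sleep))
```

Note that the waiting happens in the application, against values collected by the background sampler, so no master connection is held while a request waits.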

someone can successfully execute queries on a standby
that hasn't applied a transaction that you know to be committed on the
primary.

The nice thing here is that a constant number of connections is involved, even if a lot of
threads are joining the replication barrier at the moment, and even if some replicas
are lagging.

Because a 2-5 second lag of some replica will lead to an out-of-connections
issue in a few milliseconds in the case of the implementation described in this
thread.

Right, if a standby is lagging more than the allowed amount, in my
patch the lease is cancelled and it will refuse to handle requests if
the GUC is on, with a special new error code, and then it's up to the
client to decide what to do. Probably find another node.
In case of high loded

Could you please elaborate? What could you do that would be better?
If the answer is that you just want to know that you might be seeing
stale data but for some reason you don't want to have to find a new
node, the reader is welcome to turn synchronous_standby off and try
again (giving up data freshness guarantees). Not sure when that would
be useful though.

The main problem I see here is master connection usage. We have about 10,000
RPS on the master. So a small lag on some replica (we have six of them) will
leave all master connections waiting for replay on the stale replica until the
timeout. That is an outage for the whole system, even if the replica lagged for only
200-300ms (in the real world it could lag for seconds on a regular basis).

If we set synchronous_replay_max_lag to 10-20ms, standbys will be
cancelled all the time.
In our case, a constant number of connections is involved. In
addition, client requests wait for standby replay
inside an application backend thread without blocking a master connection. This
is the main difference, I think.

I've attached a small shell script that starts up a primary and N
replicas with synchronous_replay configured, in the hope of
encouraging you to try it out.

Thanks - will try and report.

#17Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#15)
Re: Synchronous replay take III

On Thu, Jan 24, 2019 at 6:42 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Yes, this is essentially the same thing that you were arguing against
above. Perhaps you are right, and there are no people who would want
synchronous replay, but not synchronous commit.

Maybe I'm misunderstanding the terminology here, but if not, I find
this theory wildly implausible. *Most* people want read-your-writes
behavior. *Few* people want to wait for a dead standby. The only
application of the latter is when even a tiny risk of transaction loss
is unacceptable, but the former has all kinds of clustering-related
uses.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#17)
Re: Synchronous replay take III

On Fri, Feb 01, 2019 at 09:34:49AM -0500, Robert Haas wrote:

Maybe I'm misunderstanding the terminology here, but if not, I find
this theory wildly implausible. *Most* people want read-your-writes
behavior. *Few* people want to wait for a dead standby. The only
application of the latter is when even a tiny risk of transaction loss
is unacceptable, but the former has all kinds of clustering-related
uses.

Last patch set fails to apply properly, so moved to next CF waiting on
author for a rebase.
--
Michael

#19Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Michael Paquier (#18)
1 attachment(s)
Re: Synchronous replay take III

On Mon, Feb 4, 2019 at 4:47 PM Michael Paquier <michael@paquier.xyz> wrote:

Last patch set fails to apply properly, so moved to next CF waiting on
author for a rebase.

Thanks. Rebased.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Synchronous-replay-mode-for-avoiding-stale-reads-v11.patch (application/octet-stream)
From d6f09acbf44c1cc90be6fcc80fec53be4d75c60e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Wed, 12 Apr 2017 11:02:36 +1200
Subject: [PATCH] Synchronous replay mode for avoiding stale reads on hot
 standbys.

While the existing synchronous replication support is mainly concerned with
increasing durability, synchronous replay is concerned with increasing
availability.  When two transactions tx1, tx2 are run with synchronous_replay
set to on and tx1 reports successful commit before tx2 begins, then tx2 is
guaranteed either to see tx1 or to raise a new error 40P02 if it is run on a
hot standby.

Compared to the remote_apply feature introduced by commit 314cbfc5,
synchronous replay allows for graceful failure, certainty about which
standbys can provide non-stale reads in multi-standby configurations and a
limit on how much standbys can slow the primary server down.

To make effective use of this feature, clients require some intelligence
to route read-only transactions and to avoid servers that have recently
raised error 40P02.  It is anticipated that application frameworks and
middleware will be able to provide such intelligence so that application code
can remain unaware of whether read transactions are run on different servers.

Heikki Linnakangas and Simon Riggs expressed the view that this approach is
inferior to one based on clients tracking commit LSNs and asking standby
servers to wait for replay, but other reviewers have expressed support for
both approaches being available to users.

Author: Thomas Munro
Reviewed-By: Dmitry Dolgov, Thom Brown, Amit Langote, Simon Riggs,
             Joel Jacobson, Heikki Linnakangas, Michael Paquier, Simon Riggs,
             Robert Haas, Ants Aasma, Masahiko Sawada
Discussion: https://postgr.es/m/CAEepm%3D0Q6kCKMYFBN%2BVv2frPc%3D3cS3T1MPOxnZ9do8%2BNHzoJTA%40mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=1iiEzCVLD=RoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  87 ++++
 doc/src/sgml/high-availability.sgml           | 139 ++++-
 doc/src/sgml/monitoring.sgml                  |  12 +
 src/backend/access/transam/xact.c             |   2 +-
 src/backend/catalog/system_views.sql          |   1 +
 src/backend/postmaster/pgstat.c               |   6 +
 src/backend/replication/logical/worker.c      |   2 +
 src/backend/replication/syncrep.c             | 491 +++++++++++++++---
 src/backend/replication/walreceiver.c         |  82 ++-
 src/backend/replication/walreceiverfuncs.c    |  19 +
 src/backend/replication/walsender.c           | 368 +++++++++++--
 src/backend/utils/errcodes.txt                |   1 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/backend/utils/misc/postgresql.conf.sample |  19 +
 src/backend/utils/time/snapmgr.c              |  13 +
 src/bin/pg_basebackup/pg_recvlogical.c        |   6 +-
 src/bin/pg_basebackup/receivelog.c            |   5 +-
 src/include/catalog/pg_proc.dat               |   6 +-
 src/include/pgstat.h                          |   6 +-
 src/include/replication/syncrep.h             |  16 +-
 src/include/replication/walreceiver.h         |   9 +
 src/include/replication/walsender_private.h   |  20 +
 src/test/regress/expected/rules.out           |   3 +-
 23 files changed, 1213 insertions(+), 143 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9b7a7388d5a..d2f8d6c5441 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3463,6 +3463,36 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
      across the cluster without problems if that is required.
     </para>
 
+    <sect2 id="runtime-config-replication-all">
+     <title>All Servers</title>
+     <para>
+      These parameters can be set on the primary or any standby.
+     </para>
+     <variablelist>
+      <varlistentry id="guc-synchronous-replay" xreflabel="synchronous_replay">
+       <term><varname>synchronous_replay</varname> (<type>boolean</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Enables causal consistency between transactions run on different
+         servers.  A transaction that is run on a standby
+         with <varname>synchronous_replay</varname> set to <literal>on</literal> is
+         guaranteed either to see the effects of all completed transactions
+         run on the primary with the setting on, or to receive an error
+         "standby is not available for synchronous replay".  Note that both
+         transactions involved in a causal dependency (a write on the primary
+         followed by a read on any server which must see the write) must be
+         run with the setting on.  See <xref linkend="synchronous-replay"/> for
+         more details.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>     
+    </sect2>
+
     <sect2 id="runtime-config-replication-sender">
      <title>Sending Servers</title>
 
@@ -3781,6 +3811,63 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><varname>synchronous_replay_max_lag</varname>
+      (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_max_lag</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum replay lag the primary will tolerate from a
+        standby before dropping it from the synchronous replay set.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><varname>synchronous_replay_lease_time</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>synchronous_replay_lease_time</varname> configuration
+        parameter</primary>
+       </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the duration of 'leases' sent by the primary server to
+        standbys granting them the right to run synchronous replay queries for
+        a limited time.  This affects the rate at which replacement leases
+        must be sent and the wait time if contact is lost with a standby.
+        This must be set to a value which is at least 4 times the maximum
+        possible difference in system clocks between the primary and standby
+        servers, as described in <xref linkend="synchronous-replay"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-synchronous-replay-standby-names" xreflabel="synchronous-replay-standby-names">
+      <term><varname>synchronous_replay_standby_names</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>synchronous_replay_standby_names</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies a comma-separated list of standby names that can support
+        <firstterm>synchronous replay</firstterm>, as described in
+        <xref linkend="synchronous-replay"/>.  Follows the same convention
+        as <link linkend="guc-synchronous-standby-names"><literal>synchronous_standby_names</literal></link>.
+        The default is <literal>*</literal>, matching all standbys.
+       </para>
+       <para>
+        This setting has no effect if <varname>synchronous_replay_max_lag</varname>
+        is not set.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index bbab7395a21..26145d99b0c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1154,11 +1154,12 @@ primary_slot_name = 'node_a_slot'
    </para>
 
    <para>
-    Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
-    cause each commit to wait until the current synchronous standbys report
-    that they have replayed the transaction, making it visible to user
-    queries.  In simple cases, this allows for load balancing with causal
-    consistency.
+    Setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> will cause each commit to wait until
+    the current synchronous standbys report that they have replayed the
+    transaction, making it visible to user queries.  In simple cases, this
+    allows for load balancing with causal consistency.  See also
+    <xref linkend="synchronous-replay"/>.
    </para>
 
    <para>
@@ -1356,6 +1357,122 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
    </sect3>
   </sect2>
 
+  <sect2 id="synchronous-replay">
+   <title>Synchronous replay</title>
+   <indexterm>
+    <primary>synchronous replay</primary>
+    <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+    The synchronous replay feature allows read-only queries to run on hot
+    standby servers without exposing stale data to the client, providing a
+    form of causal consistency.  Transactions can run on any standby with the
+    following guarantee about the visibility of preceding transactions: If you
+    set <varname>synchronous_replay</varname> to <literal>on</literal> in any
+    pair of consecutive transactions tx1, tx2 where tx2 begins after tx1
+    successfully returns, then tx2 will either see tx1 or fail with a new
+    error "standby is not available for synchronous replay", no matter which
+    server it runs on.  Although the guarantee is expressed in terms of two
+    individual transactions, the GUC can also be set at session, role or
+    system level to make the guarantee generally, allowing for load balancing
+    of applications that were not designed with load balancing in mind.
+   </para>
+
+   <para>
+    In order to enable the
+    feature, <varname>synchronous_replay_max_lag</varname> must be set to a
+    non-zero value on the primary server.  The
+    GUC <varname>synchronous_replay_standby_names</varname> can be used to
+    limit the set of standbys that can join the dynamic set of synchronous
+    replay standbys by providing a comma-separated list of application names.
+    By default, all standbys are candidates, if the feature is enabled.
+   </para>
+
+   <para>
+    The current set of servers that the primary considers to be available for
+    synchronous replay can be seen in
+    the <link linkend="monitoring-stats-views-table"> <literal>pg_stat_replication</literal></link>
+    view.  Administrators, applications and load balancing middleware can use
+    this view to discover standbys that can currently handle synchronous
+    replay transactions without raising the error.  Since that information is
+    only an instantaneous snapshot, clients should still be prepared for
+    the error to be raised at any time, and consider redirecting transactions
+    to another standby.
+   </para>
+
+   <para>
+    The advantages of the synchronous replay feature over simply
+    setting <varname>synchronous_commit</varname>
+    to <literal>remote_apply</literal> are:
+    <orderedlist>
+      <listitem>
+       <para>
+        It provides certainty about exactly which standbys can see a
+        transaction.
+       </para>
+      </listitem>
+      <listitem>
+       <para>
+        It places a configurable limit on how much replay lag (and therefore
+        delay at commit time) the primary tolerates from standbys before it
+        drops them from the dynamic set of standbys it waits for.
+       </para>   
+      </listitem>
+      <listitem>
+       <para>
+        It upholds the synchronous replay guarantee during the transitions that
+        occur when new standbys are added or removed from the set of standbys,
+        including scenarios where contact has been lost between the primary
+        and standbys but the standby is still alive and running client
+        queries.
+       </para>
+      </listitem>
+    </orderedlist>
+   </para>
+
+   <para>
+    The protocol used to uphold the guarantee even in the case of network
+    failure depends on the system clocks of the primary and standby servers
+    being synchronized, with an allowance for a difference up to one quarter
+    of <varname>synchronous_replay_lease_time</varname>.  For example,
+    if <varname>synchronous_replay_lease_time</varname> is set
+    to <literal>5s</literal>, then the clocks must not be more than 1.25
+    second apart for the guarantee to be upheld reliably during transitions.
+    The ubiquity of the Network Time Protocol (NTP) on modern operating
+    systems and availability of high quality time servers makes it possible to
+    choose a tolerance significantly higher than the maximum expected clock
+    difference.  An effort is nevertheless made to detect and report
+    misconfigured and faulty systems with clock differences greater than the
+    configured tolerance.
+   </para>
+
+   <note>
+    <para>
+     Current hardware clocks, NTP implementations and public time servers are
+     unlikely to allow the system clocks to differ by more than tens or hundreds
+     of milliseconds, and systems synchronized with dedicated local time
+     servers may be considerably more accurate, but you should only consider
+     setting <varname>synchronous_replay_lease_time</varname> below the
+     default of 5 seconds (allowing up to 1.25 seconds of clock difference)
+     after researching your time synchronization infrastructure thoroughly.
+    </para>
+   </note>
+
+   <note>
+    <para>
+      While similar to synchronous commit in the sense that both involve the
+      primary server waiting for responses from standby servers, the
+      synchronous replay feature is not concerned with avoiding data loss.  A
+      primary configured for synchronous replay will drop all standbys that
+      stop responding or replay too slowly from the dynamic set that it waits
+      for, so you should consider configuring both synchronous replication and
+      synchronous replay if you need data loss avoidance guarantees and causal
+      consistency guarantees for load balancing.
+    </para>
+   </note>
+  </sect2>
+
   <sect2 id="continuous-archiving-in-standby">
    <title>Continuous archiving in standby</title>
 
@@ -1701,7 +1818,17 @@ if (!triggered)
     so there will be a measurable delay between primary and standby. Running the
     same query nearly simultaneously on both primary and standby might therefore
     return differing results. We say that data on the standby is
-    <firstterm>eventually consistent</firstterm> with the primary.  Once the
+    <firstterm>eventually consistent</firstterm> with the primary by default.
+    The data visible to a transaction running on a standby can be
+    made <firstterm>causally consistent</firstterm> with respect to a
+    transaction that has completed on the primary by
+    setting <varname>synchronous_replay</varname> to <literal>on</literal> in
+    both transactions.  For more details,
+    see <xref linkend="synchronous-replay"/>.
+   </para>
+
+   <para>
+    Once the
     commit record for a transaction is replayed on the standby, the changes
     made by that transaction will be visible to any new snapshots taken on
     the standby.  Snapshots may be taken at the start of each query or at the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 7a84f513404..0b82caa1b5d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1916,6 +1916,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        </itemizedlist>
      </entry>
     </row>
+    <row>
+     <entry><structfield>sync_replay</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Synchronous replay state of this standby server.  This field will
+     be non-null only if <varname>synchronous_replay_max_lag</varname> is set.
+     If a standby is in <literal>available</literal> state, then it can
+     currently serve synchronous replay queries.  If it is not replaying fast
+     enough or not responding to keepalive messages, it will be
+     in <literal>unavailable</literal> state, and if it is currently
+     transitioning to availability it will be in <literal>joining</literal>
+     state for a short time.</entry>
+    </row>
     <row>
      <entry><structfield>reply_time</structfield></entry>
      <entry><type>timestamp with time zone</type></entry>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda878043..00b9b2cce41 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5316,7 +5316,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	 * Check if the caller would like to ask standbys for immediate feedback
 	 * once this commit is applied.
 	 */
-	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY)
+	if (synchronous_commit >= SYNCHRONOUS_COMMIT_REMOTE_APPLY || synchronous_replay)
 		xl_xinfo.xinfo |= XACT_COMPLETION_APPLY_FEEDBACK;
 
 	/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c4..802de0e951c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -735,6 +735,7 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
+            W.sync_replay,
             W.reply_time
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c64992518..8a845694a90 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY:
+			event_name = "SyncReplay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
@@ -3711,6 +3714,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_RECOVERY_APPLY_DELAY:
 			event_name = "RecoveryApplyDelay";
 			break;
+		case WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE:
+			event_name = "SyncReplayLeaseRevoke";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f9516515bc4..cd29161dc21 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1186,6 +1186,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 						TimestampTz timestamp;
 						bool		reply_requested;
 
+						(void) pq_getmsgint64(&s); /* skip messageNumber */
 						end_lsn = pq_getmsgint64(&s);
 						timestamp = pq_getmsgint64(&s);
 						reply_requested = pq_getmsgbyte(&s);
@@ -1389,6 +1390,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, writepos);	/* apply */
 	pq_sendint64(reply_message, now);	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
+	pq_sendint64(reply_message, -1);		/* replyTo */
 
 	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
 		 force,
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 6c160c13c6f..8a753af0f83 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,13 @@
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/ps_status.h"
+#include "utils/varlena.h"
+
+/* GUC variables */
+int synchronous_replay_max_lag;
+int synchronous_replay_lease_time;
+bool synchronous_replay;
+char *synchronous_replay_standby_names;
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
@@ -99,7 +106,9 @@ static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
-static int	SyncRepWakeQueue(bool all, int mode);
+static int	SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn);
+
+static bool SyncRepCheckForEarlyExit(void);
 
 static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 					 XLogRecPtr *flushPtr,
@@ -128,6 +137,240 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
  * ===========================================================
  */
 
+/*
+ * Check if we can stop waiting for synchronous replay.  We can stop waiting
+ * when the following conditions are met:
+ *
+ * 1.  All walsenders currently in 'joining' or 'available' state have
+ * applied the target LSN.
+ *
+ * 2.  All revoked leases have been acknowledged by the relevant standby or
+ * expired, so we know that the standby has started rejecting synchronous
+ * replay transactions.
+ *
+ * The output parameter 'waitingFor' is set to the number of nodes we are
+ * currently waiting for.  The output parameter 'stallTimeMillis' is set to
+ * the number of milliseconds we need to wait for because a lease has been
+ * revoked.
+ *
+ * Returns true if commit can return control, because every standby has either
+ * applied the LSN or started rejecting synchronous replay transactions.
+ */
+static bool
+SyncReplayCommitCanReturn(XLogRecPtr XactCommitLSN,
+						  int *waitingFor,
+						  long *stallTimeMillis)
+{
+	TimestampTz now = GetCurrentTimestamp();
+	TimestampTz stallTime = 0;
+	int i;
+
+	/* Count how many joining/available nodes we are waiting for. */
+	*waitingFor = 0;
+
+	for (i = 0; i < max_wal_senders; ++i)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		if (walsnd->pid != 0)
+		{
+			/*
+			 * We need to hold the spinlock to read LSNs, because we can't be
+			 * sure they can be read atomically.
+			 */
+			SpinLockAcquire(&walsnd->mutex);
+			if (walsnd->pid != 0)
+			{
+				switch (walsnd->syncReplayState)
+				{
+				case SYNC_REPLAY_UNAVAILABLE:
+					/* Nothing to wait for. */
+					break;
+				case SYNC_REPLAY_JOINING:
+				case SYNC_REPLAY_AVAILABLE:
+					/*
+					 * We have to wait until this standby tells us that it has
+					 * replayed the commit record.
+					 */
+					if (walsnd->apply < XactCommitLSN)
+						++*waitingFor;
+					break;
+				case SYNC_REPLAY_REVOKING:
+					/*
+					 * We have to hold up commits until this standby
+					 * acknowledges that its lease was revoked, or we know the
+					 * most recently sent lease has expired anyway, whichever
+					 * comes first.  One way or the other, we don't release
+					 * until this standby has started raising an error for
+					 * synchronous replay transactions.
+					 */
+					if (walsnd->revokingUntil > now)
+					{
+						++*waitingFor;
+						stallTime = Max(stallTime, walsnd->revokingUntil);
+					}
+					break;
+				}
+			}
+			SpinLockRelease(&walsnd->mutex);
+		}
+	}
+
+	/*
+	 * If a walsender has exited uncleanly, then it writes its revoking wait
+	 * time into a shared space before it gives up its WalSnd slot.  So we
+	 * have to wait for that too.
+	 */
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	if (WalSndCtl->revokingUntil > now)
+	{
+		long seconds;
+		int usecs;
+
+		/* Compute how long we have to wait, rounded up to nearest ms. */
+		TimestampDifference(now, WalSndCtl->revokingUntil,
+							&seconds, &usecs);
+		*stallTimeMillis = seconds * 1000 + (usecs + 999) / 1000;
+	}
+	else
+		*stallTimeMillis = 0;
+	LWLockRelease(SyncRepLock);
+
+	/* We are done if we are not waiting for any nodes or stalls. */
+	return *waitingFor == 0 && *stallTimeMillis == 0;
+}
+
+/*
+ * Wait for all standbys in the "available" and "joining" states to replay
+ * XactCommitLSN, and all "revoking" standbys' leases to be revoked.  By the
+ * time we return, every standby will either have replayed XactCommitLSN or
+ * will have no lease, so an error would be raised if anyone tries to obtain a
+ * snapshot with synchronous_replay = on.
+ */
+static void
+SyncReplayWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+	long stallTimeMillis;
+	int waitingFor;
+	char *ps_display_buffer = NULL;
+
+	for (;;)
+	{
+		int			rc;
+
+		/* Reset latch before checking state. */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Join the queue to be woken up if any synchronous replay
+		 * joining/available standby applies XactCommitLSN or the set of
+		 * synchronous replay standbys changes (if we aren't already in the
+		 * queue).  We don't actually know if we need to wait for any peers to
+		 * reach the target LSN yet, but we have to register just in case
+		 * before checking the walsenders' state to avoid a race condition
+		 * that could occur if we did it after calling
+		 * SyncReplayCommitCanReturn.  (SyncRepWaitForLSN doesn't have
+		 * to do this because it can check the highest-seen LSN in
+		 * walsndctl->lsn[mode] which is protected by SyncRepLock, the same
+		 * lock as the queues.  We can't do that here, because there is no
+		 * single highest-seen LSN that is useful.  We must check
+		 * walsnd->apply for all relevant walsenders.  Therefore we must
+		 * register for notifications first, so that we can be notified via
+		 * our latch of any standby applying the LSN we're interested in after
+		 * we check but before we start waiting, or we could wait forever for
+		 * something that has already happened.)
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		if (MyProc->syncRepState != SYNC_REP_WAITING)
+		{
+			MyProc->waitLSN = XactCommitLSN;
+			MyProc->syncRepState = SYNC_REP_WAITING;
+			SyncRepQueueInsert(SYNC_REP_WAIT_SYNC_REPLAY);
+			Assert(SyncRepQueueIsOrderedByLSN(SYNC_REP_WAIT_SYNC_REPLAY));
+		}
+		LWLockRelease(SyncRepLock);
+
+		/* Check if we're done. */
+		if (SyncReplayCommitCanReturn(XactCommitLSN, &waitingFor,
+									  &stallTimeMillis))
+		{
+			SyncRepCancelWait();
+			break;
+		}
+
+		Assert(waitingFor > 0 || stallTimeMillis > 0);
+
+		/* If we aren't actually waiting for any standbys, leave the queue. */
+		if (waitingFor == 0)
+			SyncRepCancelWait();
+
+		/* Update the ps title. */
+		if (update_process_title)
+		{
+			char buffer[80];
+
+			/* Remember the old value if this is our first update. */
+			if (ps_display_buffer == NULL)
+			{
+				int len;
+				const char *ps_display = get_ps_display(&len);
+
+				ps_display_buffer = palloc(len + 1);
+				memcpy(ps_display_buffer, ps_display, len);
+				ps_display_buffer[len] = '\0';
+			}
+
+			snprintf(buffer, sizeof(buffer),
+					 "waiting for %d peer(s) to apply %X/%X%s",
+					 waitingFor,
+					 (uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN,
+					 stallTimeMillis > 0 ? " (revoking)" : "");
+			set_ps_display(buffer, false);
+		}
+
+		/* Check if we need to exit early due to postmaster death etc. */
+		if (SyncRepCheckForEarlyExit()) /* Calls SyncRepCancelWait() if true. */
+			break;
+
+		/*
+		 * If we are still waiting for peers, then we wait for any joining or
+		 * available peer to reach the LSN (or possibly stop being in one of
+		 * those states or go away).
+		 *
+		 * If not, there must be a non-zero stall time, so we wait for that to
+		 * elapse.
+		 */
+		if (waitingFor > 0)
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+						   WAIT_EVENT_SYNC_REPLAY);
+		else
+			rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH |
+						   WL_TIMEOUT,
+						   stallTimeMillis,
+						   WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE);
+
+		if (rc & WL_POSTMASTER_DEATH)
+		{
+			ProcDiePending = true;
+			whereToSendOutput = DestNone;
+			SyncRepCancelWait();
+			break;
+		}
+	}
+
+	/* There is no way out of the loop that could leave us in the queue. */
+	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	MyProc->waitLSN = 0;
+
+	/* Restore the ps display. */
+	if (ps_display_buffer != NULL)
+	{
+		set_ps_display(ps_display_buffer, false);
+		pfree(ps_display_buffer);
+	}
+}
+
 /*
  * Wait for synchronous replication, if requested by user.
  *
@@ -149,11 +392,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	const char *old_status;
 	int			mode;
 
-	/* Cap the level for anything other than commit to remote flush only. */
-	if (commit)
-		mode = SyncRepWaitMode;
-	else
-		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+	/* Wait for synchronous replay, if configured. */
+	if (synchronous_replay)
+		SyncReplayWaitForLSN(lsn);
 
 	/*
 	 * Fast exit if user has not requested sync replication.
@@ -167,6 +408,12 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
+	/* Cap the level for anything other than commit to remote flush only. */
+	if (commit)
+		mode = SyncRepWaitMode;
+	else
+		mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
+
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
 	 * set.  See SyncRepUpdateSyncStandbysDefined.
@@ -229,44 +476,9 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
 			break;
 
-		/*
-		 * If a wait for synchronous replication is pending, we can neither
-		 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
-		 * lead the client to believe that the transaction aborted, which is
-		 * not true: it's already committed locally. The former is no good
-		 * either: the client has requested synchronous replication, and is
-		 * entitled to assume that an acknowledged commit is also replicated,
-		 * which might not be true. So in this case we issue a WARNING (which
-		 * some clients may be able to interpret) and shut off further output.
-		 * We do NOT reset ProcDiePending, so that the process will die after
-		 * the commit is cleaned up.
-		 */
-		if (ProcDiePending)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
+		/* Check if we need to break early due to cancel/shutdown. */
+		if (SyncRepCheckForEarlyExit())
 			break;
-		}
-
-		/*
-		 * It's unclear what to do if a query cancel interrupt arrives.  We
-		 * can't actually abort at this point, but ignoring the interrupt
-		 * altogether is not helpful, so we just terminate the wait with a
-		 * suitable warning.
-		 */
-		if (QueryCancelPending)
-		{
-			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			SyncRepCancelWait();
-			break;
-		}
 
 		/*
 		 * Wait on latch.  Any condition that should wake us up will set the
@@ -401,15 +613,66 @@ SyncRepInitConfig(void)
 	}
 }
 
+/*
+ * Check if the current WALSender process's application_name matches a name in
+ * synchronous_replay_standby_names (including '*' for wildcard).
+ */
+bool
+SyncReplayPotentialStandby(void)
+{
+	char *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		found = false;
+
+	/* If the feature is disabled, then no. */
+	if (synchronous_replay_max_lag == 0)
+		return false;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(synchronous_replay_standby_names);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		pfree(rawstring);
+		list_free(elemlist);
+		/* GUC machinery will have already complained - no need to do again */
+		return false;
+	}
+
+	foreach(l, elemlist)
+	{
+		char	   *standby_name = (char *) lfirst(l);
+
+		if (pg_strcasecmp(standby_name, application_name) == 0 ||
+			pg_strcasecmp(standby_name, "*") == 0)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return found;
+}
+
 /*
  * Update the LSNs on each queue based upon our latest state. This
  * implements a simple policy of first-valid-sync-standby-releases-waiter.
  *
+ * 'am_syncreplay_blocker' should be set to true if the standby managed by
+ * this walsender is in a synchronous replay state that blocks commit (joining
+ * or available).
+ *
  * Other policies are possible, which would change what we do here and
  * perhaps also which information we store as well.
  */
 void
-SyncRepReleaseWaiters(void)
+SyncRepReleaseWaiters(bool am_syncreplay_blocker)
 {
 	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	XLogRecPtr	writePtr;
@@ -423,15 +686,17 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If this WALSender is serving a standby that is not on the list of
-	 * potential sync standbys then we have nothing to do. If we are still
-	 * starting up, still running base backup or the current flush position is
-	 * still invalid, then leave quickly also.  Streaming or stopping WAL
-	 * senders are allowed to release waiters.
+	 * potential sync standbys and not in a state that synchronous_replay waits
+	 * for, then we have nothing to do. If we are still starting up, still
+	 * running base backup or the current flush position is still invalid,
+	 * then leave quickly also.
 	 */
-	if (MyWalSnd->sync_standby_priority == 0 ||
-		(MyWalSnd->state != WALSNDSTATE_STREAMING &&
-		 MyWalSnd->state != WALSNDSTATE_STOPPING) ||
-		XLogRecPtrIsInvalid(MyWalSnd->flush))
+	if (!am_syncreplay_blocker &&
+		(MyWalSnd->sync_standby_priority == 0 ||
+		 (MyWalSnd->state != WALSNDSTATE_STREAMING &&
+		  MyWalSnd->state != WALSNDSTATE_STOPPING) ||
+		 MyWalSnd->state < WALSNDSTATE_STREAMING ||
+		 XLogRecPtrIsInvalid(MyWalSnd->flush)))
 	{
 		announce_next_takeover = true;
 		return;
@@ -469,9 +734,10 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * If the number of sync standbys is less than requested or we aren't
-	 * managing a sync standby then just leave.
+	 * managing a sync standby or a standby in synchronous replay state that
+	 * blocks then just leave.
 	 */
-	if (!got_recptr || !am_sync)
+	if ((!got_recptr || !am_sync) && !am_syncreplay_blocker)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -480,24 +746,36 @@ SyncRepReleaseWaiters(void)
 
 	/*
 	 * Set the lsn first so that when we wake backends they will release up to
-	 * this location.
+	 * this location, for backends waiting for synchronous commit.
 	 */
-	if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
-		numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
-	{
-		walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
-		numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
-	}
-	if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+	if (got_recptr && am_sync)
 	{
-		walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
-		numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
+		if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < writePtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_WRITE] = writePtr;
+			numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE, writePtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < flushPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = flushPtr;
+			numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH, flushPtr);
+		}
+		if (walsndctl->lsn[SYNC_REP_WAIT_APPLY] < applyPtr)
+		{
+			walsndctl->lsn[SYNC_REP_WAIT_APPLY] = applyPtr;
+			numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY, applyPtr);
+		}
 	}
 
+	/*
+	 * Wake backends that are waiting for synchronous_replay, if this walsender
+	 * manages a standby that is in synchronous replay 'available' or 'joining'
+	 * state.
+	 */
+	if (am_syncreplay_blocker)
+		SyncRepWakeQueue(false, SYNC_REP_WAIT_SYNC_REPLAY,
+						 MyWalSnd->apply);
+
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to apply %X/%X",
@@ -997,9 +1275,8 @@ SyncRepGetStandbyPriority(void)
  * Must hold SyncRepLock.
  */
 static int
-SyncRepWakeQueue(bool all, int mode)
+SyncRepWakeQueue(bool all, int mode, XLogRecPtr lsn)
 {
-	volatile WalSndCtlData *walsndctl = WalSndCtl;
 	PGPROC	   *proc = NULL;
 	PGPROC	   *thisproc = NULL;
 	int			numprocs = 0;
@@ -1016,7 +1293,7 @@ SyncRepWakeQueue(bool all, int mode)
 		/*
 		 * Assume the queue is ordered by LSN
 		 */
-		if (!all && walsndctl->lsn[mode] < proc->waitLSN)
+		if (!all && lsn < proc->waitLSN)
 			return numprocs;
 
 		/*
@@ -1083,7 +1360,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 			int			i;
 
 			for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
-				SyncRepWakeQueue(true, i);
+				SyncRepWakeQueue(true, i, InvalidXLogRecPtr);
 		}
 
 		/*
@@ -1134,6 +1411,51 @@ SyncRepQueueIsOrderedByLSN(int mode)
 }
 #endif
 
+static bool
+SyncRepCheckForEarlyExit(void)
+{
+	/*
+	 * If a wait for synchronous replication is pending, we can neither
+	 * acknowledge the commit nor raise ERROR or FATAL.  The latter would
+	 * lead the client to believe that the transaction aborted, which is
+	 * not true: it's already committed locally. The former is no good
+	 * either: the client has requested synchronous replication, and is
+	 * entitled to assume that an acknowledged commit is also replicated,
+	 * which might not be true. So in this case we issue a WARNING (which
+	 * some clients may be able to interpret) and shut off further output.
+	 * We do NOT reset ProcDiePending, so that the process will die after
+	 * the commit is cleaned up.
+	 */
+	if (ProcDiePending)
+	{
+		ereport(WARNING,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		whereToSendOutput = DestNone;
+		SyncRepCancelWait();
+		return true;
+	}
+
+	/*
+	 * It's unclear what to do if a query cancel interrupt arrives.  We
+	 * can't actually abort at this point, but ignoring the interrupt
+	 * altogether is not helpful, so we just terminate the wait with a
+	 * suitable warning.
+	 */
+	if (QueryCancelPending)
+	{
+		QueryCancelPending = false;
+		ereport(WARNING,
+				(errmsg("canceling wait for synchronous replication due to user request"),
+				 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+		SyncRepCancelWait();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * ===========================================================
  * Synchronous Replication functions executed by any process
@@ -1203,6 +1525,31 @@ assign_synchronous_standby_names(const char *newval, void *extra)
 	SyncRepConfig = (SyncRepConfigData *) extra;
 }
 
+bool
+check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source)
+{
+	char	   *rawstring;
+	List	   *elemlist;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	pfree(rawstring);
+	list_free(elemlist);
+
+	return true;
+}
+
 void
 assign_synchronous_commit(int newval, void *extra)
 {
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad52..f2ad595de87 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -58,6 +58,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
@@ -140,9 +141,10 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo);
 static void XLogWalRcvSendHSFeedback(bool immed);
-static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+								  TimestampTz *syncReplayLease);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -476,7 +478,7 @@ WalReceiverMain(void)
 					}
 
 					/* Let the master know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, -1);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -521,7 +523,7 @@ WalReceiverMain(void)
 						 */
 						walrcv->force_reply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(true, false, -1);
 					}
 				}
 				if (rc & WL_TIMEOUT)
@@ -570,7 +572,7 @@ WalReceiverMain(void)
 						}
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, -1);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -862,6 +864,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 	XLogRecPtr	walEnd;
 	TimestampTz sendTime;
 	bool		replyRequested;
+	TimestampTz syncReplayLease;
+	int64		messageNumber;
 
 	resetStringInfo(&incoming_message);
 
@@ -881,7 +885,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				dataStart = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, NULL);
 
 				buf += hdrlen;
 				len -= hdrlen;
@@ -891,7 +895,8 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 		case 'k':				/* Keepalive */
 			{
 				/* copy message to StringInfo */
-				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);
+				hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64) +
+					sizeof(char) + sizeof(int64);
 				if (len != hdrlen)
 					ereport(ERROR,
 							(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -899,15 +904,17 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 				appendBinaryStringInfo(&incoming_message, buf, hdrlen);
 
 				/* read the fields */
+				messageNumber = pq_getmsgint64(&incoming_message);
 				walEnd = pq_getmsgint64(&incoming_message);
 				sendTime = pq_getmsgint64(&incoming_message);
 				replyRequested = pq_getmsgbyte(&incoming_message);
+				syncReplayLease = pq_getmsgint64(&incoming_message);
 
-				ProcessWalSndrMessage(walEnd, sendTime);
+				ProcessWalSndrMessage(walEnd, sendTime, &syncReplayLease);
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, messageNumber);
 				break;
 			}
 		default:
@@ -1070,7 +1077,7 @@ XLogWalRcvFlush(bool dying)
 		/* Also let the master know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, -1);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1088,9 +1095,12 @@ XLogWalRcvFlush(bool dying)
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbearts, when approaching
  * wal_receiver_timeout.
+ *
+ * If this is a reply to a specific message from the upstream server, then
+ * 'replyTo' should include the message number, otherwise -1.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, int64 replyTo)
 {
 	static XLogRecPtr writePtr = 0;
 	static XLogRecPtr flushPtr = 0;
@@ -1137,6 +1147,7 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	pq_sendint64(&reply_message, applyPtr);
 	pq_sendint64(&reply_message, GetCurrentTimestamp());
 	pq_sendbyte(&reply_message, requestReply ? 1 : 0);
+	pq_sendint64(&reply_message, replyTo);
 
 	/* Send it */
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X%s",
@@ -1269,15 +1280,56 @@ XLogWalRcvSendHSFeedback(bool immed)
  * Update shared memory status upon receiving a message from primary.
  *
  * 'walEnd' and 'sendTime' are the end-of-WAL and timestamp of the latest
- * message, reported by primary.
+ * message, reported by primary.  'syncReplayLease' points to the time until
+ * which the primary promises this standby can safely claim to be causally
+ * consistent, points to 0 if it cannot, or is NULL for no change.
  */
 static void
-ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime,
+					  TimestampTz *syncReplayLease)
 {
 	WalRcvData *walrcv = WalRcv;
 
 	TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
 
+	/* Sanity check for the syncReplayLease time. */
+	if (syncReplayLease != NULL && *syncReplayLease != 0)
+	{
+		/*
+		 * Deduce max_clock_skew from the syncReplayLease and sendTime since
+		 * we don't have access to the primary's GUC.  The primary already
+		 * subtracted 25% from synchronous_replay_lease_time to represent
+		 * max_clock_skew, so we have 75%.  A third of that will give us 25%.
+		 */
+		int64 diffMillis = (*syncReplayLease - sendTime) / 1000;
+		int64 max_clock_skew = diffMillis / 3;
+		if (sendTime > TimestampTzPlusMilliseconds(lastMsgReceiptTime,
+												   max_clock_skew))
+		{
+			/*
+			 * The primary's clock is more than max_clock_skew + network
+			 * latency ahead of the standby's clock.  (If the primary's clock
+			 * is more than max_clock_skew ahead of the standby's clock, but
+			 * by less than the network latency, then there isn't much we can
+			 * do to detect that; but it still seems useful to have this basic
+			 * sanity check for wildly misconfigured servers.)
+			 */
+			ereport(LOG,
+					(errmsg("the primary server's clock time is too far ahead for synchronous_replay"),
+					 errhint("Check your servers' NTP configuration or equivalent.")));
+
+			syncReplayLease = NULL;
+		}
+		/*
+		 * We could also try to detect cases where sendTime is more than
+		 * max_clock_skew in the past according to the standby's clock, but
+		 * that is indistinguishable from network latency/buffering, so we
+		 * could produce misleading error messages; if we do nothing, the
+		 * consequence is 'standby is not available for synchronous replay'
+		 * errors which should cause the user to investigate.
+		 */
+	}
+
 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
 	if (walrcv->latestWalEnd < walEnd)
@@ -1285,6 +1337,8 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->latestWalEnd = walEnd;
 	walrcv->lastMsgSendTime = sendTime;
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+	if (syncReplayLease != NULL)
+		walrcv->syncReplayLease = *syncReplayLease;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (log_min_messages <= DEBUG2)
@@ -1322,7 +1376,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
  * This is called by the startup process whenever interesting xlog records
  * are applied, so that walreceiver can check if it needs to send an apply
  * notification back to the master which may be waiting in a COMMIT with
- * synchronous_commit = remote_apply.
+ * synchronous_commit = remote_apply or synchronous_replay = on.
  */
 void
 WalRcvForceReply(void)
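
For anyone writing their own receiver against this patch, here is a rough standalone sketch of the extended keepalive ('k') message layout, following the field order used by WalSndKeepalive in this patch: message number, walEnd, sendTime, replyRequested, lease. The struct and function names are hypothetical and exist only for illustration; they are not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration; not a type from the patch. */
typedef struct
{
	int64_t		message_number;
	int64_t		wal_end;
	int64_t		send_time;
	bool		reply_requested;
	int64_t		lease;			/* 0 means no lease / lease revoked */
} KeepaliveSketch;

/* msgtype + messageNumber + walEnd + sendTime + replyRequested + lease */
#define KEEPALIVE_SKETCH_LEN (1 + 8 + 8 + 8 + 1 + 8)

/* Decode a 64-bit integer sent in network byte order (big-endian). */
int64_t
sketch_read_int64(const unsigned char *p)
{
	uint64_t	v = 0;
	int			i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return (int64_t) v;
}

/* Returns false if the buffer is too short or is not a 'k' message. */
bool
sketch_parse_keepalive(const unsigned char *buf, size_t len,
					   KeepaliveSketch *out)
{
	size_t		pos = 0;

	if (len < KEEPALIVE_SKETCH_LEN || buf[pos] != 'k')
		return false;
	pos += 1;					/* skip msgtype 'k' */
	out->message_number = sketch_read_int64(buf + pos);
	pos += 8;
	out->wal_end = sketch_read_int64(buf + pos);
	pos += 8;
	out->send_time = sketch_read_int64(buf + pos);
	pos += 8;
	out->reply_requested = buf[pos] != 0;
	pos += 1;
	out->lease = sketch_read_int64(buf + pos);
	return true;
}
```

Clients that don't care about leases, such as pg_recvlogical and pg_receivewal in this patch, only need to skip the extra 8 bytes of message number and can ignore the trailing lease field.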
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 2d6cdfe0a21..a6f4ede644c 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -27,6 +27,7 @@
 #include "replication/walreceiver.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/guc.h"
 #include "utils/timestamp.h"
 
 WalRcvData *WalRcv = NULL;
@@ -376,3 +377,21 @@ GetReplicationTransferLatency(void)
 
 	return ms;
 }
+
+/*
+ * Used by snapmgr to check if this standby has a valid lease, granting it the
+ * right to consider itself available for synchronous replay.
+ */
+bool
+WalRcvSyncReplayAvailable(void)
+{
+	WalRcvData *walrcv = WalRcv;
+	TimestampTz now = GetCurrentTimestamp();
+	bool result;
+
+	SpinLockAcquire(&walrcv->mutex);
+	result = walrcv->syncReplayLease != 0 && now <= walrcv->syncReplayLease;
+	SpinLockRelease(&walrcv->mutex);
+
+	return result;
+}
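
To make the lease arithmetic easier to check, here is a small standalone sketch of the numbers involved, assuming microsecond timestamps (as with TimestampTz) and millisecond durations. The function names are made up for this illustration and do not appear in the patch; the formulas mirror the comments in WalSndKeepalive and ProcessWalSndrMessage.

```c
#include <stdbool.h>
#include <stdint.h>

/* Timestamps in microseconds, durations in milliseconds (illustrative). */
typedef int64_t ts_usec;

/* Primary: when the lease really expires according to the primary's clock. */
ts_usec
sketch_lease_expiry_on_primary(ts_usec now, int lease_time_ms)
{
	return now + (int64_t) lease_time_ms * 1000;
}

/*
 * Primary: the expiry time actually sent to the standby, shortened by the
 * assumed max clock skew of 25% of the lease time.
 */
ts_usec
sketch_lease_sent_to_standby(ts_usec now, int lease_time_ms)
{
	int			max_clock_skew_ms = lease_time_ms / 4;

	return sketch_lease_expiry_on_primary(now, lease_time_ms) -
		(int64_t) max_clock_skew_ms * 1000;
}

/*
 * Standby: recover max_clock_skew in ms.  The lease it sees is 75% of the
 * lease time after sendTime, so a third of that is 25%.
 */
int64_t
sketch_deduced_max_clock_skew_ms(ts_usec lease, ts_usec send_time)
{
	return (lease - send_time) / 1000 / 3;
}

/* Standby: is the lease still valid according to the standby's clock? */
bool
sketch_lease_is_valid(ts_usec lease, ts_usec standby_now)
{
	return lease != 0 && standby_now <= lease;
}
```

With the default synchronous_replay_lease_time of 5s, the primary remembers an expiry 5s out, tells the standby 3.75s, and replaces the lease every 2.5s. A standby whose clock runs up to 1.25s fast still sees each lease last 2.5s, so consecutive leases leave no gap.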
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2d2eb23eb73..4e10c4b1465 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -173,6 +173,18 @@ static TimestampTz last_reply_timestamp = 0;
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
 
+/* At what point in the WAL can we progress from JOINING state? */
+static XLogRecPtr synchronous_replay_joining_until = 0;
+
+/* The last synchronous replay lease sent to the standby. */
+static TimestampTz synchronous_replay_last_lease = 0;
+
+/* The last synchronous replay lease revocation message's number. */
+static int64 synchronous_replay_revoke_msgno = 0;
+
+/* Is this WALSender listed in synchronous_replay_standby_names? */
+static bool am_potential_synchronous_replay_standby = false;
+
 /*
  * While streaming WAL in Copy mode, streamingDoneSending is set to true
  * after we have sent CopyDone. We should not send any more CopyData messages
@@ -244,7 +256,7 @@ static void ProcessStandbyMessage(void);
 static void ProcessStandbyReplyMessage(void);
 static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
-static void WalSndKeepalive(bool requestReply);
+static int64 WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(void);
 static void WalSndCheckTimeOut(void);
 static long WalSndComputeSleeptime(TimestampTz now);
@@ -287,6 +299,61 @@ InitWalSender(void)
 	lag_tracker = MemoryContextAllocZero(TopMemoryContext, sizeof(LagTracker));
 }
 
+/*
+ * If we are exiting unexpectedly, we may need to hold up concurrent
+ * synchronous_replay commits to make sure any lease that was granted has
+ * expired.
+ */
+static void
+PrepareUncleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * We've lost contact with the standby, but it may still be alive.  We
+		 * can't let any committing synchronous_replay transactions return
+		 * control until we've stalled for long enough for a zombie standby to
+		 * start raising errors because its lease has expired.  Because our
+		 * WalSnd slot is going away, we need to use the shared
+		 * WalSndCtl->revokingUntil variable.
+		 */
+		elog(LOG,
+			 "contact lost with standby \"%s\", revoking synchronous replay lease by stalling",
+			 application_name);
+
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->revokingUntil = Max(WalSndCtl->revokingUntil,
+									   synchronous_replay_last_lease);
+		LWLockRelease(SyncRepLock);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
+/*
+ * We are shutting down because we received a goodbye message from the
+ * walreceiver.
+ */
+static void
+PrepareCleanExit(void)
+{
+	if (MyWalSnd->syncReplayState == SYNC_REPLAY_AVAILABLE)
+	{
+		/*
+		 * The standby is shutting down, so it won't be running any more
+		 * transactions.  It is therefore safe to stop waiting for it without
+		 * any kind of lease revocation protocol.
+		 */
+		elog(LOG, "standby \"%s\" is leaving synchronous replay set", application_name);
+
+		SpinLockAcquire(&MyWalSnd->mutex);
+		MyWalSnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
+		SpinLockRelease(&MyWalSnd->mutex);
+	}
+}
+
 /*
  * Clean up after an error.
  *
@@ -315,7 +382,10 @@ WalSndErrorCleanup(void)
 	replication_active = false;
 
 	if (got_STOPPING || got_SIGUSR2)
+	{
+		PrepareUncleanExit();
 		proc_exit(0);
+	}
 
 	/* Revert back to startup state */
 	WalSndSetState(WALSNDSTATE_STARTUP);
@@ -327,6 +397,8 @@ WalSndErrorCleanup(void)
 static void
 WalSndShutdown(void)
 {
+	PrepareUncleanExit();
+
 	/*
 	 * Reset whereToSendOutput to prevent ereport from attempting to send any
 	 * more messages to the standby.
@@ -1600,6 +1672,7 @@ ProcessRepliesIfAny(void)
 		if (r < 0)
 		{
 			/* unexpected error or EOF */
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1616,6 +1689,7 @@ ProcessRepliesIfAny(void)
 		resetStringInfo(&reply_message);
 		if (pq_getmessage(&reply_message, 0))
 		{
+			PrepareUncleanExit();
 			ereport(COMMERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
 					 errmsg("unexpected EOF on standby connection")));
@@ -1665,6 +1739,7 @@ ProcessRepliesIfAny(void)
 				 * 'X' means that the standby is closing down the socket.
 				 */
 			case 'X':
+				PrepareCleanExit();
 				proc_exit(0);
 
 			default:
@@ -1762,10 +1837,12 @@ ProcessStandbyReplyMessage(void)
 				flushLag,
 				applyLag;
 	bool		clearLagTimes;
+	int64		replyTo;
 	TimestampTz now;
 	TimestampTz replyTime;
 
 	static bool fullyAppliedLastTime = false;
+	static TimestampTz fullyAppliedSince = 0;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -1773,6 +1850,7 @@ ProcessStandbyReplyMessage(void)
 	applyPtr = pq_getmsgint64(&reply_message);
 	replyTime = pq_getmsgint64(&reply_message);
 	replyRequested = pq_getmsgbyte(&reply_message);
+	replyTo = pq_getmsgint64(&reply_message);
 
 	if (log_min_messages <= DEBUG2)
 	{
@@ -1798,17 +1876,17 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL for at least
+	 * wal_receiver_status_interval, then let's clear the lag times that were
+	 * measured when it last wrote/flushed/applied a WAL record.  This way we
+	 * avoid displaying stale lag data until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
 	if (applyPtr == sentPtr)
 	{
-		if (fullyAppliedLastTime)
+		if (!fullyAppliedLastTime)
+			fullyAppliedSince = now;
+		else if (now - fullyAppliedSince >= wal_receiver_status_interval * USECS_PER_SEC)
 			clearLagTimes = true;
 		fullyAppliedLastTime = true;
 	}
@@ -1824,8 +1902,53 @@ ProcessStandbyReplyMessage(void)
 	 * standby.
 	 */
 	{
+		int			next_sr_state = -1;
 		WalSnd	   *walsnd = MyWalSnd;
 
+		/* Handle synchronous replay state machine. */
+		if (am_potential_synchronous_replay_standby && !am_cascading_walsender)
+		{
+			bool replay_lag_acceptable;
+
+			/* Check if the lag is acceptable (includes -1 for caught up). */
+			if (applyLag < (int64) synchronous_replay_max_lag * 1000)
+				replay_lag_acceptable = true;
+			else
+				replay_lag_acceptable = false;
+
+			/* Figure out whether the state needs to change. */
+			switch (walsnd->syncReplayState)
+			{
+			case SYNC_REPLAY_UNAVAILABLE:
+				/* Can we join? */
+				if (replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_JOINING;
+				break;
+			case SYNC_REPLAY_JOINING:
+				/* Are we still applying fast enough? */
+				if (replay_lag_acceptable)
+				{
+					/* Have we reached the join point yet? */
+					if (applyPtr >= synchronous_replay_joining_until)
+						next_sr_state = SYNC_REPLAY_AVAILABLE;
+				}
+				else
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			case SYNC_REPLAY_AVAILABLE:
+				/* Are we still applying fast enough? */
+				if (!replay_lag_acceptable)
+					next_sr_state = SYNC_REPLAY_REVOKING;
+				break;
+			case SYNC_REPLAY_REVOKING:
+				/* Has the revocation been acknowledged or timed out? */
+				if (replyTo == synchronous_replay_revoke_msgno ||
+					now >= walsnd->revokingUntil)
+					next_sr_state = SYNC_REPLAY_UNAVAILABLE;
+				break;
+			}
+		}
+
 		SpinLockAcquire(&walsnd->mutex);
 		walsnd->write = writePtr;
 		walsnd->flush = flushPtr;
@@ -1837,11 +1960,55 @@ ProcessStandbyReplyMessage(void)
 		if (applyLag != -1 || clearLagTimes)
 			walsnd->applyLag = applyLag;
 		walsnd->replyTime = replyTime;
+		if (next_sr_state != -1)
+			walsnd->syncReplayState = next_sr_state;
+		if (next_sr_state == SYNC_REPLAY_REVOKING)
+			walsnd->revokingUntil = synchronous_replay_last_lease;
 		SpinLockRelease(&walsnd->mutex);
+
+		/*
+		 * Post shmem-update actions for synchronous replay state transitions.
+		 */
+		switch (next_sr_state)
+		{
+		case SYNC_REPLAY_JOINING:
+			/*
+			 * Now that we've started waiting for this standby, we need to
+			 * make sure that everything flushed before now has been applied
+			 * before we move to available and issue a lease.
+			 */
+			synchronous_replay_joining_until = GetFlushRecPtr();
+			ereport(LOG,
+					(errmsg("standby \"%s\" joining synchronous replay set...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_AVAILABLE:
+			/* Issue a new lease to the standby. */
+			WalSndKeepalive(false);
+			ereport(LOG,
+					(errmsg("standby \"%s\" is available for synchronous replay",
+							application_name)));
+			break;
+		case SYNC_REPLAY_REVOKING:
+			/* Revoke the standby's lease, and note the message number. */
+			synchronous_replay_revoke_msgno = WalSndKeepalive(true);
+			ereport(LOG,
+					(errmsg("revoking synchronous replay lease for standby \"%s\"...",
+							application_name)));
+			break;
+		case SYNC_REPLAY_UNAVAILABLE:
+			ereport(LOG,
+					(errmsg("standby \"%s\" is no longer available for synchronous replay",
+							application_name)));
+			break;
+		default:
+			/* No change. */
+			break;
+		}
 	}
 
 	if (!am_cascading_walsender)
-		SyncRepReleaseWaiters();
+		SyncRepReleaseWaiters(MyWalSnd->syncReplayState >= SYNC_REPLAY_JOINING);
 
 	/*
 	 * Advance our local xmin horizon when the client confirmed a flush.
@@ -2055,33 +2222,52 @@ ProcessStandbyHSFeedbackMessage(void)
  * If wal_sender_timeout is enabled we want to wake up in time to send
  * keepalives and to abort the connection if wal_sender_timeout has been
  * reached.
+ *
+ * But if synchronous_replay_max_lag is enabled, we override that and send
+ * keepalives at a constant rate to replace expiring leases.
  */
 static long
 WalSndComputeSleeptime(TimestampTz now)
 {
 	long		sleeptime = 10000;	/* 10 s */
 
-	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
+	if ((wal_sender_timeout > 0 && last_reply_timestamp > 0) ||
+		am_potential_synchronous_replay_standby)
 	{
 		TimestampTz wakeup_time;
 		long		sec_to_timeout;
 		int			microsec_to_timeout;
 
-		/*
-		 * At the latest stop sleeping once wal_sender_timeout has been
-		 * reached.
-		 */
-		wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-												  wal_sender_timeout);
-
-		/*
-		 * If no ping has been sent yet, wakeup when it's time to do so.
-		 * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
-		 * the timeout passed without a response.
-		 */
-		if (!waiting_for_ping_response)
+		if (am_potential_synchronous_replay_standby)
+		{
+			/*
+			 * We need to keep replacing leases before they expire.  We'll do
+			 * that halfway through the lease time according to our clock, to
+			 * allow for the standby's clock to be ahead of the primary's by
+			 * 25% of synchronous_replay_lease_time.
+			 */
+			wakeup_time =
+				TimestampTzPlusMilliseconds(last_reply_timestamp,
+											synchronous_replay_lease_time / 2);
+		}
+		else
+		{
+			/*
+			 * At the latest stop sleeping once wal_sender_timeout has been
+			 * reached.
+			 */
 			wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-													  wal_sender_timeout / 2);
+													  wal_sender_timeout);
+
+			/*
+			 * If no ping has been sent yet, wakeup when it's time to do so.
+			 * WalSndKeepaliveIfNecessary() wants to send a keepalive once
+			 * half of the timeout passed without a response.
+			 */
+			if (!waiting_for_ping_response)
+				wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+														  wal_sender_timeout / 2);
+		}
 
 		/* Compute relative time until wakeup. */
 		TimestampDifference(now, wakeup_time,
@@ -2105,20 +2291,33 @@ WalSndComputeSleeptime(TimestampTz now)
  * message every standby_message_timeout = wal_sender_timeout/6 = 10s.  We
  * could eliminate that problem by recognizing timeout expiration at
  * wal_sender_timeout/2 after the keepalive.
+ *
+ * If synchronous replay is configured, we override that so that unresponsive
+ * standbys are detected sooner.
  */
 static void
 WalSndCheckTimeOut(void)
 {
 	TimestampTz timeout;
+	int allowed_time;
 
 	/* don't bail out if we're doing something that doesn't require timeouts */
 	if (last_reply_timestamp <= 0)
 		return;
 
-	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
-										  wal_sender_timeout);
+	/*
+	 * If synchronous replay support is configured, we use
+	 * synchronous_replay_lease_time instead of wal_sender_timeout, to limit
+	 * the time before an unresponsive synchronous replay standby is dropped.
+	 */
+	if (am_potential_synchronous_replay_standby)
+		allowed_time = synchronous_replay_lease_time;
+	else
+		allowed_time = wal_sender_timeout;
 
-	if (wal_sender_timeout > 0 && last_processing >= timeout)
+	timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
+										  allowed_time);
+	if (allowed_time > 0 && last_processing >= timeout)
 	{
 		/*
 		 * Since typically expiration of replication timeout means
@@ -2143,6 +2342,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 	last_reply_timestamp = GetCurrentTimestamp();
 	waiting_for_ping_response = false;
 
+	/* Check if we are managing a potential synchronous replay standby. */
+	am_potential_synchronous_replay_standby = SyncReplayPotentialStandby();
+
 	/*
 	 * Loop until we reach the end of this timeline or the client requests to
 	 * stop streaming.
@@ -2301,6 +2503,7 @@ InitWalSenderSlot(void)
 			walsnd->flushLag = -1;
 			walsnd->applyLag = -1;
 			walsnd->state = WALSNDSTATE_STARTUP;
+			walsnd->syncReplayState = SYNC_REPLAY_UNAVAILABLE;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
 			SpinLockRelease(&walsnd->mutex);
@@ -3198,6 +3401,27 @@ WalSndGetStateString(WalSndState state)
 	return "UNKNOWN";
 }
 
+/*
+ * Return a string constant representing the synchronous replay state. This is
+ * used in system views, and should *not* be translated.
+ */
+static const char *
+WalSndGetSyncReplayStateString(SyncReplayState state)
+{
+	switch (state)
+	{
+	case SYNC_REPLAY_UNAVAILABLE:
+		return "unavailable";
+	case SYNC_REPLAY_JOINING:
+		return "joining";
+	case SYNC_REPLAY_AVAILABLE:
+		return "available";
+	case SYNC_REPLAY_REVOKING:
+		return "revoking";
+	}
+	return "UNKNOWN";
+}
+
 static Interval *
 offset_to_interval(TimeOffset offset)
 {
@@ -3217,7 +3441,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	13
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	TupleDesc	tupdesc;
 	Tuplestorestate *tupstore;
@@ -3272,6 +3496,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		int			pid;
 		WalSndState state;
 		TimestampTz replyTime;
+		SyncReplayState syncReplayState;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS];
 
@@ -3284,6 +3509,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		pid = walsnd->pid;
 		sentPtr = walsnd->sentPtr;
 		state = walsnd->state;
+		syncReplayState = walsnd->syncReplayState;
 		write = walsnd->write;
 		flush = walsnd->flush;
 		apply = walsnd->apply;
@@ -3369,10 +3595,13 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			else
 				values[10] = CStringGetTextDatum("potential");
 
+			values[11] =
+				CStringGetTextDatum(WalSndGetSyncReplayStateString(syncReplayState));
+
 			if (replyTime == 0)
-				nulls[11] = true;
+				nulls[12] = true;
 			else
-				values[11] = TimestampTzGetDatum(replyTime);
+				values[12] = TimestampTzGetDatum(replyTime);
 		}
 
 		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
@@ -3388,21 +3617,69 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
   * This function is used to send a keepalive message to standby.
   * If requestReply is set, sets a flag in the message requesting the standby
   * to send a message back to us, for heartbeat purposes.
+  * Returns the serial number of the message that was sent.
   */
-static void
+static int64
 WalSndKeepalive(bool requestReply)
 {
+	TimestampTz synchronous_replay_lease;
+	TimestampTz now;
+
+	static int64 message_number = 0;
+
 	elog(DEBUG2, "sending replication keepalive");
 
+	/* Grant a synchronous replay lease if appropriate. */
+	now = GetCurrentTimestamp();
+	if (MyWalSnd->syncReplayState != SYNC_REPLAY_AVAILABLE)
+	{
+		/* No lease granted, and any earlier lease is revoked. */
+		synchronous_replay_lease = 0;
+	}
+	else
+	{
+		/*
+		 * Since this timestamp is being sent to the standby where it will be
+		 * compared against a time generated by the standby's system clock, we
+		 * must consider clock skew.  We use 25% of the lease time as max
+		 * clock skew, and we subtract that from the time we send with the
+		 * following reasoning:
+		 *
+		 * 1.  If the standby's clock is slow (ie behind the primary's) by up
+		 * to that much, then subtracting this amount will make sure the
+		 * lease doesn't survive past that time according to the primary's
+		 * clock.
+		 *
+		 * 2.  If the standby's clock is fast (ie ahead of the primary's) by
+		 * up to that much, then by subtracting this amount there won't be any
+		 * gaps between leases, since leases are reissued every time 50% of
+		 * the lease time elapses (see WalSndKeepaliveIfNecessary and
+		 * WalSndComputeSleeptime).
+		 */
+		int max_clock_skew = synchronous_replay_lease_time / 4;
+
+		/* Compute and remember the expiry time of the lease we're granting. */
+		synchronous_replay_last_lease =
+			TimestampTzPlusMilliseconds(now, synchronous_replay_lease_time);
+		/* Adjust the version we send for clock skew. */
+		synchronous_replay_lease =
+			TimestampTzPlusMilliseconds(synchronous_replay_last_lease,
+										-max_clock_skew);
+	}
+
 	/* construct the message... */
 	resetStringInfo(&output_message);
 	pq_sendbyte(&output_message, 'k');
+	pq_sendint64(&output_message, ++message_number);
 	pq_sendint64(&output_message, sentPtr);
-	pq_sendint64(&output_message, GetCurrentTimestamp());
+	pq_sendint64(&output_message, now);
 	pq_sendbyte(&output_message, requestReply ? 1 : 0);
+	pq_sendint64(&output_message, synchronous_replay_lease);
 
 	/* ... and send it wrapped in CopyData */
 	pq_putmessage_noblock('d', output_message.data, output_message.len);
+
+	return message_number;
 }
 
 /*
@@ -3417,19 +3694,30 @@ WalSndKeepaliveIfNecessary(void)
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
-	if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
-		return;
-
-	if (waiting_for_ping_response)
-		return;
+	if (!am_potential_synchronous_replay_standby)
+	{
+		if (wal_sender_timeout <= 0 || last_reply_timestamp <= 0)
+			return;
+		if (waiting_for_ping_response)
+			return;
+	}
 
 	/*
 	 * If half of wal_sender_timeout has lapsed without receiving any reply
 	 * from the standby, send a keep-alive message to the standby requesting
 	 * an immediate reply.
+	 *
+	 * If synchronous replay has been configured, use
+	 * synchronous_replay_lease_time to control keepalive intervals rather
+	 * than wal_sender_timeout, so that we can keep replacing leases at the
+	 * right frequency.
 	 */
-	ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-											wal_sender_timeout / 2);
+	if (am_potential_synchronous_replay_standby)
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												synchronous_replay_lease_time / 2);
+	else
+		ping_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+												wal_sender_timeout / 2);
 	if (last_processing >= ping_time)
 	{
 		WalSndKeepalive(true);
@@ -3473,7 +3761,7 @@ LagTrackerWrite(XLogRecPtr lsn, TimestampTz local_flush_time)
 	 */
 	new_write_head = (lag_tracker->write_head + 1) % LAG_TRACKER_BUFFER_SIZE;
 	buffer_full = false;
-	for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; ++i)
+	for (i = 0; i < SYNC_REP_WAIT_SYNC_REPLAY; ++i)
 	{
 		if (new_write_head == lag_tracker->read_heads[i])
 			buffer_full = true;
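
Since the per-standby state machine in ProcessStandbyReplyMessage is spread over two switch statements, here is a condensed sketch of just the transition rules as a pure function. The enum and function names are hypothetical stand-ins for this illustration (the patch uses SyncReplayState and computes next_sr_state inline), and the boolean inputs summarise conditions the patch evaluates from applyLag, applyPtr, replyTo and revokingUntil.

```c
#include <stdbool.h>

/* Illustrative stand-in for the patch's SyncReplayState. */
typedef enum
{
	SR_UNAVAILABLE,
	SR_JOINING,
	SR_AVAILABLE,
	SR_REVOKING
} SrStateSketch;

/*
 * Return the next state, or -1 for no change.
 *
 * lag_acceptable:          replay lag is below synchronous_replay_max_lag
 * reached_join_point:      applyPtr has passed the LSN noted when joining
 * revoke_acked_or_expired: the standby acknowledged the revocation message,
 *                          or the last lease has expired on the primary
 */
int
sketch_next_sync_replay_state(SrStateSketch current,
							  bool lag_acceptable,
							  bool reached_join_point,
							  bool revoke_acked_or_expired)
{
	switch (current)
	{
		case SR_UNAVAILABLE:
			/* Can we join? */
			return lag_acceptable ? SR_JOINING : -1;
		case SR_JOINING:
			/* Fall back out if lagging, else wait for the join point. */
			if (!lag_acceptable)
				return SR_UNAVAILABLE;
			return reached_join_point ? SR_AVAILABLE : -1;
		case SR_AVAILABLE:
			/* Start revoking the lease if the standby falls behind. */
			return lag_acceptable ? -1 : SR_REVOKING;
		case SR_REVOKING:
			/* Lag no longer matters; only the ack or the timeout does. */
			return revoke_acked_or_expired ? SR_UNAVAILABLE : -1;
	}
	return -1;
}
```

Note that a standby in the revoking state cannot go straight back to available even if it catches up: it has to pass through unavailable and joining again, which is what guarantees that no commit returns while an old lease might still be honoured.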
diff --git a/src/backend/utils/errcodes.txt b/src/backend/utils/errcodes.txt
index 4f7b9b6e5c9..baa18e84ec7 100644
--- a/src/backend/utils/errcodes.txt
+++ b/src/backend/utils/errcodes.txt
@@ -308,6 +308,7 @@ Section: Class 40 - Transaction Rollback
 40001    E    ERRCODE_T_R_SERIALIZATION_FAILURE                              serialization_failure
 40003    E    ERRCODE_T_R_STATEMENT_COMPLETION_UNKNOWN                       statement_completion_unknown
 40P01    E    ERRCODE_T_R_DEADLOCK_DETECTED                                  deadlock_detected
+40P02    E    ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE                   synchronous_replay_not_available
 
 Section: Class 42 - Syntax Error or Access Rule Violation
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8681ada33a4..5dfb91603ca 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1795,6 +1795,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay", PGC_USERSET, REPLICATION_STANDBY,
+		 gettext_noop("Enables synchronous replay."),
+		 NULL
+		},
+		&synchronous_replay,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"syslog_sequence_numbers", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Add sequence number to syslog messages to avoid duplicate suppression."),
@@ -3156,6 +3166,28 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_replay_max_lag", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the maximum allowed replay lag before standbys are removed from the synchronous replay set."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_max_lag,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"synchronous_replay_lease_time", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("Sets the duration of read leases granted to synchronous replay standbys."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&synchronous_replay_lease_time,
+		5000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4009,6 +4041,17 @@ static struct config_string ConfigureNamesString[] =
 		check_synchronous_standby_names, assign_synchronous_standby_names, NULL
 	},
 
+	{
+		{"synchronous_replay_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
+			gettext_noop("List of names of potential synchronous replay standbys."),
+			NULL,
+			GUC_LIST_INPUT
+		},
+		&synchronous_replay_standby_names,
+		"*",
+		check_synchronous_replay_standby_names, NULL, NULL
+	},
+
 	{
 		{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
 			gettext_noop("Sets default text search configuration."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c7f53470df4..d6069207dd1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -301,6 +301,17 @@
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
+#synchronous_replay_max_lag = 0s	# maximum replication delay to tolerate from
+					# standbys before dropping them from the synchronous
+					# replay set; 0 to disable synchronous replay
+
+#synchronous_replay_lease_time = 5s		# how long individual leases granted to
+					# synchronous replay standbys should last; should be 4 times
+					# the max possible clock skew
+
+#synchronous_replay_standby_names = '*'	# standby servers that can join the
+					# synchronous replay set; '*' = all
+
 # - Standby Servers -
 
 # These settings are ignored on a master server.
@@ -339,6 +350,14 @@
 					# (change requires restart)
 #max_sync_workers_per_subscription = 2	# taken from max_logical_replication_workers
 
+# - All Servers -
+
+#synchronous_replay = off			# "on" in any pair of consecutive
+					# transactions guarantees that the second
+					# can see the first (even if the second
+					# is run on a standby), or will raise an
+					# error to report that the standby is
+					# unavailable for synchronous replay
 
 #------------------------------------------------------------------------------
 # QUERY TUNING
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6e02585e10e..13cb7e39ab3 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -55,6 +55,8 @@
 #include "catalog/catalog.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walreceiver.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -333,6 +335,17 @@ GetTransactionSnapshot(void)
 			elog(ERROR,
 				 "cannot take query snapshot during a parallel operation");
 
+		/*
+		 * In synchronous_replay mode on a standby, check if we have definitely
+		 * applied WAL for any COMMIT that returned successfully on the
+		 * primary.
+		 */
+		if (synchronous_replay && RecoveryInProgress() &&
+			!WalRcvSyncReplayAvailable())
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE),
+					 errmsg("standby is not available for synchronous replay")));
+
 		/*
 		 * In transaction-snapshot mode, the first snapshot must live until
 		 * end of xact regardless of what the caller does with it, so we must
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 10429a529d9..552340200da 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -117,7 +117,7 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	static XLogRecPtr last_written_lsn = InvalidXLogRecPtr;
 	static XLogRecPtr last_fsync_lsn = InvalidXLogRecPtr;
 
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	/*
@@ -150,6 +150,8 @@ sendFeedback(PGconn *conn, TimestampTz now, bool force, bool replyRequested)
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	startpos = output_written_lsn;
 	last_written_lsn = output_written_lsn;
@@ -467,6 +469,8 @@ StreamLogicalLog(void)
 			 * rest.
 			 */
 			pos = 1;			/* skip msgtype 'k' */
+			pos += 8;			/* skip messageNumber */
+
 			walEnd = fe_recvint64(&copybuf[pos]);
 			output_written_lsn = Max(walEnd, output_written_lsn);
 
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 692d13716e4..110b0cec25d 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -328,7 +328,7 @@ writeTimeLineHistoryFile(StreamCtl *stream, char *filename, char *content)
 static bool
 sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyRequested)
 {
-	char		replybuf[1 + 8 + 8 + 8 + 8 + 1];
+	char		replybuf[1 + 8 + 8 + 8 + 8 + 1 + 8];
 	int			len = 0;
 
 	replybuf[len] = 'r';
@@ -346,6 +346,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, TimestampTz now, bool replyReque
 	len += 8;
 	replybuf[len] = replyRequested ? 1 : 0; /* replyRequested */
 	len += 1;
+	fe_sendint64(-1, &replybuf[len]);	/* replyTo */
+	len += 8;
 
 	if (PQputCopyData(conn, replybuf, len) <= 0 || PQflush(conn))
 	{
@@ -1016,6 +1018,7 @@ ProcessKeepaliveMsg(PGconn *conn, StreamCtl *stream, char *copybuf, int len,
 	 * check if the server requested a reply, and ignore the rest.
 	 */
 	pos = 1;					/* skip msgtype 'k' */
+	pos += 8;					/* skip messageNumber */
 	pos += 8;					/* skip walEnd */
 	pos += 8;					/* skip sendTime */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b8de13f03b9..20eec5705a8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5076,9 +5076,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,text,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,sync_replay,reply_time}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798e..e533e324482 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -833,7 +833,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_SYNC_REPLAY
 } WaitEventIPC;
 
 /* ----------
@@ -846,7 +847,8 @@ typedef enum
 {
 	WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
 	WAIT_EVENT_PG_SLEEP,
-	WAIT_EVENT_RECOVERY_APPLY_DELAY
+	WAIT_EVENT_RECOVERY_APPLY_DELAY,
+	WAIT_EVENT_SYNC_REPLAY_LEASE_REVOKE
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 913a8b08ce9..c78b450880f 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -15,6 +15,7 @@
 
 #include "access/xlogdefs.h"
 #include "utils/guc.h"
+#include "utils/timestamp.h"
 
 #define SyncRepRequested() \
 	(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
@@ -24,8 +25,9 @@
 #define SYNC_REP_WAIT_WRITE		0
 #define SYNC_REP_WAIT_FLUSH		1
 #define SYNC_REP_WAIT_APPLY		2
+#define SYNC_REP_WAIT_SYNC_REPLAY	3
 
-#define NUM_SYNC_REP_WAIT_MODE	3
+#define NUM_SYNC_REP_WAIT_MODE	4
 
 /* syncRepState */
 #define SYNC_REP_NOT_WAITING		0
@@ -36,6 +38,12 @@
 #define SYNC_REP_PRIORITY		0
 #define SYNC_REP_QUORUM		1
 
+/* GUC variables */
+extern int synchronous_replay_max_lag;
+extern int synchronous_replay_lease_time;
+extern bool synchronous_replay;
+extern char *synchronous_replay_standby_names;
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -71,7 +79,7 @@ extern void SyncRepCleanupAtProcExit(void);
 
 /* called by wal sender */
 extern void SyncRepInitConfig(void);
-extern void SyncRepReleaseWaiters(void);
+extern void SyncRepReleaseWaiters(bool walsender_cr_available_or_joining);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
@@ -79,8 +87,12 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
 
+/* called by wal sender */
+extern bool SyncReplayPotentialStandby(void);
+
 /* GUC infrastructure */
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+extern bool check_synchronous_replay_standby_names(char **newval, void **extra, GucSource source);
 extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e04d725ff58..a3da5eec978 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -83,6 +83,13 @@ typedef struct
 	XLogRecPtr	receivedUpto;
 	TimeLineID	receivedTLI;
 
+	/*
+	 * syncReplayLease is the time until which the primary has authorized this
+	 * standby to consider itself available for synchronous_replay mode, or 0
+	 * for not authorized.
+	 */
+	TimestampTz syncReplayLease;
+
 	/*
 	 * latestChunkStart is the starting byte position of the current "batch"
 	 * of received WAL.  It's actually the same as the previous value of
@@ -313,4 +320,6 @@ extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
 extern void WalRcvForceReply(void);
 
+extern bool WalRcvSyncReplayAvailable(void);
+
 #endif							/* _WALRECEIVER_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0dd6d1cf808..567e99a4a04 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -28,6 +28,14 @@ typedef enum WalSndState
 	WALSNDSTATE_STOPPING
 } WalSndState;
 
+typedef enum SyncReplayState
+{
+	SYNC_REPLAY_UNAVAILABLE = 0,
+	SYNC_REPLAY_JOINING,
+	SYNC_REPLAY_AVAILABLE,
+	SYNC_REPLAY_REVOKING
+} SyncReplayState;
+
 /*
  * Each walsender has a WalSnd struct in shared memory.
  *
@@ -60,6 +68,10 @@ typedef struct WalSnd
 	TimeOffset	flushLag;
 	TimeOffset	applyLag;
 
+	/* Synchronous replay state for this walsender. */
+	SyncReplayState syncReplayState;
+	TimestampTz revokingUntil;
+
 	/* Protects shared variables shown above. */
 	slock_t		mutex;
 
@@ -106,6 +118,14 @@ typedef struct
 	 */
 	bool		sync_standbys_defined;
 
+	/*
+	 * Until when must commits in synchronous replay stall?  This is used to
+	 * wait for synchronous replay leases to expire when a walsender exits
+	 * uncleanly, and we must stall synchronous replay commits until we're
+	 * sure that the remote server's lease has expired.
+	 */
+	TimestampTz	revokingUntil;
+
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2c8e21baa7e..8ea1114608b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1862,9 +1862,10 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
+    w.sync_replay,
     w.reply_time
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, sslcompression, ssl_client_dn, ssl_client_serial, ssl_issuer_dn)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, sync_replay, reply_time) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_ssl| SELECT s.pid,
     s.ssl,
-- 
2.20.1
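
For readers testing third-party clients against this patch, here is a minimal sketch of the extended standby reply ('r') message, as implied by the pg_recvlogical.c and receivelog.c hunks above: the buffer grows from 1 + 8 + 8 + 8 + 8 + 1 to 1 + 8 + 8 + 8 + 8 + 1 + 8 bytes, and a trailing replyTo field is appended (the frontend tools send -1). The names put_int64_be, build_reply and REPLY_MSG_LEN are hypothetical and exist only for this illustration; put_int64_be stands in for fe_sendint64 and is assumed to write in network byte order. The field order before replyTo is my reading of the existing message, not something the patch restates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the extended reply, mirroring replybuf[] in the patch. */
#define REPLY_MSG_LEN (1 + 8 + 8 + 8 + 8 + 1 + 8)

/* Hypothetical stand-in for fe_sendint64(): store value big-endian. */
static void
put_int64_be(int64_t value, char *buf)
{
	uint64_t	v = (uint64_t) value;
	int			i;

	for (i = 7; i >= 0; i--)
	{
		buf[i] = (char) (v & 0xFF);
		v >>= 8;
	}
}

/*
 * Fill buf (at least REPLY_MSG_LEN bytes) with a standby reply message
 * and return its length.  reply_to is the new trailing field; pass -1
 * when the reply does not answer a specific numbered message.
 */
static size_t
build_reply(char *buf, int64_t write_lsn, int64_t flush_lsn,
			int64_t apply_lsn, int64_t send_time,
			int reply_requested, int64_t reply_to)
{
	size_t		len = 0;

	buf[len] = 'r';				/* message type */
	len += 1;
	put_int64_be(write_lsn, &buf[len]);	/* write position */
	len += 8;
	put_int64_be(flush_lsn, &buf[len]);	/* flush position */
	len += 8;
	put_int64_be(apply_lsn, &buf[len]);	/* apply position */
	len += 8;
	put_int64_be(send_time, &buf[len]);	/* sendTime */
	len += 8;
	buf[len] = reply_requested ? 1 : 0; /* replyRequested */
	len += 1;
	put_int64_be(reply_to, &buf[len]);	/* replyTo (new in this patch) */
	len += 8;

	assert(len == REPLY_MSG_LEN);
	return len;
}
```

On the receiving side the keepalive ('k') message gains a matching 8-byte messageNumber immediately after the type byte, so a client that previously read walEnd at offset 1 would need to read it at offset 9 instead, as the "skip messageNumber" lines in the hunks above do.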