Sync Rep v19
Latest version of Sync Rep, which includes substantial internal changes
and simplifications from the previous version (25-30 changes).
It addresses all outstanding technical comments, typo fixes and docs
changes. I will continue with self-review and testing, though I actively
encourage others to test and report issues.
Interesting changes:
* docs updated
* names listed in synchronous_standby_names are now in priority order
* synchronous_standby_names = "*" matches all standby names
* pg_stat_replication now shows standby priority - this is an ordinal
number, so "1" means 1st, "2" means 2nd etc., while 0 means "not a sync
standby" (see the example query below).
The only *currently* outstanding point of discussion is the "when to
wait" debate, which we aren't moving quickly towards consensus on at
this stage. I see that as a "How should it work?" debate and something
we can chew over during Alpha/Beta, not as an immediate blocker to
commit.
Please comment on the patch and also watch changes to the repo
git://github.com/simon2ndQuadrant/postgres.git
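As a reminder of the intended usage (per the docs in the patch), with
synchronous replication on by default a single transaction can opt out
like this; a sketch only, and the table name is made up:

  BEGIN;
  SET LOCAL synchronous_replication TO off;  -- this commit won't wait for the standby
  UPDATE chat_messages SET read = true WHERE id = 1;  -- hypothetical table
  COMMIT;
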
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
sync_rep.v19.context.patch (text/x-patch)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2018,2023 **** SET ENABLE_SEQSCAN TO OFF;
--- 2018,2131 ----
</variablelist>
</sect2>
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until a
+ reply from the current synchronous standby indicates it has received
+ the commit record of the transaction. Synchronous standbys must
+ already have been defined (see <xref linkend="guc-sync-standby-names">).
+ </para>
+ <para>
+ This parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-sync-replication-timeout-client" xreflabel="sync_replication_timeout">
+ <term><varname>sync_replication_timeout</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>sync_replication_timeout</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and a synchronous standby is currently available
+ then the commit will wait for up to <varname>sync_replication_timeout</>
+ seconds before it returns a <quote>success</>. The commit will wait
+ forever for a confirmation when <varname>sync_replication_timeout</>
+ is set to 0.
+ </para>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and yet no synchronous standby is available when we commit, then we
+ don't wait at all.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-sync-standby-names" xreflabel="synchronous_standby_names">
+ <term><varname>synchronous_standby_names</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_standby_names</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies a priority ordered list of standby names that can offer
+ synchronous replication. At any one time there will be just one
+ synchronous standby that will wake sleeping users following commit.
+ The synchronous standby will be the first named standby that is
+ both currently connected and streaming data in real-time
+ (as shown by a state of "STREAMING"). Other standby servers
+ listed later will become potential synchronous standbys.
+ If the current synchronous standby disconnects for whatever reason
+ it will be replaced immediately with the next highest priority standby.
+ Specifying more than one standby name can allow very high availability.
+ </para>
+ <para>
+ The standby name is currently taken as the application_name of the
+ standby, as set in the primary_conninfo on the standby. Names are
+ not enforced for uniqueness. In case of duplicates one of the standbys
+ will be chosen to be the synchronous standby, though exactly which
+ one is indeterminate.
+ </para>
+ <para>
+ The default is the special entry <literal>*</> which matches any
+ application_name, including the default application name of
+ <literal>walsender</>. This is not recommended and a more carefully
+ thought through configuration will be desirable.
+ </para>
+ <para>
+ If a standby is removed from the list of servers then it will stop
+ being the synchronous standby, allowing another to take its place.
+ If the list is empty, synchronous replication will not be
+ possible, whatever the setting of <varname>synchronous_replication</>.
+ Standbys may also be added to the list without restarting the server.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-standby">
<title>Standby Servers</title>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 875,880 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 875,1107 ----
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to one synchronous standby
+ server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ When requesting synchronous replication, each commit of a
+ write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability, though only
+ if the sysadmin is cautious about the placement and management of the two
+ servers. Waiting for confirmation increases the user's confidence that the
+ changes will not be lost in the event of server crashes but it also
+ necessarily increases the response time for the requesting transaction.
+ The minimum wait time is the round-trip time between primary and standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message. All two-phase commit actions
+ require commit waits, including both prepare and commit.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ All parameters have useful default values, so we can enable
+ synchronous replication easily just by setting this on the primary:
+
+ <programlisting>
+ synchronous_replication = on
+ </programlisting>
+
+ When <varname>synchronous_replication</> is set, a commit will wait
+ for up to <varname>synchronous_replication_timeout</> seconds to
+ confirm that the standby has received the commit record. Both
+ <varname>synchronous_replication</> and
+ <varname>synchronous_replication_timeout</> can be set by individual
+ users, so can be configured in the configuration file, for particular
+ users or databases, or dynamically by applications programs.
+ It is possible for user sessions to reach timeout even though
+ standbys are communicating normally. In that case, the setting of
+ <varname>synchronous_replication_timeout</> is probably too low though
+ you probably have other system or network issues as well.
+ </para>
+
+ <para>
+ After a commit record has been written to disk on the primary, the
+ WAL record is then sent to the standby. The standby sends reply
+ messages each time a new batch of WAL data is received, unless
+ <varname>wal_receiver_status_interval</> is set to zero on the standby.
+ If the standby is the first matching standby, as specified in
+ <varname>synchronous_standby_names</> on the primary, the reply
+ messages from that standby will be used to wake users waiting for
+ confirmation the commit record has been received. These parameters
+ allow the administrator to specify which standby servers should be
+ synchronous standbys. Note that the configuration of synchronous
+ replication is mainly on the primary.
+ </para>
+
+ <para>
+ The default setting of <varname>sync_replication_timeout</> is
+ 120 seconds, to ensure that users do not wait forever if all specified
+ standby servers go down. If you wish to have stronger guarantees, the
+ timeout can be set higher, or even to zero, meaning wait forever.
+ Users will stop waiting if a fast shutdown is requested, though the
+ server does not fully shut down until all outstanding WAL records are
+ transferred to standby servers.
+ </para>
+
+ <para>
+ Note also that <varname>synchronous_commit</> is used when the user
+ specifies <varname>synchronous_replication</>, overriding even an
+ explicit setting of <varname>synchronous_commit</> to <literal>off</>.
+ This is because we must write WAL to disk on the primary before we
+ replicate, to ensure the standby never gets ahead of the primary.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes that are important customer details, and
+ 90% of changes that are less important data the business can more
+ easily survive losing, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ The easiest and safest method of gaining High Availability using
+ synchronous replication is to configure at least two standby servers.
+ To understand why, we need to examine what can happen when you lose all
+ standby servers.
+ </para>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until
+ the sync standby responds. The response may never occur if the last,
+ or only, standby should crash or the network drops. What should we do in
+ that situation?
+ </para>
+
+ <para>
+ If a standby was available at the time of commit, we will wait for
+ its reply. Sitting and waiting will typically cause operational problems
+ because it is an effective outage of the primary server should all
+ sessions end up waiting. This is why we offer the facility to set
+ <varname>sync_replication_timeout</>.
+ </para>
+
+ <para>
+ Once the last synchronous standby has been lost we allow transactions
+ to skip waiting, since we know there isn't anybody to reply, or at
+ least we might expect it to be some time before one returns. Note
+ that this provides high availability, but a primary server working
+ alone could allow changes that are not replicated to other servers,
+ placing your data at risk if the primary also fails.
+ </para>
+
+ <para>
+ The best solution for avoiding data loss is to ensure you don't lose
+ your last remaining sync standby. This can be achieved by naming multiple
+ potential synchronous standbys using <varname>synchronous_standby_names</>.
+ The first named standby will be used as the synchronous standby. Standbys
+ listed after this will take over the role of synchronous standby if the
+ first one should fail.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it will not yet be properly
+ synchronized. This is described as <literal>CATCHUP</> mode. Once
+ the lag between standby and primary reaches zero for the first time,
+ we move to real-time <literal>STREAMING</> state.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shut down, then the catch-up period
+ will increase according to the length of time the standby has been down.
+ The standby is only able to become a synchronous standby
+ once it has reached <literal>STREAMING</> state.
+ </para>
+
+ <para>
+ If the primary crashes while commits are waiting for acknowledgement, those
+ waiting transactions will be marked fully committed once the primary
+ database recovers.
+ There is no way to be certain that all standbys have received all
+ outstanding WAL data at the time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that the commands to run pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 56,61 ****
--- 56,62 ----
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/predicate.h"
#include "storage/procarray.h"
***************
*** 1071,1076 **** EndPrepare(GlobalTransaction gxact)
--- 1072,1085 ----
END_CRIT_SECTION();
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked the prepare, but still show as
+ * running in the procarray (twice!) and continue to hold locks.
+ */
+ SyncRepWaitForLSN(gxact->prepare_lsn);
+
records.tail = records.head = NULL;
}
***************
*** 2030,2035 **** RecordTransactionCommitPrepared(TransactionId xid,
--- 2039,2052 ----
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
***************
*** 2109,2112 **** RecordTransactionAbortPrepared(TransactionId xid,
--- 2126,2137 ----
TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 37,42 ****
--- 37,43 ----
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
***************
*** 1055,1061 **** RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
! if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
{
/*
* Synchronous commit case:
--- 1056,1062 ----
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
! if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
***************
*** 1125,1130 **** RecordTransactionCommit(void)
--- 1126,1139 ----
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***************
*** 521,526 **** CREATE VIEW pg_stat_replication AS
--- 521,527 ----
W.write_location,
W.flush_location,
W.replay_location,
+ W.sync_priority
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
***************
*** 1527,1532 **** AutoVacWorkerMain(int argc, char *argv[])
--- 1527,1539 ----
SetConfigOption("statement_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
/*
+ * Force synchronous replication off to allow regular maintenance even
+ * if we are waiting for standbys to connect. This is important to
+ * ensure we aren't blocked from performing anti-wraparound tasks.
+ */
+ SetConfigOption("synchronous_replication", "off", PGC_SUSET, PGC_S_OVERRIDE);
+
+ /*
* Get the info about the database we're going to work on.
*/
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 1836,1842 **** retry1:
errmsg("the database system is starting up")));
break;
case CAC_SHUTDOWN:
! ereport(FATAL,
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
errmsg("the database system is shutting down")));
break;
--- 1836,1843 ----
errmsg("the database system is starting up")));
break;
case CAC_SHUTDOWN:
! if (!am_walsender)
! ereport(FATAL,
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
errmsg("the database system is shutting down")));
break;
*** a/src/backend/replication/Makefile
--- b/src/backend/replication/Makefile
***************
*** 13,19 **** top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
! repl_gram.o
include $(top_srcdir)/src/backend/common.mk
--- 13,19 ----
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
! repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/replication/syncrep.c
***************
*** 0 ****
--- 1,617 ----
+ /*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the standby, or the wait hits timeout.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary defines which standbys
+ * it wishes to wait for. The standby is completely unaware of the
+ * durability requirements of transactions on the primary, reducing the
+ * complexity of the code and streamlining both standby operations and
+ * network bandwidth because there is no requirement to ship
+ * per-transaction state information.
+ *
+ * The bookkeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then in 9.1 we wait for the flush location on the
+ * standby before releasing the waiting backend. Further complexity
+ * in that interaction is expected in later releases.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching through all waiters each time we receive a reply.
+ *
+ * Starting sync replication is a multi-stage process. First, the standby
+ * must be a potential synchronous standby. Next, we must have caught up
+ * with the primary; that may take some time. If there is no current
+ * synchronous standby then the WALsender will offer a sync rep service.
+ *
+ * Portions Copyright (c) 2010-2011, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include <unistd.h>
+
+ #include "access/xact.h"
+ #include "access/xlog_internal.h"
+ #include "miscadmin.h"
+ #include "postmaster/autovacuum.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
+ #include "storage/latch.h"
+ #include "storage/ipc.h"
+ #include "storage/pmsignal.h"
+ #include "storage/proc.h"
+ #include "utils/builtins.h"
+ #include "utils/guc.h"
+ #include "utils/guc_tables.h"
+ #include "utils/memutils.h"
+ #include "utils/ps_status.h"
+
+ /* User-settable parameters for sync rep */
+ bool sync_rep_mode = false; /* Only set in user backends */
+ int sync_rep_timeout = 120; /* Only set in user backends */
+ char *SyncRepStandbyNames;
+
+ bool WaitingForSyncRep = false; /* Global state for some exit methods */
+
+ #define IsOnSyncRepQueue() (MyProc->lwWaiting)
+
+ static bool announce_next_takeover = true;
+
+ static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN);
+ static void SyncRepRemoveFromQueue(void);
+ static void SyncRepAddToQueue(void);
+ static long SyncRepGetWaitTimeout(void);
+
+ static int SyncRepGetStandbyPriority(void);
+ static int SyncRepWakeQueue(void);
+
+
+ /*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+ /*
+ * Wait for synchronous replication, if requested by user.
+ */
+ void
+ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+ {
+ /*
+ * Fast exit if user has not requested sync replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (!SyncRepRequested() || max_wal_senders == 0)
+ return;
+
+ /*
+ * Wait on queue. We check for a fast exit once we have the lock.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN);
+ }
+
+ void
+ SyncRepCleanupAtProcExit(int code, Datum arg)
+ {
+ if (IsOnSyncRepQueue())
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ SyncRepRemoveFromQueue();
+ LWLockRelease(SyncRepLock);
+ }
+
+ if (MyProc != NULL)
+ DisownLatch(&MyProc->waitLatch);
+ }
+
+ /*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+ static void
+ SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
+ {
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue);
+ TimestampTz now = GetCurrentTransactionStopTimestamp();
+ long timeout = SyncRepGetWaitTimeout();
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+ bool wait_on_queue = false;
+
+ ereport(DEBUG3,
+ (errmsg("synchronous replication waiting for %X/%X starting at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(GetCurrentTransactionStopTimestamp()))));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+ /*
+ * First time through, add ourselves to the queue.
+ */
+ if (!IsOnSyncRepQueue())
+ {
+ int i;
+
+ /*
+ * Wait no longer if we have already reached our LSN
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ {
+ /* No need to wait */
+ LWLockRelease(SyncRepLock);
+ return;
+ }
+
+ /*
+ * Check that we have at least one sync standby active that
+ * has caught up with the primary.
+ */
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ walsnd->state == WALSNDSTATE_STREAMING)
+ {
+ wait_on_queue = true;
+ break;
+ }
+ }
+
+ /*
+ * Leave quickly if we don't have a sync standby that will
+ * confirm it has received our commit.
+ */
+ if (!wait_on_queue)
+ {
+ LWLockRelease(SyncRepLock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SyncRepAddToQueue();
+ LWLockRelease(SyncRepLock);
+ WaitingForSyncRep = true;
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ if (update_process_title)
+ {
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 21 + 1);
+ memcpy(new_status, old_status, len);
+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting" */
+ }
+ }
+ else
+ {
+ bool release = false;
+ bool timed_out = false;
+
+ /*
+ * Check the LSN on our queue and if it's moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ release = true;
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ now, timeout))
+ {
+ release = true;
+ timed_out = true;
+ }
+
+ if (release)
+ {
+ SyncRepRemoveFromQueue();
+ LWLockRelease(SyncRepLock);
+ WaitingForSyncRep = false;
+
+ /*
+ * Reset our waitLSN.
+ */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ /*
+ * Our response to the timeout is to simply post a NOTICE and
+ * then return to the user. The commit has happened, we just
+ * haven't been able to verify it has been replicated in the
+ * way requested.
+ */
+ if (timed_out)
+ ereport(NOTICE,
+ (errmsg("synchronous replication timeout at %s",
+ timestamptz_to_str(now))));
+ else
+ ereport(DEBUG3,
+ (errmsg("synchronous replication wait complete at %s",
+ timestamptz_to_str(now))));
+ return;
+ }
+
+ LWLockRelease(SyncRepLock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, timeout);
+ now = GetCurrentTimestamp();
+ }
+ }
+
+ /*
+ * Remove myself from sync rep wait queue.
+ *
+ * Assume on queue at start; will not be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ */
+ static void
+ SyncRepRemoveFromQueue(void)
+ {
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue);
+ PGPROC *proc = queue->head;
+
+ Assert(IsOnSyncRepQueue());
+
+ proc = queue->head;
+
+ if (proc == MyProc)
+ {
+ if (MyProc->lwWaitLink == NULL)
+ {
+ /*
+ * We were the only waiter on the queue. Reset head and tail.
+ */
+ Assert(queue->tail == MyProc);
+ queue->head = NULL;
+ queue->tail = NULL;
+ }
+ else
+ /*
+ * Move head to next proc on the queue.
+ */
+ queue->head = MyProc->lwWaitLink;
+ }
+ else
+ {
+ bool found = false;
+
+ while (proc->lwWaitLink != NULL)
+ {
+ /* Are we the next proc in our traversal of the queue? */
+ if (proc->lwWaitLink == MyProc)
+ {
+ /*
+ * Remove ourselves from middle of queue.
+ * No need to touch head or tail.
+ */
+ proc->lwWaitLink = MyProc->lwWaitLink;
+ found = true;
+ break;
+ }
+
+ proc = proc->lwWaitLink;
+ }
+
+ if (!found)
+ elog(WARNING, "could not locate ourselves on wait queue");
+
+ if (proc->lwWaitLink == NULL) /* At tail */
+ {
+ Assert(proc != MyProc);
+ /* Remove ourselves from tail of queue */
+ Assert(queue->tail == MyProc);
+ queue->tail = proc;
+ proc->lwWaitLink = NULL;
+ }
+ }
+ MyProc->lwWaitLink = NULL;
+ MyProc->lwWaiting = false;
+ }
+
+ /*
+ * Add myself to sync rep wait queue.
+ *
+ * Assume not on queue at start; will be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ */
+ static void
+ SyncRepAddToQueue(void)
+ {
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue);
+ PGPROC *tail = queue->tail;
+
+ /*
+ * Add myself to tail of wait queue.
+ */
+ if (tail == NULL)
+ {
+ queue->head = MyProc;
+ queue->tail = MyProc;
+ }
+ else
+ {
+ /*
+ * XXX extra code needed here to maintain sorted invariant.
+ * Our approach should be the same as a racing car - slow in, fast out.
+ */
+ Assert(tail->lwWaitLink == NULL);
+ tail->lwWaitLink = MyProc;
+ }
+ queue->tail = MyProc;
+
+ MyProc->lwWaiting = true;
+ MyProc->lwWaitLink = NULL;
+ }
+
+ /*
+ * Return a value that we can use directly in WaitLatch(). We need to
+ * handle special values, plus convert from seconds to microseconds.
+ *
+ */
+ static long
+ SyncRepGetWaitTimeout(void)
+ {
+ if (sync_rep_timeout == 0)
+ return -1L;
+
+ return 1000000L * sync_rep_timeout;
+ }
+
+ /*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+ /*
+ * Take any action required to initialise sync rep state from config
+ * data. Called at WALSender startup and after each SIGHUP.
+ */
+ void
+ SyncRepInitConfig(void)
+ {
+ int priority;
+
+ /*
+ * Determine if we are a potential sync standby and remember the result
+ * for handling replies from standby.
+ */
+ priority = SyncRepGetStandbyPriority();
+ if (MyWalSnd->sync_standby_priority != priority)
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ MyWalSnd->sync_standby_priority = priority;
+ LWLockRelease(SyncRepLock);
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" now has synchronous standby priority %u",
+ application_name, priority)));
+ }
+ }
+
+ /*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and what
+ * perhaps also which information we store as well.
+ */
+ void
+ SyncRepReleaseWaiters(void)
+ {
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue);
+ volatile WalSnd *syncWalSnd = NULL;
+ int numprocs = 0;
+ int priority = 0;
+ int i;
+
+ /*
+ * If this WALSender is serving a standby that is not on the list of
+ * potential standbys then we have nothing to do. If we are still
+ * starting up or still running base backup, then leave quickly also.
+ */
+ if (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_CATCHUP)
+ return;
+
+ /*
+ * We're a potential sync standby. Release waiters if we are the
+ * highest priority standby. We do this even if the standby is not yet
+ * caught up, in case this is a restart situation and
+ * there are backends waiting for us. That allows backends to exit the
+ * wait state even if new backends cannot yet enter the wait state.
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &walsndctl->walsnds[i];
+
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority < walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ syncWalSnd = walsnd;
+ }
+ }
+
+ /*
+ * We should have found ourselves at least.
+ */
+ Assert(syncWalSnd);
+
+ /*
+ * If we aren't managing the highest priority standby then just leave.
+ */
+ if (syncWalSnd != MyWalSnd)
+ {
+ LWLockRelease(SyncRepLock);
+ announce_next_takeover = true;
+ return;
+ }
+
+ if (XLByteLT(queue->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ queue->lsn = MyWalSnd->flush;
+ numprocs = SyncRepWakeQueue();
+ }
+
+ LWLockRelease(SyncRepLock);
+
+ elog(DEBUG3, "released %d procs up to %X/%X",
+ numprocs,
+ MyWalSnd->flush.xlogid,
+ MyWalSnd->flush.xrecoff);
+
+ /*
+ * If we are managing the highest priority standby, though we weren't
+ * prior to this, then announce we are now the sync standby.
+ */
+ if (announce_next_takeover)
+ {
+ announce_next_takeover = false;
+ ereport(LOG,
+ (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
+ application_name, MyWalSnd->sync_standby_priority)));
+ }
+ }
+
+ /*
+ * Check if we are in the list of sync standbys, and if so, determine
+ * priority sequence. Return priority if set, or zero to indicate that
+ * we are not a potential sync standby.
+ *
+ * Compare the parameter SyncRepStandbyNames against the application_name
+ * for this WALSender, or allow any name if we find a wildcard "*".
+ */
+ static int
+ SyncRepGetStandbyPriority(void)
+ {
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ int priority = 0;
+ bool found = false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(SyncRepStandbyNames);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid list syntax for parameter \"synchronous_standby_names\"")));
+ return 0;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ priority++;
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return (found ? priority : 0);
+ }
+
+ /*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold SyncRepLock
+ */
+ static int
+ SyncRepWakeQueue(void)
+ {
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+
+ /* fast exit for empty queue */
+ if (proc == NULL)
+ return 0;
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ /*
+ * Assume the queue is ordered by LSN
+ */
+ if (XLByteLT(queue->lsn, proc->waitLSN))
+ return numprocs;
+
+ numprocs++;
+ SetLatch(&proc->waitLatch);
+ }
+
+ return numprocs;
+ }
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 66,72 ****
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
! static WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
--- 66,72 ----
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
! WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
***************
*** 174,179 **** WalSenderMain(void)
--- 174,181 ----
SpinLockRelease(&walsnd->mutex);
}
+ SyncRepInitConfig();
+
/* Main loop of walsender */
return WalSndLoop();
}
***************
*** 584,589 **** ProcessStandbyReplyMessage(void)
--- 586,593 ----
walsnd->apply = reply.apply;
SpinLockRelease(&walsnd->mutex);
}
+
+ SyncRepReleaseWaiters();
}
/*
***************
*** 700,705 **** WalSndLoop(void)
--- 704,710 ----
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
+ SyncRepInitConfig();
}
/*
***************
*** 771,777 **** WalSndLoop(void)
--- 776,787 ----
* that point might wait for some time.
*/
if (MyWalSnd->state == WALSNDSTATE_CATCHUP && caughtup)
+ {
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" has now caught up with primary",
+ application_name)));
WalSndSetState(WALSNDSTATE_STREAMING);
+ }
ProcessRepliesIfAny();
}
***************
*** 1304,1310 **** WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
! #define PG_STAT_GET_WAL_SENDERS_COLS 6
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
--- 1314,1320 ----
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
! #define PG_STAT_GET_WAL_SENDERS_COLS 7
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
***************
*** 1346,1351 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
--- 1356,1362 ----
XLogRecPtr write;
XLogRecPtr flush;
XLogRecPtr apply;
+ int sync_priority;
WalSndState state;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
***************
*** 1361,1366 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
--- 1372,1381 ----
apply = walsnd->apply;
SpinLockRelease(&walsnd->mutex);
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_priority = walsnd->sync_standby_priority;
+ LWLockRelease(SyncRepLock);
+
memset(nulls, 0, sizeof(nulls));
values[0] = Int32GetDatum(walsnd->pid);
***************
*** 1370,1380 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* Only superusers can see details. Other users only get
* the pid value to know it's a walsender, but no details.
*/
! nulls[1] = true;
! nulls[2] = true;
! nulls[3] = true;
! nulls[4] = true;
! nulls[5] = true;
}
else
{
--- 1385,1391 ----
* Only superusers can see details. Other users only get
* the pid value to know it's a walsender, but no details.
*/
! MemSet(&nulls[1], true, PG_STAT_GET_WAL_SENDERS_COLS - 1);
}
else
{
***************
*** 1401,1406 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
--- 1412,1419 ----
snprintf(location, sizeof(location), "%X/%X",
apply.xlogid, apply.xrecoff);
values[5] = CStringGetTextDatum(location);
+
+ values[6] = Int32GetDatum(sync_priority);
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 39,44 ****
--- 39,45 ----
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+ #include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
***************
*** 196,201 **** InitProcGlobal(void)
--- 197,203 ----
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
***************
*** 214,219 **** InitProcGlobal(void)
--- 216,222 ----
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
***************
*** 224,229 **** InitProcGlobal(void)
--- 227,233 ----
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&procs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
***************
*** 326,331 **** InitProcess(void)
--- 330,341 ----
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
***************
*** 365,370 **** InitProcessPhase2(void)
--- 375,381 ----
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 2861,2866 **** ProcessInterrupts(void)
--- 2861,2894 ----
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating autovacuum process due to administrator command")));
+ else if (WaitingForSyncRep)
+ {
+ /*
+ * This must NOT be a FATAL message. We want the state of the
+ * transaction being aborted to be indeterminate to ensure that
+ * the transaction completion guarantee is never broken.
+ */
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection because fast shutdown is requested"),
+ errdetail("This connection requested synchronous replication at commit"
+ " yet confirmation of replication has not been received."
+ " The transaction has committed locally and might be committed"
+ " on recently disconnected standby servers also.")));
+
+ /*
+ * We DO NOT want to run proc_exit() callbacks -- we're here because
+ * we are shutting down and don't want any code to stall or
+ * prevent that.
+ */
+ on_exit_reset();
+
+ /*
+ * Note we do exit(0) not exit(>0). This is to avoid forcing
+ * postmaster into a system reset cycle.
+ */
+ exit(0);
+ }
else if (RecoveryConflictPending && RecoveryConflictRetryable)
{
pgstat_report_recovery_conflict(RecoveryConflictReason);
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 55,60 ****
--- 55,61 ----
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+ #include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
***************
*** 754,759 **** static struct config_bool ConfigureNamesBool[] =
--- 755,768 ----
true, NULL, NULL
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_REPLICATION,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+ {
{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Continues processing past damaged page headers."),
gettext_noop("Detection of a damaged page header normally causes PostgreSQL to "
***************
*** 2161,2166 **** static struct config_int ConfigureNamesInt[] =
--- 2170,2185 ----
},
{
+ {"sync_replication_timeout", PGC_USERSET, WAL_REPLICATION,
+ gettext_noop("Sets the maximum wait time for a response from synchronous replication."),
+ gettext_noop("A value of 0 turns off the timeout."),
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout,
+ 120, 0, INT_MAX, NULL, NULL
+ },
+
+ {
{"track_activity_query_size", PGC_POSTMASTER, RESOURCES_MEM,
gettext_noop("Sets the size reserved for pg_stat_activity.current_query, in bytes."),
NULL,
***************
*** 2717,2722 **** static struct config_string ConfigureNamesString[] =
--- 2736,2751 ----
},
{
+ {"synchronous_standby_names", PGC_SIGHUP, WAL_REPLICATION,
+ gettext_noop("List of potential standby names to synchronise with."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &SyncRepStandbyNames,
+ "*", NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 184,190 ****
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
! # - Streaming Replication -
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
--- 184,200 ----
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
! # - Replication - User Settings
!
! #synchronous_replication = off # does commit wait for reply from standby
! #sync_replication_timeout = 120 # 0 means wait forever
!
! # - Streaming Replication - Server Settings
!
! #synchronous_standby_names = '*' # standby servers that provide sync rep
! # comma-separated list of application_name from standby(s);
! # '*' = all (default)
!
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 3078,3084 **** DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
! DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25}" "{o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,replay_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
--- 3078,3084 ----
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
! DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25,23}" "{o,o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,replay_location,sync_priority}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
***************
*** 78,83 **** extern PGDLLIMPORT volatile uint32 CritSectionCount;
--- 78,86 ----
/* in tcop/postgres.c */
extern void ProcessInterrupts(void);
+ /* in replication/syncrep.c */
+ extern bool WaitingForSyncRep;
+
#ifndef WIN32
#define CHECK_FOR_INTERRUPTS() \
*** /dev/null
--- b/src/include/replication/syncrep.h
***************
*** 0 ****
--- 1,53 ----
+ /*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2011, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef _SYNCREP_H
+ #define _SYNCREP_H
+
+ #include "access/xlog.h"
+ #include "storage/proc.h"
+ #include "storage/shmem.h"
+ #include "storage/spin.h"
+
+ #define SyncRepRequested() (sync_rep_mode)
+
+ /*
+ * Each synchronous rep queue lives in the WAL sender shmem area.
+ */
+ typedef struct SyncRepQueue
+ {
+ /*
+ * Current location of the head of the queue. All waiters should have
+ * a waitLSN that follows this value, or they are currently being woken
+ * to remove themselves from the queue.
+ */
+ XLogRecPtr lsn;
+
+ PGPROC *head;
+ PGPROC *tail;
+ } SyncRepQueue;
+
+ /* user-settable parameters for synchronous replication */
+ extern bool sync_rep_mode;
+ extern int sync_rep_timeout;
+ extern char *SyncRepStandbyNames;
+
+ /* called by user backend */
+ extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+ /* callback at backend exit */
+ extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+ /* called by wal sender */
+ extern void SyncRepInitConfig(void);
+ extern void SyncRepReleaseWaiters(void);
+
+ #endif /* _SYNCREP_H */
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 15,20 ****
--- 15,21 ----
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+ #include "replication/syncrep.h"
#include "storage/spin.h"
***************
*** 52,62 **** typedef struct WalSnd
--- 53,77 ----
* to do.
*/
Latch latch;
+
+ /*
+ * The priority order of the standby managed by this WALSender, as
+ * listed in synchronous_standby_names, or 0 if not-listed.
+ * Protected by SyncRepLock.
+ */
+ int sync_standby_priority;
} WalSnd;
+ extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Synchronous replication queue, protected by SyncRepLock.
+ */
+ SyncRepQueue sync_rep_queue; /* Proc queue, sorted by LSN */
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 78,83 **** typedef enum LWLockId
--- 78,84 ----
SerializableFinishedListLock,
SerializablePredicateLockListLock,
OldSerXidLock,
+ SyncRepLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 14,19 ****
--- 14,21 ----
#ifndef _PROC_H_
#define _PROC_H_
+ #include "access/xlog.h"
+ #include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
***************
*** 115,120 **** struct PGPROC
--- 117,126 ----
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
*** a/src/test/regress/expected/rules.out
--- b/src/test/regress/expected/rules.out
***************
*** 1298,1304 **** SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts, pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
! pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_hostname, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.replay_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, replay_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
--- 1298,1304 ----
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts, pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
! pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_hostname, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.replay_location, w.sync_priority FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, replay_location, sync_priority) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
On Thu, Mar 3, 2011 at 7:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Latest version of Sync Rep, which includes substantial internal changes
and simplifications from previous version. (25-30 changes).
Includes all outstanding technical comments, typos and docs. I will
continue to work on self review and test myself, though actively
encourage others to test and report issues.
Thanks for the patch!
* synchronous_standby_names = "*" matches all standby names
Using '*' as the default seems to lead the performance degradation by
being connected from unexpected synchronous standby.
* pg_stat_replication now shows standby priority - this is an ordinal
number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
standby".
monitoring.sgml should be updated.
Though I've not read whole of the patch yet, here is the current comment:
Using MyProc->lwWaiting and lwWaitLink for backends to wait for replication
looks fragile. Since they are used also by lwlock, the value of them can be
changed unexpectedly. Instead, how about defining dedicated variables for
replication?
+ else if (WaitingForSyncRep)
+ {
+ /*
+ * This must NOT be a FATAL message. We want the state of the
+ * transaction being aborted to be indeterminate to ensure that
+ * the transaction completion guarantee is never broken.
+ */
The backend can reach this code path after returning the commit to the client.
Instead, how about doing this in EndCommand, to close the connection before
returning the commit?
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_priority = walsnd->sync_standby_priority;
+ LWLockRelease(SyncRepLock);
LW_SHARED can be used here, instead.
+ /*
+ * Wait no longer if we have already reached our LSN
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ {
+ /* No need to wait */
+ LWLockRelease(SyncRepLock);
+ return;
+ }
It might take long to acquire SyncRepLock, so how about comparing
our LSN with WalSnd->flush before here?
replication_timeout_client depends on GetCurrentTransactionStopTimestamp().
In COMMIT case, it's OK. But In PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED cases, it seems problematic because they don't call
SetCurrentTransactionStopTimestamp().
In SyncRepWaitOnQueue, the backend can theoretically call WaitLatch() again
after the wake-up from the latch. In this case, the "timeout" should
be calculated
again. Otherwise, it would take unexpectedly very long to cause the timeout.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
* synchronous_standby_names = "*" matches all standby names
Using '*' as the default seems to lead the performance degradation by
being connected from unexpected synchronous standby.
You can configure it however you wish. It seemed better to have an out
of the box setting that was useful.
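For anyone trying the out-of-the-box behaviour, a quick way to check what the
primary has actually picked up (a minimal sketch, assuming the v19 GUC name and
the sync_priority column this patch adds to pg_stat_replication):
SHOW synchronous_standby_names;
SELECT application_name, state, sync_priority  -- 0 means "not a sync standby"
FROM pg_stat_replication
ORDER BY sync_priority;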
* pg_stat_replication now shows standby priority - this is an ordinal
number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
standby".monitoring.sgml should be updated.
Didn't think it needed to be, but I've added a few lines to explain.
Though I've not read whole of the patch yet, here is the current comment:
Using MyProc->lwWaiting and lwWaitLink for backends to wait for replication
looks fragile. Since they are used also by lwlock, the value of them can be
changed unexpectedly. Instead, how about defining dedicated variables for
replication?
Yes, I think the queue stuff needs a rewrite now.
+ else if (WaitingForSyncRep)
+ {
+ /*
+ * This must NOT be a FATAL message. We want the state of the
+ * transaction being aborted to be indeterminate to ensure that
+ * the transaction completion guarantee is never broken.
+ */
The backend can reach this code path after returning the commit to the client.
Instead, how about doing this in EndCommand, to close the connection before
returning the commit?
OK, will look.
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_priority = walsnd->sync_standby_priority;
+ LWLockRelease(SyncRepLock);
LW_SHARED can be used here, instead.
Seemed easier to keep it simple and have all lockers use LW_EXCLUSIVE.
But I've changed it for you.
+ /*
+ * Wait no longer if we have already reached our LSN
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ {
+ /* No need to wait */
+ LWLockRelease(SyncRepLock);
+ return;
+ }
It might take long to acquire SyncRepLock, so how about comparing
our LSN with WalSnd->flush before here?
If we're not the sync standby and we need to takeover the role of sync
standby we may need to issue a wakeup even though our standby reached
that LSN some time before. So we need to check each time.
replication_timeout_client depends on GetCurrentTransactionStopTimestamp().
In COMMIT case, it's OK. But In PREPARE TRANSACTION, COMMIT PREPARED
and ROLLBACK PREPARED cases, it seems problematic because they don't call
SetCurrentTransactionStopTimestamp().
Shame on them!
Seems reasonable that they should call
SetCurrentTransactionStopTimestamp().
I don't want to make a special case there for prepared transactions.
In SyncRepWaitOnQueue, the backend can theoretically call WaitLatch() again
after the wake-up from the latch. In this case, the "timeout" should
be calculated
again. Otherwise, it would take unexpectedly very long to cause the timeout.
That was originally modelled on the way the statement_timeout timer
works. If it gets nudged and wakes up too early it puts itself back to
sleep to wakeup at the same time again.
I've renamed the variables to make that clearer and edited slightly.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes:
On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
* synchronous_standby_names = "*" matches all standby names
Using '*' as the default seems to lead the performance degradation by
being connected from unexpected synchronous standby.
You can configure it however you wish. It seemed better to have an out
of the box setting that was useful.
Well the HBA still needs some opening before anyone can claim to be a
standby. I guess the default line would be commented out and no standby
would be accepted as synchronous by default, assuming this GUC is sighup.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, 2011-03-03 at 18:51 +0100, Dimitri Fontaine wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
* synchronous_standby_names = "*" matches all standby names
Using '*' as the default seems to lead the performance degradation by
being connected from unexpected synchronous standby.
You can configure it however you wish. It seemed better to have an out
of the box setting that was useful.
Well the HBA still needs some opening before anyone can claim to be a
standby. I guess the default line would be commented out and no standby
would be accepted as synchronous by default, assuming this GUC is sighup.
The patch sets "*" as the default, so all standbys are synchronous by
default.
Would you prefer it if it was blank, meaning no standbys are
synchronous, by default?
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Mar 3, 2011 at 1:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, 2011-03-03 at 18:51 +0100, Dimitri Fontaine wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
* synchronous_standby_names = "*" matches all standby names
Using '*' as the default seems to lead the performance degradation by
being connected from unexpected synchronous standby.
You can configure it however you wish. It seemed better to have an out
of the box setting that was useful.
Well the HBA still needs some opening before anyone can claim to be a
standby. I guess the default line would be commented out and no standby
would be accepted as synchronous by default, assuming this GUC is sighup.
The patch sets "*" as the default, so all standbys are synchronous by
default.
Would you prefer it if it was blank, meaning no standbys are
synchronous, by default?
I think * is a reasonable default.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2011-03-04 at 00:02 +0900, Fujii Masao wrote:
+ else if (WaitingForSyncRep)
+ {
+ /*
+ * This must NOT be a FATAL message. We want the state of the
+ * transaction being aborted to be indeterminate to ensure that
+ * the transaction completion guarantee is never broken.
+ */
The backend can reach this code path after returning the commit to the
client.
Instead, how about doing this in EndCommand, to close the connection
before
returning the commit?
I don't really understand this comment.
You can't get there after returning the COMMIT message. Once we have
finished waiting we set WaitingForSyncRep = false, before we return to
RecordTransactionCommit() and continue from there.
Anyway, this is code in the interrupt handler and only gets executed
when we receive SIGTERM for a fast shutdown.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 2011-03-03 11:53, Simon Riggs wrote:
Latest version of Sync Rep, which includes substantial internal changes
and simplifications from previous version. (25-30 changes).
Includes all outstanding technical comments, typos and docs. I will
continue to work on self review and test myself, though actively
encourage others to test and report issues.
Interesting changes
* docs updated
* names listed in synchronous_standby_names are now in priority order
* synchronous_standby_names = "*" matches all standby names
* pg_stat_replication now shows standby priority - this is an ordinal
number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
standby".
Some initial remarks:
1) this works nice:
application_name not in synchronous_standby_names -> sync_priority = 0 (OK)
change synchronous_standby_names to default *, reload conf ->
sync_priority = 1 (OK)
message in log file
LOG: 00000: standby "walreceiver" is now the synchronous standby with
priority 1
2) priorities
I have to get used to mapping the integers to synchronous replication
meaning.
0 -> asynchronous
1 -> the synchronous standby that is waited for
2 and higher -> potential syncs
Could it be hidden from the user? I liked asynchronous / synchronous /
potential synchronous
then the log message could be
LOG: 00000: standby "walreceiver" is now the synchronous standby
3) walreceiver is the default application name - could there be problems
when a second standby with that name connects (of course the same
question holds for two standbys with the same non-default application_name)?
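As an aside on point 2: until something like that is built in, the labels can
be derived client-side. A rough sketch, assuming the sync_priority column from
this patch and the rule (discussed elsewhere in this thread) that the connected
standby with the lowest non-zero priority is the one being waited on:
SELECT application_name,
       CASE
         WHEN sync_priority = 0 THEN 'asynchronous'
         WHEN sync_priority = (SELECT min(sync_priority)
                               FROM pg_stat_replication
                               WHERE sync_priority > 0) THEN 'synchronous'
         ELSE 'potential synchronous'
       END AS sync_label
FROM pg_stat_replication;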
regards
Yeb Havinga
On Thu, 2011-03-03 at 22:27 +0100, Yeb Havinga wrote:
On 2011-03-03 11:53, Simon Riggs wrote:
Latest version of Sync Rep, which includes substantial internal changes
and simplifications from previous version. (25-30 changes).
Includes all outstanding technical comments, typos and docs. I will
continue to work on self review and test myself, though actively
encourage others to test and report issues.
Interesting changes
* docs updated
* names listed in synchronous_standby_names are now in priority order
* synchronous_standby_names = "*" matches all standby names
* pg_stat_replication now shows standby priority - this is an ordinal
number so "1" means 1st, "2" means 2nd etc, though 0 means "not a sync
standby".Some initial remarks:
1) this works nice:
application_name not in synchronous_standby_names -> sync_priority = 0 (OK)
change synchronous_standby_names to default *, reload conf ->
sync_priority = 1 (OK)
message in log file
LOG: 00000: standby "walreceiver" is now the synchronous standby with
priority 1
2) priorities
I have to get used to mapping the integers to synchronous replication
meaning.
0 -> asynchronous
1 -> the synchronous standby that is waited for
2 and higher -> potential syncs
Could it be hidden from the user? I liked asynchronous / synchronous /
potential synchronous
Yes, that sounds good. I will leave it as it is now to gain other
comments since this need not delay commit.
then the log message could be
LOG: 00000: standby "walreceiver" is now the synchronous standby
The priority is mentioned in the LOG message, so you can understand what
happens when multiple standbys connect.
e.g.
if you have synchronous_standby_names = 'a, b, c'
and then the standbys connect in the order b, c, a then you will see log
messages
LOG: standby "b" is now the synchronous standby with priority 2
LOG: standby "a" is now the synchronous standby with priority 1
It's designed so no matter which order standbys arrive in it is the
highest priority standby that makes it to the front in the end.
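So, at any moment, the standby actually being waited on can be read off with
something like this (a sketch, assuming the v19 column names; the connected
standby with the lowest non-zero sync_priority is the current sync standby):
SELECT application_name, sync_priority
FROM pg_stat_replication
WHERE sync_priority > 0
ORDER BY sync_priority
LIMIT 1;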
3) walreceiver is the default application name - could there be problems
when a second standby with that name connects (of course the same
question holds for two standbys with the same non-default application_name)?
That's documented: in that case which standby is sync is indeterminate.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes:
Anyway, this is code in the interrupt handler and only gets executed
when we receive SIGTERM for a fast shutdown.
I trust it's not getting *directly* executed from the interrupt handler,
at least not without ImmediateInterruptOK.
regards, tom lane
On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Anyway, this is code in the interrupt handler and only gets executed
when we receive SIGTERM for a fast shutdown.
I trust it's not getting *directly* executed from the interrupt handler,
at least not without ImmediateInterruptOK.
Yes, the backend waits for replication while cancel/die interrupt is
being blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't
lead the waiting backend to there directly. The backend reaches there
after returning the result.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Mar 4, 2011 at 1:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Anyway, this is code in the interrupt handler and only gets executed
when we receive SIGTERM for a fast shutdown.
I trust it's not getting *directly* executed from the interrupt handler,
at least not without ImmediateInterruptOK.
Yes, the backend waits for replication while cancel/die interrupt is
being blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't
lead the waiting backend to there directly. The backend reaches there
after returning the result.
BTW, this is true in COMMIT and PREPARE cases, and false in
COMMIT PREPARED and ROLLBACK PREPARED cases. In the
latter cases, HOLD_INTERRUPTS() is not called before waiting for
replication.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, 2011-03-04 at 13:35 +0900, Fujii Masao wrote:
On Fri, Mar 4, 2011 at 1:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Mar 4, 2011 at 7:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Anyway, this is code in the interrupt handler and only gets executed
when we receive SIGTERM for a fast shutdown.
I trust it's not getting *directly* executed from the interrupt handler,
at least not without ImmediateInterruptOK.
Yes, the backend waits for replication while cancel/die interrupt is
being blocked, i.e., InterruptHoldoffCount > 0. So SIGTERM doesn't
lead the waiting backend to there directly. The backend reaches there
after returning the result.
BTW, this is true in COMMIT and PREPARE cases,
CommitTransaction() calls HOLD_INTERRUPTS() and then RESUME_INTERRUPTS(),
which was reasonable before we started waiting for syncrep. The
interrupt does occur *before* we send the message back, but doesn't work
effectively at interrupting the wait in the way you would like.
If we RESUME_INTERRUPTS() prior to waiting and then HOLD again that
would allow all signals not just SIGTERM. We would need to selectively
reject everything except SIGTERM messages.
Ideas?
Alter ProcessInterrupts() to accept an interrupt if ProcDiePending &&
WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little
scary, but looks like it will work.
and false in
COMMIT PREPARED and ROLLBACK PREPARED cases. In the
latter cases, HOLD_INTERRUPTS() is not called before waiting for
replication.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
signal_filter.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3063e0b..5d86deb 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2843,8 +2843,17 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
void
ProcessInterrupts(void)
{
- /* OK to accept interrupt now? */
- if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
+ /*
+ * OK to accept interrupt now?
+ *
+ * Normally this is very straightforward. We don't accept interrupts
+ * between HOLD_INTERRUPTS() and RESUME_INTERRUPTS().
+ *
+ * For SyncRep, we want to accept SIGTERM signals while other interrupts
+ * are held, so we have a special case solely when WaitingForSyncRep.
+ */
+ if ((InterruptHoldoffCount != 0 || CritSectionCount != 0) &&
+ !(WaitingForSyncRep && ProcDiePending))
return;
InterruptPending = false;
if (ProcDiePending)
On Fri, Mar 4, 2011 at 12:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Though I've not read whole of the patch yet, here is the current comment:
Here are another comments:
+#replication_timeout_client = 120 # 0 means wait forever
Typo: s/replication_timeout_client/sync_replication_timeout
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ wait_start, timeout))
If SetCurrentTransactionStopTimestamp() is called before (i.e., COMMIT case),
the return value of GetCurrentTransactionStopTimestamp() is the same as
"wait_start". So, in this case, the timeout never expires.
+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);
How about changing the message to something like "waiting for %X/%X"
(%X/%X indicates the LSN which the backend is waiting for)?
Please initialize MyProc->procWaitLink to NULL in InitProcess() as well as
do MyProc->lwWaitLink.
+ /*
+ * We're a potential sync standby. Release waiters if we are the
+ * highest priority standby. We do this even if the standby is not yet
+ * caught up, in case this is a restart situation and
+ * there are backends waiting for us. That allows backends to exit the
+ * wait state even if new backends cannot yet enter the wait state.
+ */
I don't think that it's good idea to switch the high priority standby which has
not caught up, to the sync one, especially when there is already another
sync standby. Because that degrades replication from sync to async for
a while, even though there is sync standby which has caught up.
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority < walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ syncWalSnd = walsnd;
+ }
According to the code, the last named standby has highest priority. But the
document says the opposite.
ISTM the waiting backends can be sent the wake-up signal by the
walsender multiple times since the walsender doesn't remove any
entry from the queue. Isn't this unsafe? waste of the cycle?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Mar 4, 2011 at 3:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
CommitTransaction() calls HOLD_INTERRUPTS() and then RESUME_INTERRUPTS(),
which was reasonable before we started waiting for syncrep. The
interrupt does occur *before* we send the message back, but doesn't work
effectively at interrupting the wait in the way you would like.
If we RESUME_INTERRUPTS() prior to waiting and then HOLD again that
would allow all signals not just SIGTERM. We would need to selectively
reject everything except SIGTERM messages.
Ideas?
Alter ProcessInterrupts() to accept an interrupt if ProcDiePending &&
WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little
scary, but looks like it will work.
If shutdown is requested before WaitingForSyncRep is set to TRUE and
after HOLD_INTERRUPTS() is called, the waiting backends cannot be
interrupted.
SIGTERM can be sent by pg_terminate_backend(). So we should check
whether shutdown is requested before emitting WARNING and closing
the connection. If it's not requested yet, I think that it's safe to return the
success indication to the client.
I think that it's safer to close the connection and terminate the backend
after cleaning all the resources. So, as I suggested before, how about
doing that in EndCommand()?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, 2011-03-04 at 17:34 +0900, Fujii Masao wrote:
On Fri, Mar 4, 2011 at 3:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
CommitTransaction() calls HOLD_INTERRUPTS() and then RESUME_INTERRUPTS(),
which was reasonable before we started waiting for syncrep. The
interrupt does occur *before* we send the message back, but doesn't work
effectively at interrupting the wait in the way you would like.
If we RESUME_INTERRUPTS() prior to waiting and then HOLD again that
would allow all signals not just SIGTERM. We would need to selectively
reject everything except SIGTERM messages.
Ideas?
Alter ProcessInterrupts() to accept an interrupt if ProcDiePending &&
WaitingForSyncRep and InterruptHoldoffCount > 0. That looks a little
scary, but looks like it will work.
If shutdown is requested before WaitingForSyncRep is set to TRUE and
after HOLD_INTERRUPTS() is called, the waiting backends cannot be
interrupted.
SIGTERM can be sent by pg_terminate_backend(). So we should check
whether shutdown is requested before emitting WARNING and closing
the connection. If it's not requested yet, I think that it's safe to return the
success indication to the client.
I'm not sure if that matters. Nobody apart from the postmaster knows
about a shutdown. All the other processes know is that they received
SIGTERM, which as you say could have been a specific user action aimed
at an individual process.
We need a way to end the wait state explicitly, so it seems easier to
make SIGTERM the initiating action, no matter how it is received.
The alternative is to handle it this way
1) set something in shared memory
2) set latch of all backends
3) have the backends read shared memory and then end the wait
Who would do (1) and (2)? Not the backend, it's sleeping; not the
postmaster, it's shm; nor a WALSender because it might not be there.
Seems like a lot of effort to avoid SIGTERM. Do we have a good reason
why we need that? Might it introduce other issues?
I think that it's safer to close the connection and terminate the backend
after cleaning all the resources. So, as I suggested before, how about
doing that in EndCommand()?
Yes, if we don't use SIGTERM then we would use EndCommand()
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2011-03-04 at 16:42 +0900, Fujii Masao wrote:
On Fri, Mar 4, 2011 at 12:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Though I've not read whole of the patch yet, here is the current comment:
Here are another comments:
+#replication_timeout_client = 120 # 0 means wait forever
Typo: s/replication_timeout_client/sync_replication_timeout
Done
+ else if (timeout > 0 && + TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(), + wait_start, timeout))If SetCurrentTransactionStopTimestamp() is called before (i.e., COMMIT case),
the return value of GetCurrentTransactionStopTimestamp() is the same as
"wait_start". So, in this case, the timeout never expires.
Don't understand (still)
+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);
How about changing the message to something like "waiting for %X/%X"
(%X/%X indicates the LSN which the backend is waiting for)?
Done
Please initialize MyProc->procWaitLink to NULL in InitProcess() as well as
do MyProc->lwWaitLink.
I'm rewriting that aspect now.
+ /*
+ * We're a potential sync standby. Release waiters if we are the
+ * highest priority standby. We do this even if the standby is not yet
+ * caught up, in case this is a restart situation and
+ * there are backends waiting for us. That allows backends to exit the
+ * wait state even if new backends cannot yet enter the wait state.
+ */
I don't think that it's good idea to switch the high priority standby which has
not caught up, to the sync one, especially when there is already another
sync standby. Because that degrades replication from sync to async for
a while, even though there is sync standby which has caught up.
OK, that wasn't really my intention. Changed.
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority < walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ syncWalSnd = walsnd;
+ }
According to the code, the last named standby has highest priority. But the
document says the opposite.
Priority is a difficult word here since "1" is the highest priority. I
deliberately avoided using the word "highest" in the code for that
reason.
The code above finds the lowest non-zero standby, which is correct as
documented.
ISTM the waiting backends can be sent the wake-up signal by the
walsender multiple times since the walsender doesn't remove any
entry from the queue. Isn't this unsafe? waste of the cycle?
It's ok to set a latch that isn't set. It's unlikely to wake someone
twice before they can remove themselves.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2011-03-04 at 10:51 +0000, Simon Riggs wrote:
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ wait_start, timeout))
If SetCurrentTransactionStopTimestamp() is called before (i.e., COMMIT case),
the return value of GetCurrentTransactionStopTimestamp() is the same as
"wait_start". So, in this case, the timeout never expires.
Don't understand (still)
OK, coffee has seeped into brain now, thanks.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 2011-03-03 11:53, Simon Riggs wrote:
Latest version of Sync Rep, which includes substantial internal changes
and simplifications from previous version. (25-30 changes).
Testing more with the post v19 version from github with HEAD
commit 009875662e1b47012e1f4b7d30eb9e238d1937f6
Author: Simon Riggs <simon@2ndquadrant.com>
Date: Fri Mar 4 06:13:43 2011 +0000
Allow SIGTERM messages in ProcessInterrupts() even when interrupts are
held, if WaitingForSyncRep
1) unexpected behaviour
- master has synchronous_standby_names = 'standby1,standby2,standby3'
- standby with 'standby2' connects first.
- LOG: 00000: standby "standby2" is now the synchronous standby with
priority 2
I'm still confused by the priority numbers. At first I thought that
priority 1 meant: this is the one that is currently waited for. Now I'm
not sure if this is the first potential standby that is not used, or
that it is actually the one waited for.
What I expected was that it would be connected with priority 1. And then
if the standby1 connect, it would become the one with prio1 and standby2
with prio2.
2) unexpected behaviour
- continued from above
- standby with 'asyncone' name connects next
- no log message on master
I expected a log message along the lines 'standby "asyncone" is now an
asynchronous standby'
3) more about log messages
- didn't get a log message that the asyncone standby stopped
- didn't get a log message that standby1 connected with priority 1
- after stop / start master, again only got a log that standby2
connected with priority 2
- pg_stat_replication showed both standb1 and standby2 with correct prio#
4) More about the priority stuff. At this point I figured out prio 2 can
also be 'the real sync'. Still I'd prefer in pg_stat_replication a
boolean that clearly shows 'this is the one', with a source that is
intimately connected to the syncrep implementation, instead of a
different implementation of 'if lowest connected priority and > 0, then
sync is true. If there are two different implementations, there is room
for differences, which doesn't feel right.
5) performance.
Seems to have dropped a few dozen %. With v17 I earlier got ~650 tps
and after some more tuning over 900 tps. Now with roughly the same setup
I get ~ 550 tps. Both versions on the same hardware, both compiled
without debugging, and I used the same postgresql.conf start config.
I'm currently thinking about a failure test that would check if a commit
has really waited for the standby. What's the worst thing to do to a
master server? Ideas are welcome :-)
#!/bin/sh
psql -c "create a big table with generate_series"
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
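For reference, one concrete SQL stand-in for the "create a big table"
placeholder above, purely illustrative (the table and column names are made up):
CREATE TABLE bigtab AS
  SELECT g AS id, md5(g::text) AS payload
  FROM generate_series(1, 10000000) g;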
regards,
Yeb Havinga
On Fri, 2011-03-04 at 12:24 +0100, Yeb Havinga wrote:
On 2011-03-03 11:53, Simon Riggs wrote:
Latest version of Sync Rep, which includes substantial internal changes
and simplifications from previous version. (25-30 changes).
Testing more with the post v19 version from github with HEAD
Thanks
commit 009875662e1b47012e1f4b7d30eb9e238d1937f6
Author: Simon Riggs <simon@2ndquadrant.com>
Date: Fri Mar 4 06:13:43 2011 +0000
Allow SIGTERM messages in ProcessInterrupts() even when interrupts are
held, if WaitingForSyncRep
1) unexpected behaviour
- master has synchronous_standby_names = 'standby1,standby2,standby3'
- standby with 'standby2' connects first.
- LOG: 00000: standby "standby2" is now the synchronous standby with
priority 2
I'm still confused by the priority numbers. At first I thought that
priority 1 meant: this is the one that is currently waited for. Now I'm
not sure if this is the first potential standby that is not used, or
that it is actually the one waited for.
What I expected was that it would be connected with priority 1. And then
if the standby1 connect, it would become the one with prio1 and standby2
with prio2.
The priority refers to the order in which that standby is listed in
synchronous_standby_names. That is not dependent upon who is currently
connected. It doesn't mean the order in which the currently connected
standbys will become the sync standby.
So the log message allows you to work out that "standby2" is connected
and will operate as sync standby until something mentioned earlier in
synchronous_standby_names, in this case standby1, connects.
2) unexpected behaviour
- continued from above
- standby with 'asyncone' name connects next
- no log message on master
I expected a log message along the lines 'standby "asyncone" is now an
asynchronous standby'
That would introduce messages where there currently aren't any, so I
left that out. I'll put it in for clarity.
3) more about log messages
- didn't get a log message that the asyncone standby stopped
OK
- didn't get a log message that standby1 connected with priority 1
Bad
- after stop / start master, again only got a log that standby2
connected with priority 2
Bad
- pg_stat_replication showed both standb1 and standby2 with correct prio#
Good
Please send me log output at DEBUG3 offline.
4) More about the priority stuff. At this point I figured out prio 2 can
also be 'the real sync'. Still I'd prefer in pg_stat_replication a
boolean that clearly shows 'this is the one', with a source that is
intimately connected to the syncrep implementation, instead of a
different implementation of 'if lowest connected priority and > 0, then
sync is true. If there are two different implementations, there is room
for differences, which doesn't feel right.
OK
5) performance.
Seems to have dropped a few dozen %. With v17 I earlier got ~650 tps
and after some more tuning over 900 tps. Now with roughly the same setup
I get ~ 550 tps. Both versions on the same hardware, both compiled
without debugging, and I used the same postgresql.conf start config.
Will need to re-look at performance after commit
I'm currently thinking about a failure test that would check if a commit
has really waited for the standby. What's the worst thing to do to a
master server? Ideas are welcome :-)
#!/bin/sh
psql -c "create a big table with generate_series"
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
regards,
Yeb Havinga
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 2011-03-04 12:24, Yeb Havinga wrote:
I'm currently thinking about a failure test that would check if a
commit has really waited for the standby. What's the worst thing to do
to a master server? Ideas are welcome :-)
#!/bin/sh
psql -c "create a big table with generate_series"
echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger
Did that with both a sync and async standby server, then promoted both
replicas.
Both replicas had the complete big table. Maybe the async server was
somehow 'saved' by the master waiting for the sync server? Test repeated
with only the async one connected.
The master then shows this at restart
LOG: 00000: record with zero length at 4/B2CD3598
LOG: 00000: redo done at 4/B2CD3558
LOG: 00000: last completed transaction was at log time 2011-03-04
14:43:31.02041+01
The async promoted server
LOG: 00000: record with zero length at 4/B2CC9260
LOG: 00000: redo done at 4/B2CC9220
LOG: 00000: last completed transaction was at log time 2011-03-04
14:43:31.018444+01
Even though the async server had the complete relation I created,
something was apparently done just before the reboot.
Test repeated with only 1 sync standby
Then on master at recovery
LOG: 00000: record with zero length at 4/D1051C88
LOG: 00000: redo done at 4/D1051C48
LOG: 00000: last completed transaction was at log time 2011-03-04
14:52:11.035188+01
on the sync promoted server
LOG: 00000: redo done at 4/D1051C48
LOG: 00000: last completed transaction was at log time 2011-03-04
14:52:11.035188+01
Nice!
regards,
Yeb Havinga
On Fri, Mar 4, 2011 at 7:51 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority < walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ syncWalSnd = walsnd;
+ }
According to the code, the last named standby has highest priority. But the
document says the opposite.
Priority is a difficult word here since "1" is the highest priority. I
deliberately avoided using the word "highest" in the code for that
reason.
The code above finds the lowest non-zero standby, which is correct as
documented.
Hmm.. that seems to find the highest standby. And, I could confirm
that in my box. Please see the following. The priority (= 2) of
synchronous standby (its sync_state is SYNC) is higher than that (= 1)
of potential one (its sync_state is POTENTIAL).
postgres=# SHOW synchronous_standby_names ;
synchronous_standby_names
---------------------------
one, two
(1 row)
postgres=# SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
application_name | state | sync_priority | sync_state
------------------+-----------+---------------+------------
one | STREAMING | 1 | POTENTIAL
two | STREAMING | 2 | SYNC
(2 rows)
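A quick way to probe for this inversion (a sketch, assuming the sync_priority
and sync_state columns shown above): the row marked SYNC should carry the
lowest non-zero priority, so this query should return no rows once the
selection logic is fixed.
SELECT application_name, sync_priority, sync_state
FROM pg_stat_replication
WHERE sync_state = 'SYNC'
  AND sync_priority > (SELECT min(sync_priority)
                       FROM pg_stat_replication
                       WHERE sync_priority > 0);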
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
postgres=# SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
application_name | state | sync_priority | sync_state
------------------+-----------+---------------+------------
one | STREAMING | 1 | POTENTIAL
two | STREAMING | 2 | SYNC
(2 rows)
Bug! Thanks.
Fixed
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Mar 4, 2011 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
SIGTERM can be sent by pg_terminate_backend(). So we should check
whether shutdown is requested before emitting WARNING and closing
the connection. If it's not requested yet, I think that it's safe to return the
success indication to the client.
I'm not sure if that matters. Nobody apart from the postmaster knows
about a shutdown. All the other processes know is that they received
SIGTERM, which as you say could have been a specific user action aimed
at an individual process.
We need a way to end the wait state explicitly, so it seems easier to
make SIGTERM the initiating action, no matter how it is received.
The alternative is to handle it this way
1) set something in shared memory
2) set latch of all backends
3) have the backends read shared memory and then end the wait
Who would do (1) and (2)? Not the backend, it's sleeping; not the
postmaster, it's shm; nor a WALSender because it might not be there.
Seems like a lot of effort to avoid SIGTERM. Do we have a good reason
why we need that? Might it introduce other issues?
On the second thought...
I was totally wrong. Preventing the backend from returning the commit
when shutdown is requested doesn't help to avoid the data loss at all.
Without shutdown, the following simple scenario can cause data loss.
1. Replication connection is closed because of network outage.
2. Though replication has not been completed, the waiting backend is
released since the timeout expires. Then it returns the success to
the client.
3. The primary crashes, and then the clusterware promotes the standby
which doesn't have the latest change on the primary to new primary.
Data loss happens!
In the first place, there are two kinds of data loss:
(A) Physical data loss
This is the case where we can never retrieve the committed data
physically. For example, if the storage of the standalone server gets
corrupted, we would lose some data forever. To avoid this type of
data loss, we would have to choose the "wait-forever" behavior. But
as I said in upthread, we can decrease the risk of this data loss to
a certain extent by spending much money on the storage. So, if that
cost is less than the cost which we have to pay when down-time
happens, we don't need to choose the "wait-forever" option.
(B) Logical data loss
This is the case where we think wrongly that the committed data
has been lost while we can actually retrieve it physically. For example,
in the above three-steps scenario, we can read all the committed data
from two servers physically even after failover. But since the client
attempts to read data only from new primary, some data looks lost to
the client. The "wait-forever" behavior can help also to avoid this type
of data loss. And, another way is to STONITH the standby before the
timeout releases any waiting backend. If so, we can completely prevent
the outdated standby from being brought up, and can avoid logical data
loss. According to my quick research, in DRBD, the "dopd (DRBD
outdate-peer daemon)" plays that role.
What I'd like to avoid is (B). Though (A) is more serious problem than (B),
we already have some techniques to decrease the risk of (A). But not
(B), I think.
The "wait-forever" might be a straightforward approach against (B). But
this option prevents transactions from running not only when the
synchronous standby goes away, but also when the primary is invoked
first or when the standby is promoted at failover. Since the availability
of the database service decreases very much, I don't want to use that.
Keeping transactions waiting in the latter two cases would be required
to avoid (A), but not (B). So I think that we can relax the "wait-forever"
option so that it allows not-replicated transactions to complete only in
those cases. IOW, when we initially start the primary, the backends
don't wait at all for new standby to connect. And, while new primary is
running alone after failover, the backends don't wait at all, too. Only
when replication connection is closed while streaming WAL to sync
standby, the backends wait until new sync standby has connected and
replication has been completed. Even in this case, if we want to
improve the service availability, we have only to make something like
dopd to STONITH the outdated standby, and then request the primary
to release the waiting backends. So I think that the interface to
request that release should be implemented.
Fortunately, that partial "wait-forever" behavior has already been
implemented in Simon's patch with the client timeout = 0 (disable).
If he implements the interface to release the waiting backends,
I'm OK with his design about when to release the backends for 9.1
(unless I'm missing something).
Thought?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Mar 4, 2011 at 3:04 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
The "wait-forever" might be a straightforward approach against (B). But
this option prevents transactions from running not only when the
synchronous standby goes away, but also when the primary is invoked
first or when the standby is promoted at failover. Since the availability
of the database service decreases very much, I don't want to use that.
I continue to think that wait-forever is the most sensible option. If
you want all of your data on the disks of two machines before the
commit is ack'd, I think you probably want that all the time. The
second scenario you mentioned ("when the standby is promoted at
failover") is quite easy to handle. If you don't want synchronous
replication after a standby promotion, then configure the master to do
synchronous replication and the slave not to do synchronous
replication. Similarly, if you've got an existing machine that is not
doing synchronous replication and you want to start, fire up the
standby in asynchronous mode and switch to synchronous replication
after it has fully caught up. It seems to me that we're bent on
providing a service that does synchronous replication except when it
first starts up or when the timeout expires or when the phase of the
moon is waxing gibbous, and I don't get the point of that. If I ask
for synchronous replication, I want it to be synchronous until I
explicitly turn it off. Otherwise, when I fail over, how do I know if
I've got all my transactions, or not?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
The patch sets "*" as the default, so all standbys are synchronous by
default.
Would you prefer it if it was blank, meaning no standbys are
synchronous, by default?
I think * is a reasonable default.
Actually i would prefer to have standbys asynchronous by default...
though it is true that there will be no waits until i set
synchronous_replication to on... 1) it could be confusing to see a
SYNC standby in pg_stat_replication by default when i wanted all of
them to be async, 2) also * will give priority 1 to all standbys so it
doesn't seem like a very useful out-of-the-box configuration, better
to make the dba write the standby names in the order they want
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL
On Fri, Mar 4, 2011 at 4:18 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
The patch sets "*" as the default, so all standbys are synchronous by
default.
Would you prefer it if it was blank, meaning no standbys are
synchronous, by default?
I think * is a reasonable default.
Actually i would prefer to have standbys asynchronous by default...
though it is true that there will be no waits until i set
synchronous_replication to on... 1) it could be confusing to see a
SYNC standby in pg_stat_replication by default when i wanted all of
them to be async, 2) also * will give priority 1 to all standbys so it
doesn't seem like a very useful out-of-the-box configuration, better
to make the dba write the standby names in the order they want
Mmm, good points.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2011-03-04 22:18, Jaime Casanova wrote:
On Thu, Mar 3, 2011 at 1:23 PM, Robert Haas<robertmhaas@gmail.com> wrote:
The patch sets "*" as the default, so all standbys are synchronous by
default.
Would you prefer it if it was blank, meaning no standbys are
synchronous, by default?
I think * is a reasonable default.
Actually i would prefer to have standbys asynchronous by default...
though it is true that there will be no waits until i set
synchronous_replication to on... 1) it could be confusing to see a
SYNC standby in pg_stat_replication by default when i wanted all of
them to be async,
I see no problem with * for synchronous_standby_names, such that *if*
synchronous_replication = on, then all standbys are sync. Also for the
beginning experimenter with sync rep: what would you expect after only
turning 'synchronous_replication' = on? ISTM better than: you need to
change two parameters from their default to get a replica in sync mode.
2) also * will give priority 1 to all standbys so it
doesn't seem like a very useful out-of-the-box configuration, better
to make the dba to write the standby names in the order they want
As somebody with a usecase for two hardware-wise equal sync replicas for
the same master (and a single async replica), the whole ordering of sync
standbys is too much feature anyway, since it will cause unnecessary
'which is the sync replica' switching. Besides that, letting all syncs
have the same priority sounds like the only thing the server can do, if
the dba has not specified it explicitly. I would see it as improvement
if order in standby_names doesn't mean priority, and that priority could
be specified with another parameter (and default: all sync priorities
the same)
regards,
Yeb Havinga
On Sat, 2011-03-05 at 05:04 +0900, Fujii Masao wrote:
On Fri, Mar 4, 2011 at 7:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
SIGTERM can be sent by pg_terminate_backend(). So we should check
whether shutdown is requested before emitting WARNING and closing
the connection. If it's not requested yet, I think that it's safe to return the
success indication to the client.
I'm not sure if that matters. Nobody apart from the postmaster knows
about a shutdown. All the other processes know is that they received
SIGTERM, which as you say could have been a specific user action aimed
at an individual process.
We need a way to end the wait state explicitly, so it seems easier to
make SIGTERM the initiating action, no matter how it is received.
The alternative is to handle it this way
1) set something in shared memory
2) set latch of all backends
3) have the backends read shared memory and then end the wait
Who would do (1) and (2)? Not the backend, it's sleeping; not the
postmaster, it's shm; nor a WALSender because it might not be there.
Seems like a lot of effort to avoid SIGTERM. Do we have a good reason
why we need that? Might it introduce other issues?
On the second thought...
I was totally wrong. Preventing the backend from returning the commit
when shutdown is requested doesn't help to avoid the data loss at all.
Without shutdown, the following simple scenario can cause data loss.
1. Replication connection is closed because of network outage.
2. Though replication has not been completed, the waiting backend is
released since the timeout expires. Then it returns the success to
the client.
3. The primary crashes, and then the clusterware promotes the standby
which doesn't have the latest change on the primary to new primary.
Data loss happens!
Yes, that can happen. As people will no doubt observe, this seems to be
an argument for wait-forever. What we actually need is a wait that lasts
longer than it takes for us to decide to failover, if the standby is
actually up and this is some kind of split brain situation. That way the
clients are still waiting when failover occurs. WAL is missing, but
since we didn't acknowledge the client we are OK to treat that situation
as if it were an abort.
In the first place, there are two kinds of data loss:
(A) Physical data loss
This is the case where we can never retrieve the committed data
physically. For example, if the storage of the standalone server gets
corrupted, we would lose some data forever. To avoid this type of
data loss, we would have to choose the "wait-forever" behavior. But
as I said in upthread, we can decrease the risk of this data loss to
a certain extent by spending much money on the storage. So, if that
cost is less than the cost which we have to pay when down-time
happens, we don't need to choose the "wait-forever" option.
(B) Logical data loss
This is the case where we think wrongly that the committed data
has been lost while we can actually retrieve it physically. For example,
in the above three-steps scenario, we can read all the committed data
from two servers physically even after failover. But since the client
attempts to read data only from new primary, some data looks lost to
the client. The "wait-forever" behavior can help also to avoid this type
of data loss. And, another way is to STONITH the standby before the
timeout releases any waiting backend. If so, we can completely prevent
the outdated standby from being brought up, and can avoid logical data
loss. According to my quick research, in DRBD, the "dopd (DRBD
outdate-peer daemon)" plays that role.
What I'd like to avoid is (B). Though (A) is more serious problem than (B),
we already have some techniques to decrease the risk of (A). But not
(B), I think.
The "wait-forever" might be a straightforward approach against (B). But
this option prevents transactions from running not only when the
synchronous standby goes away, but also when the primary is invoked
first or when the standby is promoted at failover. Since the availability
of the database service decreases very much, I don't want to use that.
Keeping transactions waiting in the latter two cases would be required
to avoid (A), but not (B). So I think that we can relax the "wait-forever"
option so that it allows not-replicated transactions to complete only in
those cases. IOW, when we initially start the primary, the backends
don't wait at all for new standby to connect. And, while new primary is
running alone after failover, the backends don't wait at all, too. Only
when replication connection is closed while streaming WAL to sync
standby, the backends wait until new sync standby has connected and
replication has been completed. Even in this case, if we want to
improve the service availability, we have only to make something like
dopd to STONITH the outdated standby, and then request the primary
to release the waiting backends. So I think that the interface to
request that release should be implemented.
Fortunately, that partial "wait-forever" behavior has already been
implemented in Simon's patch with the client timeout = 0 (disable).
If he implements the interface to release the waiting backends,
I'm OK with his design about when to release the backends for 9.1
(unless I'm missing something).
Almost-working patch attached for the above feature. Time to stop for
the day. Patch against current repo version.
Current repo version attached here also (v20), which includes all fixes
to all known technical issues, major polishing etc..
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
shutdown_without_completion.v1.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac82ebb..c6e3093 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -67,7 +67,7 @@ bool sync_rep_mode = false; /* Only set in user backends */
int sync_rep_timeout = 120; /* Only set in user backends */
char *SyncRepStandbyNames;
-bool WaitingForSyncRep = false; /* Global state for some exit methods */
+bool ExitSilentlyAtEndCommand = false; /* Global state for some exit methods */
static bool announce_next_takeover = true;
@@ -105,7 +105,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
void
SyncRepCleanupAtProcExit(int code, Datum arg)
{
- if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ if (!SHMQueueIsDetached(&(MyProc->syncrep_links)))
{
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
SHMQueueDelete(&(MyProc->syncrep_links));
@@ -194,7 +194,6 @@ SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
MyProc->waitLSN = XactCommitLSN;
SHMQueueInsertBefore(&(WalSndCtl->SyncRepQueue), &(MyProc->syncrep_links));
LWLockRelease(SyncRepLock);
- WaitingForSyncRep = true;
/*
* Alter ps display to show waiting for sync rep.
@@ -222,7 +221,7 @@ SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
* unlikely to be far enough, yet is possible. Next time we are
* woken we should be more lucky.
*/
- if (XLByteLE(XactCommitLSN, walsndctl->lsn))
+ if (XLByteLE(XactCommitLSN, walsndctl->lsn) || ProcDiePending)
release = true;
else if (timeout > 0 &&
TimestampDifferenceExceeds(wait_start, now, timeout))
@@ -235,7 +234,6 @@ SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
{
SHMQueueDelete(&(MyProc->syncrep_links));
LWLockRelease(SyncRepLock);
- WaitingForSyncRep = false;
/*
* Reset our waitLSN.
@@ -251,17 +249,39 @@ SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
}
/*
- * Our response to the timeout is to simply post a NOTICE and
- * then return to the user. The commit has happened, we just
- * haven't been able to verify it has been replicated in the
- * way requested.
+ * Our response to the timeout is to simply post a NOTICE
+ * and then return to the user. The commit has happened, we
+ * just haven't been able to verify it has been replicated
+ * in the way requested.
*/
if (timed_out)
+ {
ereport(NOTICE,
- (errmsg("synchronous replication wait for %X/%X timeout at %s",
+ (errmsg("timeout of synchronous replication wait for %X/%X",
XactCommitLSN.xlogid,
- XactCommitLSN.xrecoff,
- timestamptz_to_str(now))));
+ XactCommitLSN.xrecoff),
+ errdetail("Synchronous replication was requested"
+ " yet confirmation of replication has not been"
+ " received from the specified standby server."
+ " The transaction has committed locally and might"
+ " also be committed on some recently disconnected"
+ " standby servers.")));
+ }
+ else if (ProcDiePending)
+ {
+ /*
+ * Cancel the interrupt state and return to normal
+ * processing until we hit EndCommand(), where we
+ * exit without returning anything to client.
+ * This ensures that we don't break our guarantee that
+ * a commit message means the data is safe. Note that
+ * the guarantee doesn't work the other way around,
+ * the absence of a commit message doesn't mean it
+ * didn't commit.
+ */
+ ProcDiePending = false;
+ ExitSilentlyAtEndCommand = true;
+ }
else
ereport(DEBUG3,
(errmsg("synchronous replication wait for %X/%X complete at %s",
@@ -274,7 +294,13 @@ SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
LWLockRelease(SyncRepLock);
}
- WaitLatch(&MyProc->waitLatch, timeout);
+ /*
+ * If we've received a signal to shutdown or a specific signal to
+ * terminate this backend, don't wait, just loop straight back
+ * round to remove ourselves from the queue and go.
+ */
+ if (!ProcDiePending)
+ WaitLatch(&MyProc->waitLatch, timeout);
}
}
diff --git a/src/backend/tcop/dest.c b/src/backend/tcop/dest.c
index 24af3fb..2e8e17f 100644
--- a/src/backend/tcop/dest.c
+++ b/src/backend/tcop/dest.c
@@ -36,6 +36,8 @@
#include "executor/tstoreReceiver.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
+#include "miscadmin.h"
+#include "storage/ipc.h"
#include "utils/portal.h"
@@ -143,6 +145,39 @@ EndCommand(const char *commandTag, CommandDest dest)
case DestRemote:
case DestRemoteExecute:
+ if (ExitSilentlyAtEndCommand)
+ {
+ /*
+ * This must NOT be a FATAL message. We want the state of the
+ * transaction being aborted to be indeterminate to ensure that
+ * the transaction completion guarantee is never broken.
+ */
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection because of fast shutdown"
+ " while waiting for synchronous replication"),
+ errdetail("Synchronous replication was requested"
+ " yet confirmation of replication has not been"
+ " received from the specified standby server."
+ " The transaction has committed locally and might"
+ " also be committed on some recently disconnected"
+ " standby servers.")));
+
+ /*
+ * We DO NOT want to run proc_exit() callbacks -- we're here because
+ * we are shutting down and don't want any code to stall or
+ * prevent that.
+ */
+ on_exit_reset();
+
+ /*
+ * Note we do exit(0) not exit(>0). This is to avoid forcing
+ * postmaster into a system reset cycle if we are the only
+ * backend that received a SIGTERM.
+ */
+ exit(0);
+ }
+
/*
* We assume the commandTag is plain ASCII and therefore requires
* no encoding conversion.
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5d86deb..7d19b26 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2628,6 +2628,11 @@ die(SIGNAL_ARGS)
ProcDiePending = true;
/*
+ * Set this proc's wait latch to stop waiting
+ */
+ SetLatch(&(MyProc->waitLatch));
+
+ /*
* If it's safe to interrupt, and we're waiting for input or a lock,
* service the interrupt immediately
*/
@@ -2852,8 +2857,7 @@ ProcessInterrupts(void)
* For SyncRep, we want to accept SIGTERM signals while other interrupts
* are held, so we have a special case solely when WaitingForSyncRep.
*/
- if ((InterruptHoldoffCount != 0 || CritSectionCount != 0) &&
- !(WaitingForSyncRep && ProcDiePending))
+ if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
return;
InterruptPending = false;
if (ProcDiePending)
@@ -2870,34 +2874,6 @@ ProcessInterrupts(void)
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating autovacuum process due to administrator command")));
- else if (WaitingForSyncRep)
- {
- /*
- * This must NOT be a FATAL message. We want the state of the
- * transaction being aborted to be indeterminate to ensure that
- * the transaction completion guarantee is never broken.
- */
- ereport(WARNING,
- (errcode(ERRCODE_ADMIN_SHUTDOWN),
- errmsg("terminating connection because fast shutdown is requested"),
- errdetail("This connection requested synchronous replication at commit"
- " yet confirmation of replication has not been received."
- " The transaction has committed locally and might be committed"
- " on recently disconnected standby servers also.")));
-
- /*
- * We DO NOT want to run proc_exit() callbacks -- we're here because
- * we are shutting down and don't want any code to stall or
- * prevent that.
- */
- on_exit_reset();
-
- /*
- * Note we do exit(0) not exit(>0). This is to avoid forcing
- * postmaster into a system reset cycle.
- */
- exit(0);
- }
else if (RecoveryConflictPending && RecoveryConflictRetryable)
{
pgstat_report_recovery_conflict(RecoveryConflictReason);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2552e7..16067a5 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -79,7 +79,7 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
extern void ProcessInterrupts(void);
/* in replication/syncrep.c */
-extern bool WaitingForSyncRep;
+extern bool ExitSilentlyAtEndCommand;
#ifndef WIN32
sync_rep.v20.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8684414..8355056 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2018,6 +2018,114 @@ SET ENABLE_SEQSCAN TO OFF;
</variablelist>
</sect2>
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until a
+ reply from the current synchronous standby indicates it has received
+ the commit record of the transaction. Synchronous standbys must
+ already have been defined (see <xref linkend="guc-sync-standby-names">).
+ </para>
+ <para>
+ This parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-sync-replication-timeout-client" xreflabel="sync_replication_timeout">
+ <term><varname>sync_replication_timeout</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>sync_replication_timeout</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and a synchronous standby is currently available
+ then the commit will wait for up to <varname>sync_replication_timeout</>
+ seconds before it returns a <quote>success</>. The commit will wait
+ forever for a confirmation when <varname>sync_replication_timeout</>
+ is set to 0.
+ </para>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and yet no synchronous standby is available when we commit then we
+ don't wait at all.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-sync-standby-names" xreflabel="synchronous_standby_names">
+ <term><varname>synchronous_standby_names</varname> (<type>string</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_standby_names</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies a priority ordered list of standby names that can offer
+ synchronous replication. At any one time there will be just one
+ synchronous standby that will wake sleeping users following commit.
+ The synchronous standby will be the first named standby that is
+ both currently connected and streaming WAL in real-time
+ (as shown by a state of "STREAMING"). Other standby servers
+ listed later will become potential synchronous standbys.
+ If the current synchronous standby disconnects for whatever reason
+ it will be replaced immediately with the next highest priority standby.
+ Specifying more than one standby name can allow very high availability.
+ </para>
+ <para>
+ The standby name is currently taken as the application_name of the
+ standby, as set in the primary_conninfo on the standby. Names are
+ not enforced for uniqueness. In case of duplicates one of the standbys
+ will be chosen to be the synchronous standby, though exactly which
+ one is indeterminate.
+ </para>
+ <para>
+ The default is the special entry <literal>*</> which matches any
+ application_name, including the default application name of
+ <literal>walsender</>. This is not recommended; a more carefully
+ considered configuration is desirable.
+ </para>
+ <para>
+ If a standby is removed from the list of servers then it will stop
+ being the synchronous standby, allowing another to take its place.
+ If the list is empty, synchronous replication will not be
+ possible, whatever the setting of <varname>synchronous_replication</>.
+ Standbys may also be added to the list without restarting the server.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-standby">
<title>Standby Servers</title>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 37ba43b..76cd483 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -875,6 +875,233 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to one synchronous standby
+ server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ When requesting synchronous replication, each commit of a
+ write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability, though only
+ if the sysadmin is cautious about the placement and management of the two
+ servers. Waiting for confirmation increases the user's confidence that the
+ changes will not be lost in the event of server crashes but it also
+ necessarily increases the response time for the requesting transaction.
+ The minimum wait time is the roundtrip time between primary and standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message. All two-phase commit actions
+ require commit waits, including both prepare and commit.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ All parameters have useful default values, so we can enable
+ synchronous replication easily just by setting this on the primary
+
+<programlisting>
+synchronous_replication = on
+</programlisting>
+
+ When <varname>synchronous_replication</> is set, a commit will wait
+ for up to <varname>sync_replication_timeout</> seconds to
+ confirm that the standby has received the commit record. Both
+ <varname>synchronous_replication</> and
+ <varname>sync_replication_timeout</> can be set by individual
+ users, so they can be configured in the configuration file, for particular
+ users or databases, or dynamically by application programs.
+ It is possible for user sessions to reach the timeout even though
+ standbys are communicating normally. In that case, the setting of
+ <varname>sync_replication_timeout</> is probably too low, though
+ you probably have other system or network issues as well.
+ </para>
+
+ <para>
+ After a commit record has been written to disk on the primary the
+ WAL record is then sent to the standby. The standby sends reply
+ messages each time a new batch of WAL data is received, unless
+ <varname>wal_receiver_status_interval</> is set to zero on the standby.
+ If the standby is the first matching standby, as specified in
+ <varname>synchronous_standby_names</> on the primary, the reply
+ messages from that standby will be used to wake users waiting for
+ confirmation the commit record has been received. These parameters
+ allow the administrator to specify which standby servers should be
+ synchronous standbys. Note that the configuration of synchronous
+ replication is mainly on the master.
+ </para>
+
+ <para>
+ The default setting of <varname>sync_replication_timeout</> is
+ 120 seconds to ensure that users do not wait forever if all specified
+ standby servers go down. If you wish to have stronger guarantees, the
+ timeout can be set higher, or even to zero, meaning wait forever.
+ Users will stop waiting if a fast shutdown is requested, though the
+ server does not fully shut down until all outstanding WAL records are
+ transferred to standby servers.
+ </para>
+
+ <para>
+ Note also that <varname>synchronous_commit</> behavior is used when the
+ user specifies <varname>synchronous_replication</>, overriding even an
+ explicit setting of <varname>synchronous_commit</> to <literal>off</>.
+ This is because WAL must be written to disk on the primary before it is
+ replicated, to ensure the standby never gets ahead of the primary.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ The easiest and safest method of gaining High Availability using
+ synchronous replication is to configure at least two standby servers.
+ To understand why, we need to examine what can happen when you lose all
+ standby servers.
+ </para>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until
+ the sync standby responds. The response may never occur if the last,
+ or only, standby should crash or the network drops. What should we do in
+ that situation?
+ </para>
+
+ <para>
+ If a standby was available when the commit was requested, we will wait.
+ Sitting and waiting will typically cause operational problems,
+ because it is an effective outage of the primary server should all
+ sessions end up waiting. This is why we offer the facility to set
+ <varname>sync_replication_timeout</>.
+ </para>
+
+ <para>
+ Once the last synchronous standby has been lost we allow transactions
+ to skip waiting, since we know there isn't anybody to reply, or at
+ least we might expect it to be some time before one returns. You will
+ note that this provides high availability but a primary server working
+ alone could allow changes that are not replicated to other servers,
+ placing your data at risk if the primary fails also.
+ </para>
+
+ <para>
+ The best solution for avoiding data loss is to ensure you don't lose
+ your last remaining sync standby. This can be achieved by naming multiple
+ potential synchronous standbys using <varname>synchronous_standby_names</>.
+ The first named standby will be used as the synchronous standby. Standbys
+ listed after this will take over the role of synchronous standby if the
+ first one should fail.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it will not yet be properly
+ synchronized. This is described as <literal>CATCHUP</> mode. Once
+ the lag between standby and primary reaches zero for the first time
+ we move to real-time <literal>STREAMING</> state.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shut down, then the catch-up period
+ will increase according to the length of time the standby has been down.
+ The standby is only able to become a synchronous standby
+ once it has reached <literal>STREAMING</> state.
+ </para>
+
+ <para>
+ If the primary crashes while commits are waiting for acknowledgement, those
+ waiting transactions will be marked fully committed once the primary
+ database recovers.
+ There is no way to be certain that all standbys have received all
+ outstanding WAL data at time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index aaa613e..319a57c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -306,8 +306,11 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
location. In addition, the standby reports the last transaction log
position it received and wrote, the last position it flushed to disk,
and the last position it replayed, and this information is also
- displayed here. The columns detailing what exactly the connection is
- doing are only visible if the user examining the view is a superuser.
+ displayed here. If the standby's application name matches one of the
+ names in <varname>synchronous_standby_names</> then its sync_priority
+ is shown here also, that is, the order in which standbys will become
+ the synchronous standby. The columns detailing what exactly the connection
+ is doing are only visible if the user examining the view is a superuser.
The client's hostname will be available only if
<xref linkend="guc-log-hostname"> is set or if the user's hostname
needed to be looked up during <filename>pg_hba.conf</filename>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 287ad26..729c7b7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -56,6 +56,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/predicate.h"
#include "storage/procarray.h"
@@ -1071,6 +1072,14 @@ EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked the prepare, but still show as
+ * running in the procarray (twice!) and continue to hold locks.
+ */
+ SyncRepWaitForLSN(gxact->prepare_lsn);
+
records.tail = records.head = NULL;
}
@@ -2030,6 +2039,14 @@ RecordTransactionCommitPrepared(TransactionId xid,
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
@@ -2109,4 +2126,12 @@ RecordTransactionAbortPrepared(TransactionId xid,
TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4b40701..c8b582c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -37,6 +37,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -1055,7 +1056,7 @@ RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
- if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
+ if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
@@ -1125,6 +1126,14 @@ RecordTransactionCommit(void)
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c7f43af..3f7d7d9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -520,7 +520,9 @@ CREATE VIEW pg_stat_replication AS
W.sent_location,
W.write_location,
W.flush_location,
- W.replay_location
+ W.replay_location,
+ W.sync_priority,
+ W.sync_state
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7307c41..efc8e7c 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1527,6 +1527,13 @@ AutoVacWorkerMain(int argc, char *argv[])
SetConfigOption("statement_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
/*
+ * Force synchronous replication off to allow regular maintenance even
+ * if we are waiting for standbys to connect. This is important to
+ * ensure we aren't blocked from performing anti-wraparound tasks.
+ */
+ SetConfigOption("synchronous_replication", "off", PGC_SUSET, PGC_S_OVERRIDE);
+
+ /*
* Get the info about the database we're going to work on.
*/
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 42c6eaf..3fe490e 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
- repl_gram.o
+ repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
new file mode 100644
index 0000000..ac82ebb
--- /dev/null
+++ b/src/backend/replication/syncrep.c
@@ -0,0 +1,503 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the standby, or the wait hits timeout.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary defines which standbys
+ * it wishes to wait for. The standby is completely unaware of the
+ * durability requirements of transactions on the primary, reducing the
+ * complexity of the code and streamlining both standby operations and
+ * network bandwidth because there is no requirement to ship
+ * per-transaction state information.
+ *
+ * The bookkeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then in 9.1 we wait for the flush location on the
+ * standby before releasing the waiting backend. Further complexity
+ * in that interaction is expected in later releases.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching through all waiters each time we receive a reply.
+ *
+ * In 9.1 we support only a single synchronous standby, chosen from a
+ * priority list of synchronous_standby_names. Before it can become the
+ * synchronous standby it must have caught up with the primary; that may
+ * take some time. Once caught up, the current highest priority standby
+ * will release waiters from the queue.
+ *
+ * Portions Copyright (c) 2010-2011, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/xlog_internal.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/guc_tables.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+/* User-settable parameters for sync rep */
+bool sync_rep_mode = false; /* Only set in user backends */
+int sync_rep_timeout = 120; /* Only set in user backends */
+char *SyncRepStandbyNames;
+
+bool WaitingForSyncRep = false; /* Global state for some exit methods */
+
+static bool announce_next_takeover = true;
+
+static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN);
+static long SyncRepGetWaitTimeout(void);
+
+static int SyncRepGetStandbyPriority(void);
+static int SyncRepWakeQueue(void);
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+/*
+ * Wait for synchronous replication, if requested by user.
+ */
+void
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ /*
+ * Fast exit if user has not requested sync replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (!SyncRepRequested() || max_wal_senders == 0)
+ return;
+
+ /*
+ * Wait on queue. We check for a fast exit once we have the lock.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN);
+}
+
+void
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+ if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+ }
+
+ if (MyProc != NULL)
+ DisownLatch(&MyProc->waitLatch);
+}
+
+/*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+static void
+SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ TimestampTz wait_start = GetCurrentTransactionStopTimestamp();
+ long timeout = SyncRepGetWaitTimeout();
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+ bool wait_on_queue = false;
+
+ Assert(SHMQueueIsDetached(&(MyProc->syncrep_links)));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+ /*
+ * First time through, add ourselves to the queue.
+ */
+ if (SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ int i;
+
+ /*
+ * Wait no longer if we have already reached our LSN
+ */
+ if (XLByteLE(XactCommitLSN, walsndctl->lsn))
+ {
+ /* No need to wait */
+ LWLockRelease(SyncRepLock);
+ return;
+ }
+
+ /*
+ * Check that we have at least one sync standby active that
+ * has caught up with the primary.
+ */
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ walsnd->state == WALSNDSTATE_STREAMING)
+ {
+ wait_on_queue = true;
+ break;
+ }
+ }
+
+ /*
+ * Leave quickly if we don't have a sync standby that will
+ * confirm it has received our commit.
+ */
+ if (!wait_on_queue)
+ {
+ LWLockRelease(SyncRepLock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SHMQueueInsertBefore(&(WalSndCtl->SyncRepQueue), &(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+ WaitingForSyncRep = true;
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ if (update_process_title)
+ {
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 32 + 1);
+ memcpy(new_status, old_status, len);
+ sprintf(new_status + len, " waiting for %X/%X",
+ XactCommitLSN.xlogid, XactCommitLSN.xrecoff);
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting ..." */
+ }
+ }
+ else
+ {
+ bool release = false;
+ bool timed_out = false;
+ TimestampTz now = GetCurrentTimestamp();
+
+ /*
+ * Check the LSN on our queue and if it's moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, walsndctl->lsn))
+ release = true;
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(wait_start, now, timeout))
+ {
+ release = true;
+ timed_out = true;
+ }
+
+ if (release)
+ {
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+ WaitingForSyncRep = false;
+
+ /*
+ * Reset our waitLSN.
+ */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ /*
+ * Our response to the timeout is to simply post a NOTICE and
+ * then return to the user. The commit has happened, we just
+ * haven't been able to verify it has been replicated in the
+ * way requested.
+ */
+ if (timed_out)
+ ereport(NOTICE,
+ (errmsg("synchronous replication wait for %X/%X timeout at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(now))));
+ else
+ ereport(DEBUG3,
+ (errmsg("synchronous replication wait for %X/%X complete at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(now))));
+ return;
+ }
+
+ LWLockRelease(SyncRepLock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, timeout);
+ }
+}
+
+/*
+ * Return a value that we can use directly in WaitLatch(). We need to
+ * handle special values, plus convert from seconds to microseconds.
+ *
+ */
+static long
+SyncRepGetWaitTimeout(void)
+{
+ if (sync_rep_timeout == 0)
+ return -1L;
+
+ return 1000000L * sync_rep_timeout;
+}
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+/*
+ * Take any action required to initialise sync rep state from config
+ * data. Called at WALSender startup and after each SIGHUP.
+ */
+void
+SyncRepInitConfig(void)
+{
+ int priority;
+
+ /*
+ * Determine if we are a potential sync standby and remember the result
+ * for handling replies from standby.
+ */
+ priority = SyncRepGetStandbyPriority();
+ if (MyWalSnd->sync_standby_priority != priority)
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ MyWalSnd->sync_standby_priority = priority;
+ LWLockRelease(SyncRepLock);
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" now has synchronous standby priority %u",
+ application_name, priority)));
+ }
+}
+
+/*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and what
+ * perhaps also which information we store as well.
+ */
+void
+SyncRepReleaseWaiters(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile WalSnd *syncWalSnd = NULL;
+ int numprocs = 0;
+ int priority = 0;
+ int i;
+
+ /*
+ * If this WALSender is serving a standby that is not on the list of
+ * potential standbys then we have nothing to do. If we are still
+ * starting up or still running base backup, then leave quickly also.
+ */
+ if (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING)
+ return;
+
+ /*
+ * We're a potential sync standby. Release waiters if we are the
+ * highest priority standby.
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &walsndctl->walsnds[i];
+
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority > walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ syncWalSnd = walsnd;
+ }
+ }
+
+ /*
+ * We should have found ourselves at least.
+ */
+ Assert(syncWalSnd);
+
+ /*
+ * If we aren't managing the highest priority standby then just leave.
+ */
+ if (syncWalSnd != MyWalSnd)
+ {
+ LWLockRelease(SyncRepLock);
+ announce_next_takeover = true;
+ return;
+ }
+
+ if (XLByteLT(walsndctl->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ walsndctl->lsn = MyWalSnd->flush;
+ numprocs = SyncRepWakeQueue();
+ }
+
+ LWLockRelease(SyncRepLock);
+
+ elog(DEBUG3, "released %d procs up to %X/%X",
+ numprocs,
+ MyWalSnd->flush.xlogid,
+ MyWalSnd->flush.xrecoff);
+
+ /*
+ * If we are managing the highest priority standby, though we weren't
+ * prior to this, then announce we are now the sync standby.
+ */
+ if (announce_next_takeover)
+ {
+ announce_next_takeover = false;
+ ereport(LOG,
+ (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
+ application_name, MyWalSnd->sync_standby_priority)));
+ }
+}
+
+/*
+ * Check if we are in the list of sync standbys, and if so, determine
+ * priority sequence. Return priority if set, or zero to indicate that
+ * we are not a potential sync standby.
+ *
+ * Compare the parameter SyncRepStandbyNames against the application_name
+ * for this WALSender, or allow any name if we find a wildcard "*".
+ */
+static int
+SyncRepGetStandbyPriority(void)
+{
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ int priority = 0;
+ bool found = false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(SyncRepStandbyNames);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid list syntax for parameter \"synchronous_standby_names\"")));
+ return 0;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ priority++;
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return (found ? priority : 0);
+}
+
+/*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold SyncRepLock
+ */
+static int
+SyncRepWakeQueue(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ PGPROC *proc;
+ int numprocs = 0;
+
+ proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
+ &(WalSndCtl->SyncRepQueue),
+ offsetof(PGPROC, syncrep_links));
+
+ while (proc)
+ {
+ /*
+ * Assume the queue is ordered by LSN
+ */
+ if (XLByteLT(walsndctl->lsn, proc->waitLSN))
+ return numprocs;
+
+ numprocs++;
+ SetLatch(&(proc->waitLatch));
+ proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
+ &(proc->syncrep_links),
+ offsetof(PGPROC, syncrep_links));
+ }
+
+ return numprocs;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 49b49d2..5b871fe 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -66,7 +66,7 @@
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
-static WalSnd *MyWalSnd = NULL;
+WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
@@ -174,6 +174,8 @@ WalSenderMain(void)
SpinLockRelease(&walsnd->mutex);
}
+ SyncRepInitConfig();
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -584,6 +586,8 @@ ProcessStandbyReplyMessage(void)
walsnd->apply = reply.apply;
SpinLockRelease(&walsnd->mutex);
}
+
+ SyncRepReleaseWaiters();
}
/*
@@ -700,6 +704,7 @@ WalSndLoop(void)
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
+ SyncRepInitConfig();
}
/*
@@ -771,7 +776,12 @@ WalSndLoop(void)
* that point might wait for some time.
*/
if (MyWalSnd->state == WALSNDSTATE_CATCHUP && caughtup)
+ {
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" has now caught up with primary",
+ application_name)));
WalSndSetState(WALSNDSTATE_STREAMING);
+ }
ProcessRepliesIfAny();
}
@@ -1238,6 +1248,8 @@ WalSndShmemInit(void)
/* First time through, so initialize */
MemSet(WalSndCtl, 0, WalSndShmemSize());
+ SHMQueueInit(&(WalSndCtl->SyncRepQueue));
+
for (i = 0; i < max_wal_senders; i++)
{
WalSnd *walsnd = &WalSndCtl->walsnds[i];
@@ -1304,12 +1316,15 @@ WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 6
+#define PG_STAT_GET_WAL_SENDERS_COLS 8
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
MemoryContext per_query_ctx;
MemoryContext oldcontext;
+ int sync_priority[max_wal_senders];
+ int priority = 0;
+ int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
@@ -1337,6 +1352,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Get the priorities of sync standbys all in one go, to minimise
+ * lock acquisitions and to allow us to evaluate who is the current
+ * sync standby.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0 && walsnd->state == WALSNDSTATE_STREAMING)
+ {
+ sync_priority[i] = walsnd->sync_standby_priority;
+ if (walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority > walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ sync_standby = i;
+ }
+ }
+ }
+ LWLockRelease(SyncRepLock);
+
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
@@ -1370,11 +1410,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* Only superusers can see details. Other users only get
* the pid value to know it's a walsender, but no details.
*/
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
+ MemSet(&nulls[1], true, PG_STAT_GET_WAL_SENDERS_COLS - 1);
}
else
{
@@ -1401,6 +1437,19 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
snprintf(location, sizeof(location), "%X/%X",
apply.xlogid, apply.xrecoff);
values[5] = CStringGetTextDatum(location);
+
+ values[6] = Int32GetDatum(sync_priority[i]);
+
+ /*
+ * More easily understood version of standby state.
+ * This is purely informational, not different from priority.
+ */
+ if (sync_priority[i] == 0)
+ values[7] = CStringGetTextDatum("ASYNC");
+ else if (i == sync_standby)
+ values[7] = CStringGetTextDatum("SYNC");
+ else
+ values[7] = CStringGetTextDatum("POTENTIAL");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index afaf599..8c2660c 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -39,6 +39,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -196,6 +197,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -214,6 +216,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -224,6 +227,7 @@ InitProcGlobal(void)
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&procs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
@@ -326,6 +330,12 @@ InitProcess(void)
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -365,6 +375,7 @@ InitProcessPhase2(void)
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 39b7b5b..5d86deb 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2843,8 +2843,17 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
void
ProcessInterrupts(void)
{
- /* OK to accept interrupt now? */
- if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
+ /*
+ * OK to accept interrupt now?
+ *
+ * Normally this is very straightforward. We don't accept interrupts
+ * between HOLD_INTERRUPTS() and RESUME_INTERRUPTS().
+ *
+ * For SyncRep, we want to accept SIGTERM signals while other interrupts
+ * are held, so we have a special case solely when WaitingForSyncRep.
+ */
+ if ((InterruptHoldoffCount != 0 || CritSectionCount != 0) &&
+ !(WaitingForSyncRep && ProcDiePending))
return;
InterruptPending = false;
if (ProcDiePending)
@@ -2861,6 +2870,34 @@ ProcessInterrupts(void)
ereport(FATAL,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("terminating autovacuum process due to administrator command")));
+ else if (WaitingForSyncRep)
+ {
+ /*
+ * This must NOT be a FATAL message. We want the state of the
+ * transaction being aborted to be indeterminate to ensure that
+ * the transaction completion guarantee is never broken.
+ */
+ ereport(WARNING,
+ (errcode(ERRCODE_ADMIN_SHUTDOWN),
+ errmsg("terminating connection because fast shutdown is requested"),
+ errdetail("This connection requested synchronous replication at commit"
+ " yet confirmation of replication has not been received."
+ " The transaction has committed locally and might be committed"
+ " on recently disconnected standby servers also.")));
+
+ /*
+ * We DO NOT want to run proc_exit() callbacks -- we're here because
+ * we are shutting down and don't want any code to stall or
+ * prevent that.
+ */
+ on_exit_reset();
+
+ /*
+ * Note we do exit(0) not exit(>0). This is to avoid forcing
+ * postmaster into a system reset cycle.
+ */
+ exit(0);
+ }
else if (RecoveryConflictPending && RecoveryConflictRetryable)
{
pgstat_report_recovery_conflict(RecoveryConflictReason);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 529148a..2eb7c20 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -55,6 +55,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -754,6 +755,14 @@ static struct config_bool ConfigureNamesBool[] =
true, NULL, NULL
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_REPLICATION,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+ {
{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Continues processing past damaged page headers."),
gettext_noop("Detection of a damaged page header normally causes PostgreSQL to "
@@ -2161,6 +2170,16 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"sync_replication_timeout", PGC_USERSET, WAL_REPLICATION,
+ gettext_noop("Sets the maximum wait time for a response from synchronous replication."),
+ gettext_noop("A value of 0 turns off the timeout."),
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout,
+ 120, 0, INT_MAX, NULL, NULL
+ },
+
+ {
{"track_activity_query_size", PGC_POSTMASTER, RESOURCES_MEM,
gettext_noop("Sets the size reserved for pg_stat_activity.current_query, in bytes."),
NULL,
@@ -2717,6 +2736,16 @@ static struct config_string ConfigureNamesString[] =
},
{
+ {"synchronous_standby_names", PGC_SIGHUP, WAL_REPLICATION,
+ gettext_noop("List of potential standby names to synchronise with."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &SyncRepStandbyNames,
+ "*", NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6bfd0fd..81f3b08 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -184,7 +184,17 @@
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
-# - Streaming Replication -
+# - Replication - User Settings
+
+#synchronous_replication = off # does commit wait for reply from standby
+#sync_replication_timeout = 120 # 0 means wait forever
+
+# - Streaming Replication - Server Settings
+
+#synchronous_standby_names = '*' # standby servers that provide sync rep
+ # comma-separated list of application_name from standby(s);
+ # '*' = all (default)
+
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 96a4633..0533e5a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2542,7 +2542,7 @@ DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25}" "{o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,replay_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25,23,25}" "{o,o,o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index aa8cce5..c2552e7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -78,6 +78,9 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
/* in tcop/postgres.c */
extern void ProcessInterrupts(void);
+/* in replication/syncrep.c */
+extern bool WaitingForSyncRep;
+
#ifndef WIN32
#define CHECK_FOR_INTERRUPTS() \
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
new file mode 100644
index 0000000..d788fe5
--- /dev/null
+++ b/src/include/replication/syncrep.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2011, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _SYNCREP_H
+#define _SYNCREP_H
+
+#include "access/xlog.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+#define SyncRepRequested() (sync_rep_mode)
+
+/* user-settable parameters for synchronous replication */
+extern bool sync_rep_mode;
+extern int sync_rep_timeout;
+extern char *SyncRepStandbyNames;
+
+/* called by user backend */
+extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* callback at backend exit */
+extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+/* called by wal sender */
+extern void SyncRepInitConfig(void);
+extern void SyncRepReleaseWaiters(void);
+
+#endif /* _SYNCREP_H */
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 5843307..8a8c939 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -15,6 +15,7 @@
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+#include "replication/syncrep.h"
#include "storage/spin.h"
@@ -52,11 +53,32 @@ typedef struct WalSnd
* to do.
*/
Latch latch;
+
+ /*
+ * The priority order of the standby managed by this WALSender, as
+ * listed in synchronous_standby_names, or 0 if not-listed.
+ * Protected by SyncRepLock.
+ */
+ int sync_standby_priority;
} WalSnd;
+extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Synchronous replication queue. Protected by SyncRepLock.
+ */
+ SHM_QUEUE SyncRepQueue;
+
+ /*
+ * Current location of the head of the queue. All waiters should have
+ * a waitLSN that follows this value, or they are currently being woken
+ * to remove themselves from the queue. Protected by SyncRepLock.
+ */
+ XLogRecPtr lsn;
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ad0bcd7..438a48d 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -78,6 +78,7 @@ typedef enum LWLockId
SerializableFinishedListLock,
SerializablePredicateLockListLock,
OldSerXidLock,
+ SyncRepLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 78dbade..091b213 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,8 @@
#ifndef _PROC_H_
#define _PROC_H_
+#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
@@ -115,6 +117,12 @@ struct PGPROC
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+
+ SHM_QUEUE syncrep_links; /* list link if process is in syncrep list */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 02043ab..20cdc39 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1298,7 +1298,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts, pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
- pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_hostname, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.replay_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, replay_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
+ pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_hostname, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.replay_location, w.sync_priority, w.sync_state FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Almost-working patch attached for the above feature. Time to stop for
the day. Patch against current repo version.
Current repo version attached here also (v20), which includes all fixes
to all known technical issues, major polishing etc.
Thanks for the patch. Now the code about the wait list looks much
simpler than before! Here are the comments:
@@ -1337,6 +1352,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
<snip>
+ if (walsnd->pid != 0 && walsnd->state == WALSNDSTATE_STREAMING)
+ {
+ sync_priority[i] = walsnd->sync_standby_priority;
This always reports the priority of walsender in CATCHUP state as 0.
I don't think that priority needs to be reported as 0.
When a new standby with the same priority as the current sync standby
connects, the new standby can take over as the sync standby even though
the current one is still running. This happens when the WalSnd slot index
used by the new standby is lower than the one used by the current standby.
People don't expect such an unexpected switchover, I think.
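For reference, the behaviour comes from the priority scan in
SyncRepReleaseWaiters() (a trimmed sketch below); because the comparison
is strict, a tie on priority is won by whichever standby happens to
occupy the lower-numbered WalSnd slot:

    /* pick the highest priority (lowest number); on a tie the first slot found wins */
    for (i = 0; i < max_wal_senders; i++)
    {
        volatile WalSnd *walsnd = &walsndctl->walsnds[i];

        if (walsnd->pid != 0 &&
            walsnd->sync_standby_priority > 0 &&
            (priority == 0 ||
             priority > walsnd->sync_standby_priority))
        {
            priority = walsnd->sync_standby_priority;
            syncWalSnd = walsnd;
        }
    }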
+ /*
+ * Assume the queue is ordered by LSN
+ */
+ if (XLByteLT(walsndctl->lsn, proc->waitLSN))
+ return numprocs;
The code to ensure the assumption needs to be added.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, 2011-03-05 at 16:13 +0900, Fujii Masao wrote:
On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Almost-working patch attached for the above feature. Time to stop for
the day. Patch against current repo version.
Current repo version attached here also (v20), which includes all fixes
to all known technical issues, major polishing etc.
Thanks for the patch. Now the code about the wait list looks much
simpler than before! Here are the comments:
@@ -1337,6 +1352,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
<snip>
+ if (walsnd->pid != 0 && walsnd->state == WALSNDSTATE_STREAMING)
+ {
+     sync_priority[i] = walsnd->sync_standby_priority;
This always reports the priority of walsender in CATCHUP state as 0.
I don't think that priority needs to be reported as 0.
Cosmetic change. We can do this, yes.
When a new standby with the same priority as the current sync standby
connects, the new standby can take over as the sync standby even though
the current one is still running. This happens when the WalSnd slot index
used by the new standby is lower than the one used by the current standby.
People don't expect such an unexpected switchover, I think.
It is documented that the selection of standby from a set of similar
priorities is indeterminate. Users don't like it, they can change it.
+ /*
+  * Assume the queue is ordered by LSN
+  */
+ if (XLByteLT(walsndctl->lsn, proc->waitLSN))
+     return numprocs;
The code to ensure the assumption needs to be added.
Yes, just need to add the code for traversing list backwards.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Yes, that can happen. As people will no doubt observe, this seems to be
an argument for wait-forever. What we actually need is a wait that lasts
longer than it takes for us to decide to failover, if the standby is
actually up and this is some kind of split brain situation. That way the
clients are still waiting when failover occurs. WAL is missing, but
since we didn't acknowledge the client we are OK to treat that situation
as if it were an abort.
Oracle Data Guard in the maximum availability mode behaves that way?
I'm sure that you are implementing something like the maximum availability
mode rather than the maximum protection one. So I'd like to know how
the data loss situation I described can be avoided in the maximum availability
mode.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, 2011-03-05 at 11:04 +0000, Simon Riggs wrote:
+ /*
+  * Assume the queue is ordered by LSN
+  */
+ if (XLByteLT(walsndctl->lsn, proc->waitLSN))
+     return numprocs;
The code to ensure the assumption needs to be added.
Yes, just need to add the code for traversing list backwards.
I've added code to shmqueue.c to allow this.
New version pushed.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
It is documented that the selection of standby from a set of similar
priorities is indeterminate. Users don't like it, they can change it.
That doesn't seem like a good argument to *change* the synchronous
standby once it's already set.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, 2011-03-05 at 07:24 -0500, Robert Haas wrote:
On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
It is documented that the selection of standby from a set of similar
priorities is indeterminate. Users don't like it, they can change it.
That doesn't seem like a good argument to *change* the synchronous
standby once it's already set.
If the order is arbitrary, why does it matter if it changes?
The user has the power to specify a sequence, yet they have not done so.
They are told the results are indeterminate, which is accurate. I can
add the words "and may change as new standbys connect" if that helps.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, 2011-03-05 at 07:24 -0500, Robert Haas wrote:
On Sat, Mar 5, 2011 at 6:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
It is documented that the selection of standby from a set of similar
priorities is indeterminate. Users don't like it, they can change it.
That doesn't seem like a good argument to *change* the synchronous
standby once it's already set.
If the order is arbitrary, why does it matter if it changes?
The user has the power to specify a sequence, yet they have not done so.
They are told the results are indeterminate, which is accurate. I can
add the words "and may change as new standbys connect" if that helps.
I just don't think that's very useful behavior. Suppose I have a
master and two standbys. Both are local (or both are remote with
equally good connectivity). When one of the standbys goes down, there
will be a hiccup (i.e. transactions will block trying to commit) until
that guy falls off and the other one takes over. Now, when he comes
back up again, I don't want the synchronous standby to change again;
that seems like a recipe for another hiccup. I think "who the current
synchronous standby is" should act as a tiebreak.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 5, 2011 at 2:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
If the order is arbitrary, why does it matter if it changes?
The user has the power to specify a sequence, yet they have not done so.
They are told the results are indeterminate, which is accurate. I can
add the words "and may change as new standbys connect" if that helps.I just don't think that's very useful behavior. Suppose I have a
master and two standbys. Both are local (or both are remote with
equally good connectivity). When one of the standbys goes down, there
will be a hiccup (i.e. transactions will block trying to commit) until
that guy falls off and the other one takes over. Now, when he comes
back up again, I don't want the synchronous standby to change again;
that seems like a recipe for another hiccup. I think "who the current
synchronous standby is" should act as a tiebreak.
+1
TLDR part:
The first one might be noticed by users because it takes tens of seconds
before the sync switch. The second hiccup is hardly noticeable. However
limiting the # switches of sync standby to the absolute minimum is also good
if e.g. (if there was a hook for it) cluster middleware is notified of the
sync replica change. That might either introduce a race condition or even be
completely unreliable if the notification is sent asynchronously, or it might
introduce a longer lag if the master waits for confirmation of the sync
replica change message. At that point sync replica changes become more
expensive than they are currently.
regards,
Yeb Havinga
On Sat, 2011-03-05 at 14:44 +0100, Yeb Havinga wrote:
On Sat, Mar 5, 2011 at 2:05 PM, Robert Haas <robertmhaas@gmail.com>
wrote:
On Sat, Mar 5, 2011 at 7:49 AM, Simon Riggs
<simon@2ndquadrant.com> wrote:
If the order is arbitrary, why does it matter if it changes?
The user has the power to specify a sequence, yet they have
not done so.
They are told the results are indeterminate, which is
accurate. I can
add the words "and may change as new standbys connect" if
that helps.
I just don't think that's very useful behavior. Suppose I
have a
master and two standbys. Both are local (or both are remote
with
equally good connectivity). When one of the standbys goes
down, there
will be a hiccup (i.e. transactions will block trying to
commit) until
that guy falls off and the other one takes over. Now, when he
comes
back up again, I don't want the synchronous standby to change
again;
that seems like a recipe for another hiccup. I think "who the
current
synchronous standby is" should act as a tiebreak.+1
TLDR part:
The first one might be noticed by users because it takes tens of
seconds before the sync switch. The second hiccup is hardly noticeable.
However limiting the # switches of sync standby to the absolute
minimum is also good if e.g. (if there was a hook for it) cluster
middleware is notified of the sync replica change. That might either
introduce a race condition or even be completely unreliable if the
notification is sent asynchronously, or it might introduce a longer lag if the
master waits for confirmation of the sync replica change message. At
that point sync replica changes become more expensive than they are
currently.
I'm not in favour.
If the user has a preferred order, they can specify it. If there is no
preferred order, how will we maintain that order?
What are the rules for maintaining this arbitrary order?
No, this is not something we need prior to commit, if ever.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote:
On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Yes, that can happen. As people will no doubt observe, this seems to be
an argument for wait-forever. What we actually need is a wait that lasts
longer than it takes for us to decide to failover, if the standby is
actually up and this is some kind of split brain situation. That way the
clients are still waiting when failover occurs. WAL is missing, but
since we didn't acknowledge the client we are OK to treat that situation
as if it were an abort.
Oracle Data Guard in the maximum availability mode behaves that way?
I'm sure that you are implementing something like the maximum availability
mode rather than the maximum protection one. So I'd like to know how
the data loss situation I described can be avoided in the maximum availability
mode.
This is important, so I am taking time to formulate a full reply.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Mar 5, 2011 at 9:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
I've added code to shmqueue.c to allow this.
New version pushed.
New comments:
It looks odd to report the sync_state of walsender in BACKUP
state as ASYNC.
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+ if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+ }
+
+ if (MyProc != NULL)
+ DisownLatch(&MyProc->waitLatch);
Can MyProc really be NULL here? If yes, "MyProc != NULL" should be
checked before seeing MyProc->syncrep_links.
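Something like this reordering of that function would avoid touching
MyProc before the NULL check (just a sketch of the suggestion):

    void
    SyncRepCleanupAtProcExit(int code, Datum arg)
    {
        /* bail out before dereferencing MyProc->syncrep_links */
        if (MyProc == NULL)
            return;

        if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links)))
        {
            LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
            SHMQueueDelete(&(MyProc->syncrep_links));
            LWLockRelease(SyncRepLock);
        }

        DisownLatch(&MyProc->waitLatch);
    }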
Even if the postmaster dies, the waiting backend keeps waiting until
the timeout expires. Instead, the backends should periodically check
whether the postmaster is alive, and exit immediately if it is not,
as other processes do. If the timeout is disabled, such backends
would get stuck forever.
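One possible shape for that (only a sketch; the 10 second cap is
arbitrary and it assumes PostmasterIsAlive() is usable from a waiting
backend):

    for (;;)
    {
        ResetLatch(&MyProc->waitLatch);

        /* ... return here once walsndctl->lsn has passed our waitLSN ... */

        /* cap the sleep so postmaster death is noticed even with no timeout */
        WaitLatch(&MyProc->waitLatch, 10000000L);

        if (!PostmasterIsAlive(true))
            proc_exit(1);
    }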
Though I commented about the issue related to shutdown, that was
pointless. So the change to ProcessInterrupts is not required unless we
find the need again. Sorry for the noise.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I'm not in favour.
If the user has a preferred order, they can specify it. If there is no
preferred order, how will we maintain that order?
What are the rules for maintaining this arbitrary order?
Probably what Robert, Yeb and I think is to leave the current
sync standby in sync mode until either its connection is closed
or higher priority standby connects. No complicated rule is
required.
To do that, how about tracking which standby is currently in
sync mode? Each walsender checks whether its priority is
higher than that of current sync one, and if yes, it takes over.
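A rough sketch of that idea (sync_standby_slot is a hypothetical new
field in WalSndCtl, it is not in the patch), evaluated by each walsender
while holding SyncRepLock:

    /*
     * Hypothetical: take over the sync role only if there is no live
     * sync standby, or our priority value is strictly better (lower)
     * than the current one's.
     */
    if (MyWalSnd->sync_standby_priority > 0)
    {
        int cur = WalSndCtl->sync_standby_slot;   /* -1 when unset */

        if (cur < 0 ||
            WalSndCtl->walsnds[cur].pid == 0 ||
            MyWalSnd->sync_standby_priority <
            WalSndCtl->walsnds[cur].sync_standby_priority)
            WalSndCtl->sync_standby_slot = MyWalSnd - WalSndCtl->walsnds;
    }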
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mar 5, 2011, at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I'm not in favour.
If the user has a preferred order, they can specify it. If there is no
preferred order, how will we maintain that order?
What are the rules for maintaining this arbitrary order?
Probably what Robert, Yeb and I think is to leave the current
sync standby in sync mode until either its connection is closed
or higher priority standby connects. No complicated rule is
required.
To do that, how about tracking which standby is currently in
sync mode? Each walsender checks whether its priority is
higher than that of current sync one, and if yes, it takes over.
That is precisely what I would expect to happen, and IMHO quite useful.
...Robert
On 05/03/2011 11:18, "Fujii Masao" <masao.fujii@gmail.com> wrote:
On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com>
wrote:
I'm not in favour.
If the user has a preferred order, they can specify it. If there is no
preferred order, how will we maintain that order?
What are the rules for maintaining this arbitrary order?
Probably what Robert, Yeb and I think is to leave the current
sync standby in sync mode until either its connection is closed
or higher priority standby connects. No complicated rule is
required.
Wouldn't it be better to remove the code that manages * in synchronous_standby_names?
Once we do that there is no chance of having 2 standbys with the same
priority.
After all, most of the time the DBA will need to change the * to a real
list of names anyway. At least IMHO.
--
Jaime Casanova www.2ndQuadrant.com
On Sun, 2011-03-06 at 01:17 +0900, Fujii Masao wrote:
On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I'm not in favour.
If the user has a preferred order, they can specify it. If there is no
preferred order, how will we maintain that order?
What are the rules for maintaining this arbitrary order?
Probably what Robert, Yeb and I think is to leave the current
sync standby in sync mode until either its connection is closed
or higher priority standby connects. No complicated rule is
required.
No, it is complex. The code is intentionally stateless, so unless you
have a rule you cannot work out which one is the sync standby.
It is much more important to have robust takeover behaviour.
Changing this will require rethinking how that takeover works. And I'm
not doing that for something that is documented as "indeterminate".
If you care about the sequence then set the supplied parameter, which I
have gone to some trouble to provide.
To do that, how about tracking which standby is currently in
sync mode? Each walsender checks whether its priority is
higher than that of current sync one, and if yes, it takes over.
Regards,
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-03-05 at 11:42 -0500, Jaime Casanova wrote:
On 05/03/2011 11:18, "Fujii Masao" <masao.fujii@gmail.com> wrote:
On Sun, Mar 6, 2011 at 12:07 AM, Simon Riggs <simon@2ndquadrant.com>
wrote:
I'm not in favour.
If the user has a preferred order, they can specify it. If there
is no
preferred order, how will we maintain that order?
What are the rules for maintaining this arbitrary order?
Probably what Robert, Yeb and I think is to leave the current
sync standby in sync mode until either its connection is closed
or higher priority standby connects. No complicated rule is
required.
It's not better to remove the code to manage * in
synchronous_standby_names? Once we do that there is no chance of
having 2 standbys with the same priority.
Yes, there is, because we don't do duplicate name checking.
I've changed the default so it is no longer "*", to avoid
complaints.
After all, most of the times the dba will need to change the * for a
real list of names anyway. At least in IMHO
--
Jaime Casanova www.2ndQuadrant.com
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote:
On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Yes, that can happen. As people will no doubt observe, this seems to be
an argument for wait-forever. What we actually need is a wait that lasts
longer than it takes for us to decide to failover, if the standby is
actually up and this is some kind of split brain situation. That way the
clients are still waiting when failover occurs. WAL is missing, but
since we didn't acknowledge the client we are OK to treat that situation
as if it were an abort.
Oracle Data Guard in the maximum availability mode behaves that way?
I'm sure that you are implementing something like the maximum availability
mode rather than the maximum protection one. So I'd like to know how
the data loss situation I described can be avoided in the maximum availability
mode.
It can't. (Oracle or otherwise...)
Once we begin waiting for sync rep, if the transaction or backend ends
then other backends will be able to see the changed data. The only way
to prevent that is to shut down the database to ensure that no readers or
writers have access to it.
Oracle's protection mechanism is to shut down the primary if there is no
sync standby available. Maximum Protection. Any other mode must
therefore be less than maximum protection, according to Oracle, and me.
"Available" here means one that has not timed out, via parameter.
Shutting down the main server is cool, as long as you fail over to one of
the standbys. If there aren't any standbys, or you don't have a
mechanism for switching quickly, you have availability problems.
What shutting down the server doesn't do is keep the data safe for
transactions that were in their commit-wait phase when the disconnect
occurs. That data exists, yet will not have been transferred to the
standby.
From now on, I also say we should wait forever. It is the safest mode and I
want no argument about whether sync rep is safe or not. We can introduce
a more relaxed mode later with high availability for the primary. That
is possible and in some cases desirable.
Now, when we lose last sync standby we have three choices:
1. reconnect the standby, or wait for a potential standby to catchup
2. immediate shutdown of master and failover to one of the standbys
3. reclassify an async standby as a sync standby
More than likely we would attempt to do (1) for a while, then do (2).
This means that when we start up, the primary will freeze for a while
until the sync standbys connect. Which is OK, since they try to
reconnect every 5 seconds and we don't plan on shutting down the primary
much anyway.
I've removed the timeout parameter, plus if we begin waiting we wait
until released, forever if that's how long it takes.
The recommendation to use more than one standby remains.
Fast shutdown will wake backends from their latch and there isn't any
changed interrupt behaviour any more.
synchronous_standby_names = '*' is no longer the default
On a positive note this is one less parameter and will improve
performance as well.
All above changes made.
Ready to commit, barring concrete objections to important behaviour.
I will do one final check tomorrow evening then commit.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
sync_rep.v21.context.patchtext/x-patch; charset=UTF-8; name=sync_rep.v21.context.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2018,2023 **** SET ENABLE_SEQSCAN TO OFF;
--- 2018,2109 ----
</variablelist>
</sect2>
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until a
+ reply from the current synchronous standby indicates it has received
+ the commit record of the transaction. Synchronous standbys must
+ already have been defined (see <xref linkend="guc-sync-standby-names">).
+ </para>
+ <para>
+ This parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-sync-standby-names" xreflabel="synchronous_standby_names">
+ <term><varname>synchronous_standby_names</varname> (<type>string</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_standby_names</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies a priority ordered list of standby names that can offer
+ synchronous replication. At any one time there will be just one
+ synchronous standby that will wake sleeping users following commit.
+ The synchronous standby will be the first named standby that is
+ both currently connected and streaming data in real-time
+ (as shown by a state of "STREAMING"). Other standby servers
+ listed later will become potential synchronous standbys.
+ If the current synchronous standby disconnects for whatever reason
+ it will be replaced immediately with the next highest priority standby.
+ Specifying more than one standby name can allow very high availability.
+ </para>
+ <para>
+ The standby name is currently taken as the application_name of the
+ standby, as set in the primary_conninfo on the standby. Names are
+ not enforced for uniqueness. In case of duplicates one of the standbys
+ will be chosen to be the synchronous standby, though exactly which
+ one is indeterminate.
+ </para>
+ <para>
+ The default is the special entry <literal>*</> which matches any
+ application_name, including the default application name of
+ <literal>walsender</>. This is not recommended and a more carefully
+ thought through configuration will be desirable.
+ </para>
+ <para>
+ If a standby is removed from the list of servers then it will stop
+ being the synchronous standby, allowing another to take its place.
+ If the list is empty, synchronous replication will not be
+ possible, whatever the setting of <varname>synchronous_replication</>.
+ Standbys may also be added to the list without restarting the server.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-standby">
<title>Standby Servers</title>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 875,880 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 875,1083 ----
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to one synchronous standby
+ server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ When requesting synchronous replication, each commit of a
+ write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability, though only
+ if the sysadmin is cautious about the placement and management of the two
+ servers. Waiting for confirmation increases the user's confidence that the
+ changes will not be lost in the event of server crashes but it also
+ necessarily increases the response time for the requesting transaction.
+ The minimum wait time is the roundtrip time between primary and standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message. All two-phase commit actions
+ require commit waits, including both prepare and commit.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ All parameters have useful default values, so we can enable
+ synchronous replication easily just by setting this on the primary
+
+ <programlisting>
+ synchronous_replication = on
+ </programlisting>
+
+ When <varname>synchronous_replication</> is set, a commit will wait
+ for confirmation that the standby has received the commit record,
+ even if that takes a very long time.
+ <varname>synchronous_replication</> can be set by individual
+ users, so can be configured in the configuration file, for particular
+ users or databases, or dynamically by applications programs.
+ </para>
+
+ <para>
+ After a commit record has been written to disk on the primary the
+ WAL record is then sent to the standby. The standby sends reply
+ messages each time a new batch of WAL data is received, unless
+ <varname>wal_receiver_status_interval</> is set to zero on the standby.
+ If the standby is the first matching standby, as specified in
+ <varname>synchronous_standby_names</> on the primary, the reply
+ messages from that standby will be used to wake users waiting for
+ confirmation the commit record has been received. These parameters
+ allow the administrator to specify which standby servers should be
+ synchronous standbys. Note that the configuration of synchronous
+ replication is mainly on the master.
+ </para>
+
+ <para>
+ Users will stop waiting if a fast shutdown is requested, though the
+ server does not fully shut down until all outstanding WAL records are
+ transferred to standby servers.
+ </para>
+
+ <para>
+ Note also that <varname>synchronous_commit</> is used when the user
+ specifies <varname>synchronous_replication</>, overriding even an
+ explicit setting of <varname>synchronous_commit</> to <literal>off</>.
+ This is because we must write WAL to disk on primary before we replicate
+ to ensure the standby never gets ahead of the primary.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until
+ the sync standby responds. The response may never occur if the last,
+ or only, standby should crash.
+ </para>
+
+ <para>
+ The best solution for avoiding data loss is to ensure you don't lose
+ your last remaining sync standby. This can be achieved by naming multiple
+ potential synchronous standbys using <varname>synchronous_standby_names</>.
+ The first named standby will be used as the synchronous standby. Standbys
+ listed after this will take over the role of synchronous standby if the
+ first one should fail.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it will not yet be properly
+ synchronized. This is described as <literal>CATCHUP</> mode. Once
+ the lag between standby and primary reaches zero for the first time
+ we move to real-time <literal>STREAMING</> state.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shut down, then the catch-up period
+ will increase according to the length of time the standby has been down.
+ The standby is only able to become a synchronous standby
+ once it has reached <literal>STREAMING</> state.
+ </para>
+
+ <para>
+ If the primary restarts while commits are waiting for acknowledgement, those
+ waiting transactions will be marked fully committed once the primary
+ database recovers.
+ There is no way to be certain that all standbys have received all
+ outstanding WAL data at time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby.
+ </para>
+
+ <para>
+ If you really do lose your last standby server then you should disable
+ <varname>synchronous_standby_names</> and restart the primary server.
+ </para>
+
+ <para>
+ If the primary is isolated from the remaining standby servers you should
+ fail over to the best candidate of those remaining standby servers.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that the commands to run pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
*** a/doc/src/sgml/monitoring.sgml
--- b/doc/src/sgml/monitoring.sgml
***************
*** 306,313 **** postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
location. In addition, the standby reports the last transaction log
position it received and wrote, the last position it flushed to disk,
and the last position it replayed, and this information is also
! displayed here. The columns detailing what exactly the connection is
! doing are only visible if the user examining the view is a superuser.
The client's hostname will be available only if
<xref linkend="guc-log-hostname"> is set or if the user's hostname
needed to be looked up during <filename>pg_hba.conf</filename>
--- 306,316 ----
location. In addition, the standby reports the last transaction log
position it received and wrote, the last position it flushed to disk,
and the last position it replayed, and this information is also
! displayed here. If the standby's application name matches one of the
! settings in <varname>synchronous_standby_names</> then the sync_priority
! is also shown here; that is, the order in which standbys will become
! the synchronous standby. The columns detailing what exactly the connection
! is doing are only visible if the user examining the view is a superuser.
The client's hostname will be available only if
<xref linkend="guc-log-hostname"> is set or if the user's hostname
needed to be looked up during <filename>pg_hba.conf</filename>
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 56,61 ****
--- 56,62 ----
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/predicate.h"
#include "storage/procarray.h"
***************
*** 1071,1076 **** EndPrepare(GlobalTransaction gxact)
--- 1072,1085 ----
END_CRIT_SECTION();
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked the prepare, but still show as
+ * running in the procarray (twice!) and continue to hold locks.
+ */
+ SyncRepWaitForLSN(gxact->prepare_lsn);
+
records.tail = records.head = NULL;
}
***************
*** 2030,2035 **** RecordTransactionCommitPrepared(TransactionId xid,
--- 2039,2052 ----
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
***************
*** 2109,2112 **** RecordTransactionAbortPrepared(TransactionId xid,
--- 2126,2137 ----
TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 37,42 ****
--- 37,43 ----
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
***************
*** 1055,1061 **** RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
! if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
{
/*
* Synchronous commit case:
--- 1056,1062 ----
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
! if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
***************
*** 1125,1130 **** RecordTransactionCommit(void)
--- 1126,1139 ----
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***************
*** 520,526 **** CREATE VIEW pg_stat_replication AS
W.sent_location,
W.write_location,
W.flush_location,
! W.replay_location
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
--- 520,528 ----
W.sent_location,
W.write_location,
W.flush_location,
! W.replay_location,
! W.sync_priority,
! W.sync_state
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
***************
*** 1527,1532 **** AutoVacWorkerMain(int argc, char *argv[])
--- 1527,1539 ----
SetConfigOption("statement_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
/*
+ * Force synchronous replication off to allow regular maintenance even
+ * if we are waiting for standbys to connect. This is important to
+ * ensure we aren't blocked from performing anti-wraparound tasks.
+ */
+ SetConfigOption("synchronous_replication", "off", PGC_SUSET, PGC_S_OVERRIDE);
+
+ /*
* Get the info about the database we're going to work on.
*/
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
*** a/src/backend/replication/Makefile
--- b/src/backend/replication/Makefile
***************
*** 13,19 **** top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
! repl_gram.o
include $(top_srcdir)/src/backend/common.mk
--- 13,19 ----
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
! repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/replication/syncrep.c
***************
*** 0 ****
--- 1,460 ----
+ /*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the sync standby.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary defines which standbys
+ * it wishes to wait for. The standby is completely unaware of the
+ * durability requirements of transactions on the primary, reducing the
+ * complexity of the code and streamlining both standby operations and
+ * network bandwidth because there is no requirement to ship
+ * per-transaction state information.
+ *
+ * The bookkeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then in 9.1 we wait for the flush location on the
+ * standby before releasing the waiting backend. Further complexity
+ * in that interaction is expected in later releases.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching through all waiters each time we receive a reply.
+ *
+ * In 9.1 we support only a single synchronous standby, chosen from a
+ * priority list of synchronous_standby_names. Before it can become the
+ * synchronous standby it must have caught up with the primary; that may
+ * take some time. Once caught up, the current highest priority standby
+ * will release waiters from the queue.
+ *
+ * Portions Copyright (c) 2010-2011, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include <unistd.h>
+
+ #include "access/xact.h"
+ #include "access/xlog_internal.h"
+ #include "miscadmin.h"
+ #include "postmaster/autovacuum.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
+ #include "storage/latch.h"
+ #include "storage/ipc.h"
+ #include "storage/pmsignal.h"
+ #include "storage/proc.h"
+ #include "utils/builtins.h"
+ #include "utils/guc.h"
+ #include "utils/guc_tables.h"
+ #include "utils/memutils.h"
+ #include "utils/ps_status.h"
+
+ /* User-settable parameters for sync rep */
+ bool sync_rep_mode = false; /* Only set in user backends */
+ char *SyncRepStandbyNames;
+
+ static bool announce_next_takeover = true;
+
+ static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN);
+ static void SyncRepQueueInsert(void);
+
+ static int SyncRepGetStandbyPriority(void);
+ static int SyncRepWakeQueue(void);
+
+ /*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+ /*
+ * Wait for synchronous replication, if requested by user.
+ */
+ void
+ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+ {
+ /*
+ * Fast exit if user has not requested sync replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (!SyncRepRequested() || max_wal_senders == 0)
+ return;
+
+ /*
+ * Wait on queue. We check for a fast exit once we have the lock.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN);
+ }
+
+ void
+ SyncRepCleanupAtProcExit(int code, Datum arg)
+ {
+ if (!SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+ }
+
+ if (MyProc != NULL)
+ DisownLatch(&MyProc->waitLatch);
+ }
+
+ /*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+ static void
+ SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
+ {
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+
+ Assert(SHMQueueIsDetached(&(MyProc->syncrep_links)));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+ /*
+ * First time through, add ourselves to the queue.
+ */
+ if (SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ /*
+ * Wait no longer if we have already reached our LSN
+ */
+ if (XLByteLE(XactCommitLSN, walsndctl->lsn))
+ {
+ /* No need to wait */
+ LWLockRelease(SyncRepLock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SyncRepQueueInsert();
+ LWLockRelease(SyncRepLock);
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ if (update_process_title)
+ {
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 32 + 1);
+ memcpy(new_status, old_status, len);
+ sprintf(new_status + len, " waiting for %X/%X",
+ XactCommitLSN.xlogid, XactCommitLSN.xrecoff);
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting ..." */
+ }
+ }
+ else
+ {
+ /*
+ * Check the LSN on our queue and if it's moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, walsndctl->lsn))
+ {
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+
+ /*
+ * Reset our waitLSN.
+ */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ ereport(DEBUG3,
+ (errmsg("synchronous replication wait for %X/%X complete at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(GetCurrentTimestamp()))));
+ return;
+ }
+
+ LWLockRelease(SyncRepLock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, -1);
+ }
+ }
+
+ /*
+ * Insert MyProc into SyncRepQueue, maintaining sorted invariant.
+ *
+ * Usually we will go at the tail of the queue, though it's possible that we arrive
+ * here out of order, so start at tail and work back to insertion point.
+ */
+ static void
+ SyncRepQueueInsert(void)
+ {
+ PGPROC *proc;
+
+ proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
+ &(WalSndCtl->SyncRepQueue),
+ offsetof(PGPROC, syncrep_links));
+
+ while (proc)
+ {
+ /*
+ * Stop at the queue element that we should after to
+ * ensure the queue is ordered by LSN.
+ */
+ if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
+ break;
+
+ proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
+ &(proc->syncrep_links),
+ offsetof(PGPROC, syncrep_links));
+ }
+
+ if (proc)
+ SHMQueueInsertAfter(&(proc->syncrep_links), &(MyProc->syncrep_links));
+ else
+ SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue), &(MyProc->syncrep_links));
+ }
+
+ /*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+ /*
+ * Take any action required to initialise sync rep state from config
+ * data. Called at WALSender startup and after each SIGHUP.
+ */
+ void
+ SyncRepInitConfig(void)
+ {
+ int priority;
+
+ /*
+ * Determine if we are a potential sync standby and remember the result
+ * for handling replies from standby.
+ */
+ priority = SyncRepGetStandbyPriority();
+ if (MyWalSnd->sync_standby_priority != priority)
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ MyWalSnd->sync_standby_priority = priority;
+ LWLockRelease(SyncRepLock);
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" now has synchronous standby priority %u",
+ application_name, priority)));
+ }
+ }
+
+ /*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and
+ * perhaps also which information we store as well.
+ */
+ void
+ SyncRepReleaseWaiters(void)
+ {
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile WalSnd *syncWalSnd = NULL;
+ int numprocs = 0;
+ int priority = 0;
+ int i;
+
+ /*
+ * If this WALSender is serving a standby that is not on the list of
+ * potential standbys then we have nothing to do. If we are still
+ * starting up or still running base backup, then leave quickly also.
+ */
+ if (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING)
+ return;
+
+ /*
+ * We're a potential sync standby. Release waiters if we are the
+ * highest priority standby.
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &walsndctl->walsnds[i];
+
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority > walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ syncWalSnd = walsnd;
+ }
+ }
+
+ /*
+ * We should have found ourselves at least.
+ */
+ Assert(syncWalSnd);
+
+ /*
+ * If we aren't managing the highest priority standby then just leave.
+ */
+ if (syncWalSnd != MyWalSnd)
+ {
+ LWLockRelease(SyncRepLock);
+ announce_next_takeover = true;
+ return;
+ }
+
+ if (XLByteLT(walsndctl->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ walsndctl->lsn = MyWalSnd->flush;
+ numprocs = SyncRepWakeQueue();
+ }
+
+ LWLockRelease(SyncRepLock);
+
+ elog(DEBUG3, "released %d procs up to %X/%X",
+ numprocs,
+ MyWalSnd->flush.xlogid,
+ MyWalSnd->flush.xrecoff);
+
+ /*
+ * If we are managing the highest priority standby, though we weren't
+ * prior to this, then announce we are now the sync standby.
+ */
+ if (announce_next_takeover)
+ {
+ announce_next_takeover = false;
+ ereport(LOG,
+ (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
+ application_name, MyWalSnd->sync_standby_priority)));
+ }
+ }
+
+ /*
+ * Check if we are in the list of sync standbys, and if so, determine
+ * priority sequence. Return priority if set, or zero to indicate that
+ * we are not a potential sync standby.
+ *
+ * Compare the parameter SyncRepStandbyNames against the application_name
+ * for this WALSender, or allow any name if we find a wildcard "*".
+ */
+ static int
+ SyncRepGetStandbyPriority(void)
+ {
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ int priority = 0;
+ bool found = false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(SyncRepStandbyNames);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid list syntax for parameter \"synchronous_standby_names\"")));
+ return 0;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ priority++;
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return (found ? priority : 0);
+ }
+
+ /*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold SyncRepLock
+ */
+ static int
+ SyncRepWakeQueue(void)
+ {
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ PGPROC *proc;
+ int numprocs = 0;
+
+ proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
+ &(WalSndCtl->SyncRepQueue),
+ offsetof(PGPROC, syncrep_links));
+
+ while (proc)
+ {
+ /*
+ * Assume the queue is ordered by LSN
+ */
+ if (XLByteLT(walsndctl->lsn, proc->waitLSN))
+ return numprocs;
+
+ numprocs++;
+ SetLatch(&(proc->waitLatch));
+ proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
+ &(proc->syncrep_links),
+ offsetof(PGPROC, syncrep_links));
+ }
+
+ return numprocs;
+ }
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 66,72 ****
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
! static WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
--- 66,72 ----
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
! WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
***************
*** 174,179 **** WalSenderMain(void)
--- 174,181 ----
SpinLockRelease(&walsnd->mutex);
}
+ SyncRepInitConfig();
+
/* Main loop of walsender */
return WalSndLoop();
}
***************
*** 584,589 **** ProcessStandbyReplyMessage(void)
--- 586,593 ----
walsnd->apply = reply.apply;
SpinLockRelease(&walsnd->mutex);
}
+
+ SyncRepReleaseWaiters();
}
/*
***************
*** 700,705 **** WalSndLoop(void)
--- 704,710 ----
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
+ SyncRepInitConfig();
}
/*
***************
*** 771,777 **** WalSndLoop(void)
--- 776,787 ----
* that point might wait for some time.
*/
if (MyWalSnd->state == WALSNDSTATE_CATCHUP && caughtup)
+ {
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" has now caught up with primary",
+ application_name)));
WalSndSetState(WALSNDSTATE_STREAMING);
+ }
ProcessRepliesIfAny();
}
***************
*** 1238,1243 **** WalSndShmemInit(void)
--- 1248,1255 ----
/* First time through, so initialize */
MemSet(WalSndCtl, 0, WalSndShmemSize());
+ SHMQueueInit(&(WalSndCtl->SyncRepQueue));
+
for (i = 0; i < max_wal_senders; i++)
{
WalSnd *walsnd = &WalSndCtl->walsnds[i];
***************
*** 1304,1315 **** WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
! #define PG_STAT_GET_WAL_SENDERS_COLS 6
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int i;
/* check to see if caller supports us returning a tuplestore */
--- 1316,1330 ----
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
! #define PG_STAT_GET_WAL_SENDERS_COLS 8
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
MemoryContext per_query_ctx;
MemoryContext oldcontext;
+ int sync_priority[max_wal_senders];
+ int priority = 0;
+ int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
***************
*** 1337,1342 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
--- 1352,1384 ----
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Get the priorities of sync standbys all in one go, to minimise
+ * lock acquisitions and to allow us to evaluate who is the current
+ * sync standby.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0)
+ {
+ sync_priority[i] = walsnd->sync_standby_priority;
+
+ if (walsnd->state == WALSNDSTATE_STREAMING &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority > walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ sync_standby = i;
+ }
+ }
+ }
+ LWLockRelease(SyncRepLock);
+
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
***************
*** 1370,1380 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* Only superusers can see details. Other users only get
* the pid value to know it's a walsender, but no details.
*/
! nulls[1] = true;
! nulls[2] = true;
! nulls[3] = true;
! nulls[4] = true;
! nulls[5] = true;
}
else
{
--- 1412,1418 ----
* Only superusers can see details. Other users only get
* the pid value to know it's a walsender, but no details.
*/
! MemSet(&nulls[1], true, PG_STAT_GET_WAL_SENDERS_COLS - 1);
}
else
{
***************
*** 1401,1406 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
--- 1439,1457 ----
snprintf(location, sizeof(location), "%X/%X",
apply.xlogid, apply.xrecoff);
values[5] = CStringGetTextDatum(location);
+
+ values[6] = Int32GetDatum(sync_priority[i]);
+
+ /*
+ * More easily understood version of standby state.
+ * This is purely informational; it conveys no information beyond the priority.
+ */
+ if (sync_priority[i] == 0)
+ values[7] = CStringGetTextDatum("ASYNC");
+ else if (i == sync_standby)
+ values[7] = CStringGetTextDatum("SYNC");
+ else
+ values[7] = CStringGetTextDatum("POTENTIAL");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
*** a/src/backend/storage/ipc/shmqueue.c
--- b/src/backend/storage/ipc/shmqueue.c
***************
*** 104,110 **** SHMQueueInsertBefore(SHM_QUEUE *queue, SHM_QUEUE *elem)
* element. Inserting "after" the queue head puts the elem
* at the head of the queue.
*/
- #ifdef NOT_USED
void
SHMQueueInsertAfter(SHM_QUEUE *queue, SHM_QUEUE *elem)
{
--- 104,109 ----
***************
*** 118,124 **** SHMQueueInsertAfter(SHM_QUEUE *queue, SHM_QUEUE *elem)
queue->next = elem;
nextPtr->prev = elem;
}
- #endif /* NOT_USED */
/*--------------------
* SHMQueueNext -- Get the next element from a queue
--- 117,122 ----
***************
*** 156,161 **** SHMQueueNext(const SHM_QUEUE *queue, const SHM_QUEUE *curElem, Size linkOffset)
--- 154,178 ----
return (Pointer) (((char *) elemPtr) - linkOffset);
}
+ /*--------------------
+ * SHMQueuePrev -- Get the previous element from a queue
+ *
+ * Same as SHMQueueNext, just starting at tail and moving towards head
+ * All other comments and usage applies.
+ */
+ Pointer
+ SHMQueuePrev(const SHM_QUEUE *queue, const SHM_QUEUE *curElem, Size linkOffset)
+ {
+ SHM_QUEUE *elemPtr = curElem->prev;
+
+ Assert(ShmemAddrIsValid(curElem));
+
+ if (elemPtr == queue) /* back to the queue head? */
+ return NULL;
+
+ return (Pointer) (((char *) elemPtr) - linkOffset);
+ }
+
/*
* SHMQueueEmpty -- TRUE if queue head is only element, FALSE otherwise
*/
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 39,44 ****
--- 39,45 ----
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+ #include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
***************
*** 196,201 **** InitProcGlobal(void)
--- 197,203 ----
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
***************
*** 214,219 **** InitProcGlobal(void)
--- 216,222 ----
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
***************
*** 224,229 **** InitProcGlobal(void)
--- 227,233 ----
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&AuxiliaryProcs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
***************
*** 326,331 **** InitProcess(void)
--- 330,341 ----
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
***************
*** 365,370 **** InitProcessPhase2(void)
--- 375,381 ----
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 2628,2633 **** die(SIGNAL_ARGS)
--- 2628,2638 ----
ProcDiePending = true;
/*
+ * Set this proc's wait latch to stop waiting
+ */
+ SetLatch(&(MyProc->waitLatch));
+
+ /*
* If it's safe to interrupt, and we're waiting for input or a lock,
* service the interrupt immediately
*/
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 55,60 ****
--- 55,61 ----
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+ #include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
***************
*** 754,759 **** static struct config_bool ConfigureNamesBool[] =
--- 755,768 ----
true, NULL, NULL
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_REPLICATION,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+ {
{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Continues processing past damaged page headers."),
gettext_noop("Detection of a damaged page header normally causes PostgreSQL to "
***************
*** 2717,2722 **** static struct config_string ConfigureNamesString[] =
--- 2726,2741 ----
},
{
+ {"synchronous_standby_names", PGC_SIGHUP, WAL_REPLICATION,
+ gettext_noop("List of potential standby names to synchronise with."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &SyncRepStandbyNames,
+ "", NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 184,190 ****
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
! # - Streaming Replication -
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
--- 184,199 ----
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
! # - Replication - User Settings
!
! #synchronous_replication = off # does commit wait for reply from standby
!
! # - Streaming Replication - Server Settings
!
! #synchronous_standby_names = '' # standby servers that provide sync rep
! # comma-separated list of application_name from standby(s);
! # '*' = all
!
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2542,2548 **** DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
! DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25}" "{o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,replay_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
--- 2542,2548 ----
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
! DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25,23,25}" "{o,o,o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
*** /dev/null
--- b/src/include/replication/syncrep.h
***************
*** 0 ****
--- 1,37 ----
+ /*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2011, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef _SYNCREP_H
+ #define _SYNCREP_H
+
+ #include "access/xlog.h"
+ #include "storage/proc.h"
+ #include "storage/shmem.h"
+ #include "storage/spin.h"
+
+ #define SyncRepRequested() (sync_rep_mode)
+
+ /* user-settable parameters for synchronous replication */
+ extern bool sync_rep_mode;
+ extern int sync_rep_timeout;
+ extern char *SyncRepStandbyNames;
+
+ /* called by user backend */
+ extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+ /* callback at backend exit */
+ extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+ /* called by wal sender */
+ extern void SyncRepInitConfig(void);
+ extern void SyncRepReleaseWaiters(void);
+
+ #endif /* _SYNCREP_H */
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 15,20 ****
--- 15,21 ----
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+ #include "replication/syncrep.h"
#include "storage/spin.h"
***************
*** 52,62 **** typedef struct WalSnd
--- 53,84 ----
* to do.
*/
Latch latch;
+
+ /*
+ * The priority order of the standby managed by this WALSender, as
+ * listed in synchronous_standby_names, or 0 if not listed.
+ * Protected by SyncRepLock.
+ */
+ int sync_standby_priority;
} WalSnd;
+ extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Synchronous replication queue. Protected by SyncRepLock.
+ */
+ SHM_QUEUE SyncRepQueue;
+
+ /*
+ * Current location of the head of the queue. All waiters should have
+ * a waitLSN that follows this value, or they are currently being woken
+ * to remove themselves from the queue. Protected by SyncRepLock.
+ */
+ XLogRecPtr lsn;
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 78,83 **** typedef enum LWLockId
--- 78,84 ----
SerializableFinishedListLock,
SerializablePredicateLockListLock,
OldSerXidLock,
+ SyncRepLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 14,19 ****
--- 14,21 ----
#ifndef _PROC_H_
#define _PROC_H_
+ #include "access/xlog.h"
+ #include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
***************
*** 115,120 **** struct PGPROC
--- 117,128 ----
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+
+ SHM_QUEUE syncrep_links; /* list link if process is in syncrep list */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
*** a/src/include/storage/shmem.h
--- b/src/include/storage/shmem.h
***************
*** 67,74 **** extern void SHMQueueInit(SHM_QUEUE *queue);
--- 67,77 ----
extern void SHMQueueElemInit(SHM_QUEUE *queue);
extern void SHMQueueDelete(SHM_QUEUE *queue);
extern void SHMQueueInsertBefore(SHM_QUEUE *queue, SHM_QUEUE *elem);
+ extern void SHMQueueInsertAfter(SHM_QUEUE *queue, SHM_QUEUE *elem);
extern Pointer SHMQueueNext(const SHM_QUEUE *queue, const SHM_QUEUE *curElem,
Size linkOffset);
+ extern Pointer SHMQueuePrev(const SHM_QUEUE *queue, const SHM_QUEUE *curElem,
+ Size linkOffset);
extern bool SHMQueueEmpty(const SHM_QUEUE *queue);
extern bool SHMQueueIsDetached(const SHM_QUEUE *queue);
*** a/src/test/regress/expected/rules.out
--- b/src/test/regress/expected/rules.out
***************
*** 1298,1304 **** SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts, pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
! pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_hostname, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.replay_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, replay_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
--- 1298,1304 ----
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts, pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
! pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_hostname, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.replay_location, w.sync_priority, w.sync_state FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
sync_rep.v21.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8684414..8dd8c14 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2018,6 +2018,92 @@ SET ENABLE_SEQSCAN TO OFF;
</variablelist>
</sect2>
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until a
+ reply from the current synchronous standby indicates it has received
+ the commit record of the transaction. Synchronous standbys must
+ already have been defined (see <xref linkend="guc-sync-standby-names">).
+ </para>
+ <para>
+ This parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-sync-standby-names" xreflabel="synchronous_standby_names">
+ <term><varname>synchronous_standby_names</varname> (<type>string</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_standby_names</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies a priority ordered list of standby names that can offer
+ synchronous replication. At any one time there will be just one
+ synchronous standby that will wake sleeping users following commit.
+ The synchronous standby will be the first named standby that is
+ both currently connected and streaming WAL in real-time
+ (as shown by a state of "STREAMING"). Other standby servers
+ listed later will become potential synchronous standbys.
+ If the current synchronous standby disconnects for whatever reason
+ it will be replaced immediately with the next highest priority standby.
+ Specifying more than one standby name can allow very high availability.
+ </para>
+ <para>
+ The standby name is currently taken as the application_name of the
+ standby, as set in the primary_conninfo on the standby. Names are
+ not enforced for uniqueness. In case of duplicates one of the standbys
+ will be chosen to be the synchronous standby, though exactly which
+ one is indeterminate.
+ </para>
+ <para>
+ The default is the special entry <literal>*</> which matches any
+ application_name, including the default application name of
+ <literal>walsender</>. This is not recommended; a more carefully
+ considered configuration is desirable.
+ </para>
+ <para>
+ If a standby is removed from the list of servers then it will stop
+ being the synchronous standby, allowing another to take its place.
+ If the list is empty, synchronous replication will not be
+ possible, whatever the setting of <varname>synchronous_replication</>.
+ Standbys may also be added to the list without restarting the server.
+ </para>
+ </listitem>
+ </varlistentry>
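For example, a hypothetical priority list in postgresql.conf (the names must match each standby's application_name; here 'london' would be the synchronous standby and 'paris' would take over if it fails):

    synchronous_standby_names = 'london, paris'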
+
+ </variablelist>
+ </sect2>
+
<sect2 id="runtime-config-standby">
<title>Standby Servers</title>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 37ba43b..176a725 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -875,6 +875,209 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to one synchronous standby
+ server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ When requesting synchronous replication, each commit of a
+ write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability, though only
+ if the sysadmin is cautious about the placement and management of the two
+ servers. Waiting for confirmation increases the user's confidence that the
+ changes will not be lost in the event of server crashes but it also
+ necessarily increases the response time for the requesting transaction.
+ The minimum wait time is the roundtrip time between primary and standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message. All two-phase commit actions
+ require commit waits, including both prepare and commit.
+ </para>
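A sketch of how the prepare/commit waits mentioned above look from a client session, assuming synchronous_replication is on in the session, max_prepared_transactions is greater than zero, and a hypothetical table:

    BEGIN;
    INSERT INTO payments (id, amount) VALUES (1, 100);
    PREPARE TRANSACTION 'payment_1';   -- waits for the sync standby
    COMMIT PREPARED 'payment_1';       -- waits again for the commit record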
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ All parameters have useful default values, so we can enable
+ synchronous replication easily just by setting this on the primary:
+
+<programlisting>
+synchronous_replication = on
+</programlisting>
+
+ When <varname>synchronous_replication</> is set, a commit will wait
+ for confirmation that the standby has received the commit record,
+ even if that takes a very long time.
+ <varname>synchronous_replication</> can be set by individual
+ users, and so can be configured in the configuration file, for particular
+ users or databases, or dynamically by application programs.
+ </para>
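A sketch of configuring this for particular users or databases rather than globally, as the paragraph above suggests; the role and database names are illustrative only:

    ALTER ROLE payments_app SET synchronous_replication = on;
    ALTER DATABASE orders SET synchronous_replication = on;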
+
+ <para>
+ After a commit record has been written to disk on the primary the
+ WAL record is then sent to the standby. The standby sends reply
+ messages each time a new batch of WAL data is received, unless
+ <varname>wal_receiver_status_interval</> is set to zero on the standby.
+ If the standby is the first matching standby, as specified in
+ <varname>synchronous_standby_names</> on the primary, the reply
+ messages from that standby will be used to wake users waiting for
+ confirmation that the commit record has been received. These parameters
+ allow the administrator to specify which standby servers should be
+ synchronous standbys. Note that the configuration of synchronous
+ replication is mainly on the master.
+ </para>
+
+ <para>
+ Users will stop waiting if a fast shutdown is requested, though the
+ server does not fully shut down until all outstanding WAL records are
+ transferred to standby servers.
+ </para>
+
+ <para>
+ Note also that <varname>synchronous_commit</> is used when the user
+ specifies <varname>synchronous_replication</>, overriding even an
+ explicit setting of <varname>synchronous_commit</> to <literal>off</>.
+ This is because we must write WAL to disk on primary before we replicate
+ to ensure the standby never gets ahead of the primary.
+ </para>
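A brief illustration of the interaction just described: even with asynchronous local commit configured, requesting replication forces the commit record to be flushed to local disk before the wait begins.

    SET synchronous_commit TO off;
    SET synchronous_replication TO on;
    -- commits in this session still flush WAL to local disk first,
    -- then wait for the standby's reply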
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until
+ the sync standby responds. The response may never occur if the last,
+ or only, standby should crash.
+ </para>
+
+ <para>
+ The best solution for avoiding data loss is to ensure you don't lose
+ your last remaining sync standby. This can be achieved by naming multiple
+ potential synchronous standbys using <varname>synchronous_standby_names</>.
+ The first named standby will be used as the synchronous standby. Standbys
+ listed after this will take over the role of synchronous standby if the
+ first one should fail.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it will not yet be properly
+ synchronized. This is described as <literal>CATCHUP</> mode. Once
+ the lag between standby and primary reaches zero for the first time
+ we move to real-time <literal>STREAMING</> state.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shut down, then the catch-up period
+ will increase according to the length of time the standby has been down.
+ The standby is only able to become a synchronous standby
+ once it has reached <literal>STREAMING</> state.
+ </para>
+
+ <para>
+ If the primary restarts while commits are waiting for acknowledgement, those
+ waiting transactions will be marked fully committed once the primary
+ database recovers.
+ There is no way to be certain that all standbys have received all
+ outstanding WAL data at time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby.
+ </para>
+
+ <para>
+ If you really do lose your last standby server then you should disable
+ <varname>synchronous_standby_names</> and restart the primary server.
+ </para>
+
+ <para>
+ If the primary is isolated from the remaining standby servers you should
+ fail over to the best candidate of those remaining standby servers.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that the commands to run pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
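A sketch of the kind of session suggested above, with a hypothetical backup label:

    SET synchronous_replication TO off;
    SELECT pg_start_backup('rebuild_standby');
    -- copy the data directory to the new standby here
    SELECT pg_stop_backup();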
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index aaa613e..319a57c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -306,8 +306,11 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
location. In addition, the standby reports the last transaction log
position it received and wrote, the last position it flushed to disk,
and the last position it replayed, and this information is also
- displayed here. The columns detailing what exactly the connection is
- doing are only visible if the user examining the view is a superuser.
+ displayed here. If the standby's application name matches one of the
+ names in <varname>synchronous_standby_names</> then the sync_priority
+ is also shown here; this is the order in which standbys will become
+ the synchronous standby. The columns detailing what exactly the connection
+ is doing are only visible if the user examining the view is a superuser.
The client's hostname will be available only if
<xref linkend="guc-log-hostname"> is set or if the user's hostname
needed to be looked up during <filename>pg_hba.conf</filename>
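A sketch of a query against the extended view, showing which standby is currently SYNC and which are POTENTIAL (the detail columns are only visible to superusers):

    SELECT application_name, state, sync_priority, sync_state
      FROM pg_stat_replication
     ORDER BY sync_priority;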
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 287ad26..729c7b7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -56,6 +56,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/predicate.h"
#include "storage/procarray.h"
@@ -1071,6 +1072,14 @@ EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked the prepare, but still show as
+ * running in the procarray (twice!) and continue to hold locks.
+ */
+ SyncRepWaitForLSN(gxact->prepare_lsn);
+
records.tail = records.head = NULL;
}
@@ -2030,6 +2039,14 @@ RecordTransactionCommitPrepared(TransactionId xid,
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
@@ -2109,4 +2126,12 @@ RecordTransactionAbortPrepared(TransactionId xid,
TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4b40701..c8b582c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -37,6 +37,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -1055,7 +1056,7 @@ RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
- if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
+ if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
@@ -1125,6 +1126,14 @@ RecordTransactionCommit(void)
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c7f43af..3f7d7d9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -520,7 +520,9 @@ CREATE VIEW pg_stat_replication AS
W.sent_location,
W.write_location,
W.flush_location,
- W.replay_location
+ W.replay_location,
+ W.sync_priority,
+ W.sync_state
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7307c41..efc8e7c 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1527,6 +1527,13 @@ AutoVacWorkerMain(int argc, char *argv[])
SetConfigOption("statement_timeout", "0", PGC_SUSET, PGC_S_OVERRIDE);
/*
+ * Force synchronous replication off to allow regular maintenance even
+ * if we are waiting for standbys to connect. This is important to
+ * ensure we aren't blocked from performing anti-wraparound tasks.
+ */
+ SetConfigOption("synchronous_replication", "off", PGC_SUSET, PGC_S_OVERRIDE);
+
+ /*
* Get the info about the database we're going to work on.
*/
LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 42c6eaf..3fe490e 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
- repl_gram.o
+ repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
new file mode 100644
index 0000000..a181f29
--- /dev/null
+++ b/src/backend/replication/syncrep.c
@@ -0,0 +1,460 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the sync standby.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary defines which standbys
+ * it wishes to wait for. The standby is completely unaware of the
+ * durability requirements of transactions on the primary, reducing the
+ * complexity of the code and streamlining both standby operations and
+ * network bandwidth because there is no requirement to ship
+ * per-transaction state information.
+ *
+ * The bookkeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then in 9.1 we wait for the flush location on the
+ * standby before releasing the waiting backend. Further complexity
+ * in that interaction is expected in later releases.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching through all waiters each time we receive a reply.
+ *
+ * In 9.1 we support only a single synchronous standby, chosen from a
+ * priority list of synchronous_standby_names. Before it can become the
+ * synchronous standby it must have caught up with the primary; that may
+ * take some time. Once caught up, the current highest priority standby
+ * will release waiters from the queue.
+ *
+ * Portions Copyright (c) 2010-2011, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/xlog_internal.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/guc_tables.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+/* User-settable parameters for sync rep */
+bool sync_rep_mode = false; /* Only set in user backends */
+char *SyncRepStandbyNames;
+
+static bool announce_next_takeover = true;
+
+static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN);
+static void SyncRepQueueInsert(void);
+
+static int SyncRepGetStandbyPriority(void);
+static int SyncRepWakeQueue(void);
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+/*
+ * Wait for synchronous replication, if requested by user.
+ */
+void
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ /*
+ * Fast exit if user has not requested sync replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (!SyncRepRequested() || max_wal_senders == 0)
+ return;
+
+ /*
+ * Wait on queue. We check for a fast exit once we have the lock.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN);
+}
+
+void
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+ if (!SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+ }
+
+ if (MyProc != NULL)
+ DisownLatch(&MyProc->waitLatch);
+}
+
+/*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+static void
+SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+
+ Assert(SHMQueueIsDetached(&(MyProc->syncrep_links)));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+ /*
+ * First time through, add ourselves to the queue.
+ */
+ if (SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ /*
+ * Wait no longer if we have already reached our LSN
+ */
+ if (XLByteLE(XactCommitLSN, walsndctl->lsn))
+ {
+ /* No need to wait */
+ LWLockRelease(SyncRepLock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SyncRepQueueInsert();
+ LWLockRelease(SyncRepLock);
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ if (update_process_title)
+ {
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 32 + 1);
+ memcpy(new_status, old_status, len);
+ sprintf(new_status + len, " waiting for %X/%X",
+ XactCommitLSN.xlogid, XactCommitLSN.xrecoff);
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting ..." */
+ }
+ }
+ else
+ {
+ /*
+ * Check the LSN on our queue and if it's moved far enough then
+ * remove ourselves from the queue. The first time through this is
+ * unlikely to have moved far enough, yet it is possible. The next
+ * time we are woken we should have better luck.
+ */
+ if (XLByteLE(XactCommitLSN, walsndctl->lsn))
+ {
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+
+ /*
+ * Reset our waitLSN.
+ */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ ereport(DEBUG3,
+ (errmsg("synchronous replication wait for %X/%X complete at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(GetCurrentTimestamp()))));
+ return;
+ }
+
+ LWLockRelease(SyncRepLock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, -1);
+ }
+}
+
+/*
+ * Insert MyProc into SyncRepQueue, maintaining sorted invariant.
+ *
+ * Usually we will go at the tail of the queue, though it's possible that we arrive
+ * here out of order, so start at tail and work back to insertion point.
+ */
+static void
+SyncRepQueueInsert(void)
+{
+ PGPROC *proc;
+
+ proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
+ &(WalSndCtl->SyncRepQueue),
+ offsetof(PGPROC, syncrep_links));
+
+ while (proc)
+ {
+ /*
+ * Stop at the queue element that we should insert after, to
+ * ensure the queue is ordered by LSN.
+ */
+ if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
+ break;
+
+ proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
+ &(proc->syncrep_links),
+ offsetof(PGPROC, syncrep_links));
+ }
+
+ if (proc)
+ SHMQueueInsertAfter(&(proc->syncrep_links), &(MyProc->syncrep_links));
+ else
+ SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue), &(MyProc->syncrep_links));
+}
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+/*
+ * Take any action required to initialise sync rep state from config
+ * data. Called at WALSender startup and after each SIGHUP.
+ */
+void
+SyncRepInitConfig(void)
+{
+ int priority;
+
+ /*
+ * Determine if we are a potential sync standby and remember the result
+ * for handling replies from standby.
+ */
+ priority = SyncRepGetStandbyPriority();
+ if (MyWalSnd->sync_standby_priority != priority)
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ MyWalSnd->sync_standby_priority = priority;
+ LWLockRelease(SyncRepLock);
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" now has synchronous standby priority %u",
+ application_name, priority)));
+ }
+}
+
+/*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and what
+ * perhaps also which information we store as well.
+ */
+void
+SyncRepReleaseWaiters(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile WalSnd *syncWalSnd = NULL;
+ int numprocs = 0;
+ int priority = 0;
+ int i;
+
+ /*
+ * If this WALSender is serving a standby that is not on the list of
+ * potential standbys then we have nothing to do. If we are still
+ * starting up or still running base backup, then leave quickly also.
+ */
+ if (MyWalSnd->sync_standby_priority == 0 ||
+ MyWalSnd->state < WALSNDSTATE_STREAMING)
+ return;
+
+ /*
+ * We're a potential sync standby. Release waiters if we are the
+ * highest priority standby.
+ */
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &walsndctl->walsnds[i];
+
+ if (walsnd->pid != 0 &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority > walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ syncWalSnd = walsnd;
+ }
+ }
+
+ /*
+ * At the very least we should have found ourselves.
+ */
+ Assert(syncWalSnd);
+
+ /*
+ * If we aren't managing the highest priority standby then just leave.
+ */
+ if (syncWalSnd != MyWalSnd)
+ {
+ LWLockRelease(SyncRepLock);
+ announce_next_takeover = true;
+ return;
+ }
+
+ if (XLByteLT(walsndctl->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ walsndctl->lsn = MyWalSnd->flush;
+ numprocs = SyncRepWakeQueue();
+ }
+
+ LWLockRelease(SyncRepLock);
+
+ elog(DEBUG3, "released %d procs up to %X/%X",
+ numprocs,
+ MyWalSnd->flush.xlogid,
+ MyWalSnd->flush.xrecoff);
+
+ /*
+ * If we are managing the highest priority standby, though we weren't
+ * prior to this, then announce we are now the sync standby.
+ */
+ if (announce_next_takeover)
+ {
+ announce_next_takeover = false;
+ ereport(LOG,
+ (errmsg("standby \"%s\" is now the synchronous standby with priority %u",
+ application_name, MyWalSnd->sync_standby_priority)));
+ }
+}
+
+/*
+ * Check if we are in the list of sync standbys, and if so, determine
+ * priority sequence. Return priority if set, or zero to indicate that
+ * we are not a potential sync standby.
+ *
+ * Compare the parameter SyncRepStandbyNames against the application_name
+ * for this WALSender, or allow any name if we find a wildcard "*".
+ */
+static int
+SyncRepGetStandbyPriority(void)
+{
+ char *rawstring;
+ List *elemlist;
+ ListCell *l;
+ int priority = 0;
+ bool found = false;
+
+ /* Need a modifiable copy of string */
+ rawstring = pstrdup(SyncRepStandbyNames);
+
+ /* Parse string into list of identifiers */
+ if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ {
+ /* syntax error in list */
+ pfree(rawstring);
+ list_free(elemlist);
+ ereport(FATAL,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid list syntax for parameter \"synchronous_standby_names\"")));
+ return 0;
+ }
+
+ foreach(l, elemlist)
+ {
+ char *standby_name = (char *) lfirst(l);
+
+ priority++;
+
+ if (pg_strcasecmp(standby_name, application_name) == 0 ||
+ pg_strcasecmp(standby_name, "*") == 0)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ pfree(rawstring);
+ list_free(elemlist);
+
+ return (found ? priority : 0);
+}
+
+/*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold SyncRepLock
+ */
+static int
+SyncRepWakeQueue(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ PGPROC *proc;
+ int numprocs = 0;
+
+ proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
+ &(WalSndCtl->SyncRepQueue),
+ offsetof(PGPROC, syncrep_links));
+
+ while (proc)
+ {
+ /*
+ * Assume the queue is ordered by LSN
+ */
+ if (XLByteLT(walsndctl->lsn, proc->waitLSN))
+ return numprocs;
+
+ numprocs++;
+ SetLatch(&(proc->waitLatch));
+ proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
+ &(proc->syncrep_links),
+ offsetof(PGPROC, syncrep_links));
+ }
+
+ return numprocs;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 49b49d2..46f7774 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -66,7 +66,7 @@
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
-static WalSnd *MyWalSnd = NULL;
+WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
@@ -174,6 +174,8 @@ WalSenderMain(void)
SpinLockRelease(&walsnd->mutex);
}
+ SyncRepInitConfig();
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -584,6 +586,8 @@ ProcessStandbyReplyMessage(void)
walsnd->apply = reply.apply;
SpinLockRelease(&walsnd->mutex);
}
+
+ SyncRepReleaseWaiters();
}
/*
@@ -700,6 +704,7 @@ WalSndLoop(void)
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
+ SyncRepInitConfig();
}
/*
@@ -771,7 +776,12 @@ WalSndLoop(void)
* that point might wait for some time.
*/
if (MyWalSnd->state == WALSNDSTATE_CATCHUP && caughtup)
+ {
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" has now caught up with primary",
+ application_name)));
WalSndSetState(WALSNDSTATE_STREAMING);
+ }
ProcessRepliesIfAny();
}
@@ -1238,6 +1248,8 @@ WalSndShmemInit(void)
/* First time through, so initialize */
MemSet(WalSndCtl, 0, WalSndShmemSize());
+ SHMQueueInit(&(WalSndCtl->SyncRepQueue));
+
for (i = 0; i < max_wal_senders; i++)
{
WalSnd *walsnd = &WalSndCtl->walsnds[i];
@@ -1304,12 +1316,15 @@ WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 6
+#define PG_STAT_GET_WAL_SENDERS_COLS 8
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
MemoryContext per_query_ctx;
MemoryContext oldcontext;
+ int sync_priority[max_wal_senders];
+ int priority = 0;
+ int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
@@ -1337,6 +1352,33 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Get the priorities of sync standbys all in one go, to minimise
+ * lock acquisitions and to allow us to evaluate who is the current
+ * sync standby.
+ */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ if (walsnd->pid != 0)
+ {
+ sync_priority[i] = walsnd->sync_standby_priority;
+
+ if (walsnd->state == WALSNDSTATE_STREAMING &&
+ walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority > walsnd->sync_standby_priority))
+ {
+ priority = walsnd->sync_standby_priority;
+ sync_standby = i;
+ }
+ }
+ }
+ LWLockRelease(SyncRepLock);
+
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
@@ -1370,11 +1412,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
* Only superusers can see details. Other users only get
* the pid value to know it's a walsender, but no details.
*/
- nulls[1] = true;
- nulls[2] = true;
- nulls[3] = true;
- nulls[4] = true;
- nulls[5] = true;
+ MemSet(&nulls[1], true, PG_STAT_GET_WAL_SENDERS_COLS - 1);
}
else
{
@@ -1401,6 +1439,19 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
snprintf(location, sizeof(location), "%X/%X",
apply.xlogid, apply.xrecoff);
values[5] = CStringGetTextDatum(location);
+
+ values[6] = Int32GetDatum(sync_priority[i]);
+
+ /*
+ * More easily understood version of standby state.
+ * This is purely informational, not different from priority.
+ */
+ if (sync_priority[i] == 0)
+ values[7] = CStringGetTextDatum("ASYNC");
+ else if (i == sync_standby)
+ values[7] = CStringGetTextDatum("SYNC");
+ else
+ values[7] = CStringGetTextDatum("POTENTIAL");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/storage/ipc/shmqueue.c b/src/backend/storage/ipc/shmqueue.c
index 1cf69a0..5d684b2 100644
--- a/src/backend/storage/ipc/shmqueue.c
+++ b/src/backend/storage/ipc/shmqueue.c
@@ -104,7 +104,6 @@ SHMQueueInsertBefore(SHM_QUEUE *queue, SHM_QUEUE *elem)
* element. Inserting "after" the queue head puts the elem
* at the head of the queue.
*/
-#ifdef NOT_USED
void
SHMQueueInsertAfter(SHM_QUEUE *queue, SHM_QUEUE *elem)
{
@@ -118,7 +117,6 @@ SHMQueueInsertAfter(SHM_QUEUE *queue, SHM_QUEUE *elem)
queue->next = elem;
nextPtr->prev = elem;
}
-#endif /* NOT_USED */
/*--------------------
* SHMQueueNext -- Get the next element from a queue
@@ -156,6 +154,25 @@ SHMQueueNext(const SHM_QUEUE *queue, const SHM_QUEUE *curElem, Size linkOffset)
return (Pointer) (((char *) elemPtr) - linkOffset);
}
+/*--------------------
+ * SHMQueuePrev -- Get the previous element from a queue
+ *
+ * Same as SHMQueueNext, just starting at tail and moving towards head
+ * All other comments and usage apply.
+ */
+Pointer
+SHMQueuePrev(const SHM_QUEUE *queue, const SHM_QUEUE *curElem, Size linkOffset)
+{
+ SHM_QUEUE *elemPtr = curElem->prev;
+
+ Assert(ShmemAddrIsValid(curElem));
+
+ if (elemPtr == queue) /* back to the queue head? */
+ return NULL;
+
+ return (Pointer) (((char *) elemPtr) - linkOffset);
+}
+
/*
* SHMQueueEmpty -- TRUE if queue head is only element, FALSE otherwise
*/
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index afaf599..8c2660c 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -39,6 +39,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -196,6 +197,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -214,6 +216,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -224,6 +227,7 @@ InitProcGlobal(void)
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&AuxiliaryProcs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
@@ -326,6 +330,12 @@ InitProcess(void)
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -365,6 +375,7 @@ InitProcessPhase2(void)
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 39b7b5b..b4163e0 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2628,6 +2628,11 @@ die(SIGNAL_ARGS)
ProcDiePending = true;
/*
+ * Set this proc's wait latch to stop waiting
+ */
+ SetLatch(&(MyProc->waitLatch));
+
+ /*
* If it's safe to interrupt, and we're waiting for input or a lock,
* service the interrupt immediately
*/
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 529148a..b9568a4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -55,6 +55,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -754,6 +755,14 @@ static struct config_bool ConfigureNamesBool[] =
true, NULL, NULL
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_REPLICATION,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+ {
{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
gettext_noop("Continues processing past damaged page headers."),
gettext_noop("Detection of a damaged page header normally causes PostgreSQL to "
@@ -2717,6 +2726,16 @@ static struct config_string ConfigureNamesString[] =
},
{
+ {"synchronous_standby_names", PGC_SIGHUP, WAL_REPLICATION,
+ gettext_noop("List of potential standby names to synchronise with."),
+ NULL,
+ GUC_LIST_INPUT
+ },
+ &SyncRepStandbyNames,
+ "", NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6bfd0fd..ed70223 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -184,7 +184,16 @@
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
-# - Streaming Replication -
+# - Replication - User Settings
+
+#synchronous_replication = off # does commit wait for reply from standby
+
+# - Streaming Replication - Server Settings
+
+#synchronous_standby_names = '' # standby servers that provide sync rep
+ # comma-separated list of application_name from standby(s);
+ # '*' = all
+
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 96a4633..0533e5a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2542,7 +2542,7 @@ DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,25,23}" "{i,o,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_hostname,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25}" "{o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,replay_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25,23,25}" "{o,o,o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,replay_location,sync_priority,sync_state}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
new file mode 100644
index 0000000..d788fe5
--- /dev/null
+++ b/src/include/replication/syncrep.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _SYNCREP_H
+#define _SYNCREP_H
+
+#include "access/xlog.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+#define SyncRepRequested() (sync_rep_mode)
+
+/* user-settable parameters for synchronous replication */
+extern bool sync_rep_mode;
+extern int sync_rep_timeout;
+extern char *SyncRepStandbyNames;
+
+/* called by user backend */
+extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* callback at backend exit */
+extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+/* called by wal sender */
+extern void SyncRepInitConfig(void);
+extern void SyncRepReleaseWaiters(void);
+
+#endif /* _SYNCREP_H */
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 5843307..8a8c939 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -15,6 +15,7 @@
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+#include "replication/syncrep.h"
#include "storage/spin.h"
@@ -52,11 +53,32 @@ typedef struct WalSnd
* to do.
*/
Latch latch;
+
+ /*
+ * The priority order of the standby managed by this WALSender, as
+ * listed in synchronous_standby_names, or 0 if not-listed.
+ * Protected by SyncRepLock.
+ */
+ int sync_standby_priority;
} WalSnd;
+extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Synchronous replication queue. Protected by SyncRepLock.
+ */
+ SHM_QUEUE SyncRepQueue;
+
+ /*
+ * Current location of the head of the queue. All waiters should have
+ * a waitLSN that follows this value, or they are currently being woken
+ * to remove themselves from the queue. Protected by SyncRepLock.
+ */
+ XLogRecPtr lsn;
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ad0bcd7..438a48d 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -78,6 +78,7 @@ typedef enum LWLockId
SerializableFinishedListLock,
SerializablePredicateLockListLock,
OldSerXidLock,
+ SyncRepLock,
/* Individual lock IDs end here */
FirstBufMappingLock,
FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 78dbade..091b213 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,8 @@
#ifndef _PROC_H_
#define _PROC_H_
+#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
@@ -115,6 +117,12 @@ struct PGPROC
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+
+ SHM_QUEUE syncrep_links; /* list link if process is in syncrep list */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index f23740c..0b7da77 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -67,8 +67,11 @@ extern void SHMQueueInit(SHM_QUEUE *queue);
extern void SHMQueueElemInit(SHM_QUEUE *queue);
extern void SHMQueueDelete(SHM_QUEUE *queue);
extern void SHMQueueInsertBefore(SHM_QUEUE *queue, SHM_QUEUE *elem);
+extern void SHMQueueInsertAfter(SHM_QUEUE *queue, SHM_QUEUE *elem);
extern Pointer SHMQueueNext(const SHM_QUEUE *queue, const SHM_QUEUE *curElem,
Size linkOffset);
+extern Pointer SHMQueuePrev(const SHM_QUEUE *queue, const SHM_QUEUE *curElem,
+ Size linkOffset);
extern bool SHMQueueEmpty(const SHM_QUEUE *queue);
extern bool SHMQueueIsDetached(const SHM_QUEUE *queue);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 02043ab..20cdc39 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1298,7 +1298,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts, pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
- pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_hostname, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.replay_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, replay_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
+ pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_hostname, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.replay_location, w.sync_priority, w.sync_state FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_hostname, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, replay_location, sync_priority, sync_state) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
On Sun, 2011-03-06 at 00:42 +0900, Fujii Masao wrote:
On Sat, Mar 5, 2011 at 9:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
I've added code to shmqueue.c to allow this.
New version pushed.
New comments;
None of the requested changes are in v21, as yet.
It looks odd to report the sync_state of walsender in BACKUP
state as ASYNC.
Cool.
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+ if (WaitingForSyncRep && !SHMQueueIsDetached(&(MyProc->syncrep_links)))
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ SHMQueueDelete(&(MyProc->syncrep_links));
+ LWLockRelease(SyncRepLock);
+ }
+
+ if (MyProc != NULL)
+ DisownLatch(&MyProc->waitLatch);
Can MyProc really be NULL here? If yes, "MyProc != NULL" should be
checked before dereferencing MyProc->syncrep_links.
OK
Even though postmaster dies, the waiting backend keeps waiting until
the timeout expires. Instead, the backends should periodically check
whether postmaster is alive, and then they should exit immediately
if it's not alive, as well as other process does? If the timeout is
disabled, such backends would get stuck infinitely.
Will wake them every 60 seconds
Though I commented about the issue related to shutdown, that was
pointless. So change of ProcessInterrupts is not required unless we
find the need again. Sorry for the noise..
Yep, all gone now.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Mar 6, 2011 at 12:42 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
New comments;
Another one;
+ long timeout = SyncRepGetWaitTimeout();
<snip>
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(wait_start, now, timeout))
+ {
The third argument of TimestampDifferenceExceeds is
expressed in milliseconds. But you wrongly give it in
microseconds.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, Mar 5, 2011 at 5:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On a positive note this is one less parameter and will improve
performance as well.
All above changes made.
Ready to commit, barring concrete objections to important behaviour.
I will do one final check tomorrow evening then commit.
Will retest with new version this evening. Also curious to performance
improvement, since v17 seems to be topscorer in that department.
regards,
Yeb Havinga
On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Even though postmaster dies, the waiting backend keeps waiting until
the timeout expires. Instead, the backends should periodically check
whether postmaster is alive, and then they should exit immediately
if it's not alive, as well as other process does? If the timeout is
disabled, such backends would get stuck infinitely.
Will wake them every 60 seconds
I don't really see why sync rep should be responsible for solving this
problem, which is an issue in many other situations as well, only for
itself. In fact I think I'd prefer that it didn't, and that we wait
for a more general solution that will actually fix this problem for
real.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2011-03-05 18:25, Yeb Havinga wrote:
On Sat, Mar 5, 2011 at 5:53 PM, Simon Riggs <simon@2ndquadrant.com
<mailto:simon@2ndquadrant.com>> wrote:
On a positive note this is one less parameter and will improve
performance as well.
All above changes made.
Ready to commit, barring concrete objections to important behaviour.
I will do one final check tomorrow evening then commit.
Will retest with new version this evening. Also curious to performance
improvement, since v17 seems to be topscorer in that department.
Summary of preliminary testing:
1) it is confusing to show messages/ contents of stat_replication that
hints at syncrep, when synchronous_replication is on.
2) should guc settings for synchronous_replication be changed so it can
only be set in the config file, and hence only change with reload_conf()?
3) speed is comparable to v17 :-)
regards,
Yeb Havinga
So the biggest change is perhaps that you cannot start 'working'
immediately after an initdb with synchronous_replication=on, without a
connected standby; I needed to create a role for the repuser to make a
backup, but the master halted. The initial bootstrapping has to be done
with synchronous_replication = off. So I did, made a backup, and started
the standbys while still in non-synchronous mode. What followed was confusing:
LOG: 00000: standby "standby2" is now the synchronous standby with
priority 2
postgres=# show synchronous_replication ; show
synchronous_standby_names; select application_name,state,sync_state from
pg_stat_replication;
synchronous_replication
-------------------------
off
(1 row)
synchronous_standby_names
----------------------------
standby1,standby2,standby3
(1 row)
application_name | state | sync_state
------------------+-----------+------------
standby2 | STREAMING | SYNC
asyncone | STREAMING | ASYNC
(2 rows)
Is it really sync?
pgbench test got 1464 tps.. seems a bit high.
psql postgres, set synchronous_replication = on;
- no errors, and show after disconnect showed this parameter was still
on. My guess: we have syncrep! A restart or reload config was not necessary.
pgbench test got 1460 tps.
pg_reload_conf(); with syncrep = on in postgresql.conf
pgbench test got 819 tps
So now this is synchronous.
Disabled the asynchronous standby
pgbench test got 920 tps.
I also got a first > 1000 tps score :-) (yeah you have to believe
me there really was a sync standby server)
$ pgbench -c 10 -M prepared -T 30 test
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 50
query mode: prepared
number of clients: 10
number of threads: 1
duration: 30 s
number of transactions actually processed: 30863
tps = 1027.493807 (including connections establishing)
tps = 1028.183618 (excluding connections establishing)
On 2011-03-05 21:11, Yeb Havinga wrote:
Summary of preliminary testing:
1) it is confusing to show messages/ contents of stat_replication that
hints at syncrep, when synchronous_replication is on.
s/on/off/
Also forgot to mention these tests are against the latest v21 syncrep patch.
Y
On Sat, Mar 5, 2011 at 3:11 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:
Summary of preliminary testing:
1) it is confusing to show messages/ contents of stat_replication that hints
at syncrep, when synchronous_replication is on.
[for the record, Yeb explains he means OFF, not on...]
The thing is that once you put a server name in
synchronous_standby_names, that standby is declared to be synchronous,
and because synchronous_replication can change at any time for any
given backend, we need to know the priority of any declared sync standby.
If you want pure async standbys, remove their names from
synchronous_standby_names.
2) should guc settings for synchronous_replication be changed so it can only
be set in the config file, and hence only change with reload_conf()?
No. The thing is that we can determine for which data we want to pay
the price of synchrony and for which data we don't.
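For illustration only, a minimal sketch of that per-transaction choice,
assuming the synchronous_replication GUC from this patch is otherwise on
for the session (the table name below is made up):
BEGIN;
SET LOCAL synchronous_replication TO off;  -- this transaction's commit does not wait for the standby
INSERT INTO low_value_log VALUES (now(), 'bulk row');  -- hypothetical table
COMMIT;
-- Later transactions in this session, and other sessions, still wait for
-- the sync standby as long as synchronous_replication stays on for them.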
3) speed is comparable to v17 :-)
yeah... it's a lot better than before, good work Simon :)
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL
On Sun, Mar 6, 2011 at 2:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Even though postmaster dies, the waiting backend keeps waiting until
the timeout expires. Instead, the backends should periodically check
whether postmaster is alive, and then they should exit immediately
if it's not alive, as well as other process does? If the timeout is
disabled, such backends would get stuck infinitely.
Will wake them every 60 seconds
I don't really see why sync rep should be responsible for solving this
problem, which is an issue in many other situations as well, only for
itself. In fact I think I'd prefer that it didn't, and that we wait
for a more general solution that will actually fix this problem for
real.
I agree if such a general solution will be committed together with sync rep.
Otherwise, because of sync rep, the backend can easily get stuck *infinitely*.
When postmaster is not alive, all the existing walsenders exit immediately
and no new walsender can appear. So there is no way to release the
waiting backend. I think that some solutions for this issue which is likely to
happen are required.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Mar 6, 2011 at 1:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, 2011-03-05 at 20:08 +0900, Fujii Masao wrote:
On Sat, Mar 5, 2011 at 7:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Yes, that can happen. As people will no doubt observe, this seems to be
an argument for wait-forever. What we actually need is a wait that lasts
longer than it takes for us to decide to failover, if the standby is
actually up and this is some kind of split brain situation. That way the
clients are still waiting when failover occurs. WAL is missing, but
since we didn't acknowledge the client we are OK to treat that situation
as if it were an abort.
Oracle Data Guard in the maximum availability mode behaves that way?
I'm sure that you are implementing something like the maximum availability
mode rather than the maximum protection one. So I'd like to know how
the data loss situation I described can be avoided in the maximum availability
mode.
It can't. (Oracle or otherwise...)
Once we begin waiting for sync rep, if the transaction or backend ends
then other backends will be able to see the changed data. The only way
to prevent that is to shut down the database to ensure that no readers or
writers have access to that.
Oracle's protection mechanism is to shut down the primary if there is no
sync standby available. Maximum Protection. Any other mode must
therefore be less than maximum protection, according to Oracle, and me.
"Available" here means one that has not timed out, via parameter.
Shutting down the main server is cool, as long as you fail over to one of
the standbys. If there aren't any standbys, or you don't have a
mechanism for switching quickly, you have availability problems.
What shutting down the server doesn't do is keep the data safe for
transactions that were in their commit-wait phase when the disconnect
occurs. That data exists, yet will not have been transferred to the
standby.
From now, I also say we should wait forever. It is the safest mode and I
want no argument about whether sync rep is safe or not. We can introduce
a more relaxed mode later with high availability for the primary. That
is possible and in some cases desirable.
Now, when we lose the last sync standby we have three choices:
1. reconnect the standby, or wait for a potential standby to catch up
2. immediate shutdown of master and failover to one of the standbys
3. reclassify an async standby as a sync standby
More than likely we would attempt to do (1) for a while, then do (2).
This means that when we start up, the primary will freeze for a while
until the sync standbys connect. Which is OK, since they try to
reconnect every 5 seconds and we don't plan on shutting down the primary
much anyway.
I've removed the timeout parameter, plus if we begin waiting we wait
until released, forever if that's how long it takes.
The recommendation to use more than one standby remains.
Fast shutdown will wake backends from their latch and there isn't any
changed interrupt behaviour any more.
synchronous_standby_names = '*' is no longer the default.
On a positive note this is one less parameter and will improve
performance as well.
All above changes made.
Ready to commit, barring concrete objections to important behaviour.
I will do one final check tomorrow evening then commit.
I agree with this change.
One comment; what about introducing built-in function to wake up all the
waiting backends? When replication connection is closed, if we STONITH
the standby, we can safely (for not physical data loss but logical one)
switch the primary to standalone mode. But there is no way to wake up
the waiting backends for now. Setting synchronous_replication to OFF
and reloading the configuration file doesn't affect the existing waiting
backends. The attached patch introduces the "pg_wakeup_all_waiters"
(better name?) function which wakes up all the backends on the queue.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
wakeup_waiters_v1.patchtext/x-diff; charset=US-ASCII; name=wakeup_waiters_v1.patchDownload
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 13905,13910 **** SELECT set_config('log_statement_stats', 'off', false);
--- 13905,13917 ----
<entry><type>boolean</type></entry>
<entry>Terminate a backend</entry>
</row>
+ <row>
+ <entry>
+ <literal><function>pg_wakeup_all_waiters()</function></literal>
+ </entry>
+ <entry><type>void</type></entry>
+ <entry>Wake up all the backends waiting for replication</entry>
+ </row>
</tbody>
</tgroup>
</table>
***************
*** 13939,13944 **** SELECT set_config('log_statement_stats', 'off', false);
--- 13946,13956 ----
subprocess.
</para>
+ <para>
+ <function>pg_wakeup_all_waiters</> signals all the backends waiting
+ for replication, to wake up and complete the transaction.
+ </para>
+
<indexterm>
<primary>backup</primary>
</indexterm>
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 72,78 **** static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN);
static void SyncRepQueueInsert(void);
static int SyncRepGetStandbyPriority(void);
! static int SyncRepWakeQueue(void);
/*
* ===========================================================
--- 72,78 ----
static void SyncRepQueueInsert(void);
static int SyncRepGetStandbyPriority(void);
! static int SyncRepWakeQueue(bool wakeup_all);
/*
* ===========================================================
***************
*** 180,196 **** SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN)
* unlikely to be far enough, yet is possible. Next time we are
* woken we should be more lucky.
*/
! if (XLByteLE(XactCommitLSN, walsndctl->lsn))
{
SHMQueueDelete(&(MyProc->syncrep_links));
LWLockRelease(SyncRepLock);
- /*
- * Reset our waitLSN.
- */
- MyProc->waitLSN.xlogid = 0;
- MyProc->waitLSN.xrecoff = 0;
-
if (new_status)
{
/* Reset ps display */
--- 180,192 ----
* unlikely to be far enough, yet is possible. Next time we are
* woken we should be more lucky.
*/
! if (XLByteLE(XactCommitLSN, walsndctl->lsn) ||
! (MyProc->waitLSN.xlogid == 0 &&
! MyProc->waitLSN.xrecoff == 0))
{
SHMQueueDelete(&(MyProc->syncrep_links));
LWLockRelease(SyncRepLock);
if (new_status)
{
/* Reset ps display */
***************
*** 347,353 **** SyncRepReleaseWaiters(void)
* release up to this location.
*/
walsndctl->lsn = MyWalSnd->flush;
! numprocs = SyncRepWakeQueue();
}
LWLockRelease(SyncRepLock);
--- 343,349 ----
* release up to this location.
*/
walsndctl->lsn = MyWalSnd->flush;
! numprocs = SyncRepWakeQueue(false);
}
LWLockRelease(SyncRepLock);
***************
*** 427,436 **** SyncRepGetStandbyPriority(void)
* to be woken. We don't modify the queue, we leave that for individual
* procs to release themselves.
*
* Must hold SyncRepLock
*/
static int
! SyncRepWakeQueue(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc;
--- 423,434 ----
* to be woken. We don't modify the queue, we leave that for individual
* procs to release themselves.
*
+ * If 'wakeup_all' is true, set the latches of all procs in the queue.
+ *
* Must hold SyncRepLock
*/
static int
! SyncRepWakeQueue(bool wakeup_all)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc;
***************
*** 445,454 **** SyncRepWakeQueue(void)
/*
* Assume the queue is ordered by LSN
*/
! if (XLByteLT(walsndctl->lsn, proc->waitLSN))
return numprocs;
numprocs++;
SetLatch(&(proc->waitLatch));
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
&(proc->syncrep_links),
--- 443,454 ----
/*
* Assume the queue is ordered by LSN
*/
! if (!wakeup_all && XLByteLT(walsndctl->lsn, proc->waitLSN))
return numprocs;
numprocs++;
+ proc->waitLSN.xlogid = 0;
+ proc->waitLSN.xrecoff = 0;
SetLatch(&(proc->waitLatch));
proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
&(proc->syncrep_links),
***************
*** 457,459 **** SyncRepWakeQueue(void)
--- 457,477 ----
return numprocs;
}
+
+ /*
+ * Wake up all the waiting backends
+ */
+ Datum
+ pg_wakeup_all_waiters(PG_FUNCTION_ARGS)
+ {
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to signal other server processes"))));
+
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ SyncRepWakeQueue(true);
+ LWLockRelease(SyncRepLock);
+
+ PG_RETURN_VOID();
+ }
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2869,2874 **** DATA(insert OID = 3821 ( pg_last_xlog_replay_location PGNSP PGUID 12 1 0 0 f f f
--- 2869,2876 ----
DESCR("last xlog replay location");
DATA(insert OID = 3830 ( pg_last_xact_replay_timestamp PGNSP PGUID 12 1 0 0 f f f t f v 0 0 1184 "" _null_ _null_ _null_ _null_ pg_last_xact_replay_timestamp _null_ _null_ _null_ ));
DESCR("timestamp of last replay xact");
+ DATA(insert OID = 3831 ( pg_wakeup_all_waiters PGNSP PGUID 12 1 0 0 f f f t f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_wakeup_all_waiters _null_ _null_ _null_ ));
+ DESCR("wake up all waiters");
DATA(insert OID = 3071 ( pg_xlog_replay_pause PGNSP PGUID 12 1 0 0 f f f t f v 0 0 2278 "" _null_ _null_ _null_ _null_ pg_xlog_replay_pause _null_ _null_ _null_ ));
DESCR("pause xlog replay");
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 34,37 **** extern void SyncRepCleanupAtProcExit(int code, Datum arg);
--- 34,40 ----
extern void SyncRepInitConfig(void);
extern void SyncRepReleaseWaiters(void);
+ /* system administration functions */
+ extern Datum pg_wakeup_all_waiters(PG_FUNCTION_ARGS);
+
#endif /* _SYNCREP_H */
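For illustration only, usage of the proposed function would be a single
superuser call on the primary (a sketch against the patch above, not
committed behaviour):
SELECT pg_wakeup_all_waiters();
-- Every backend queued in SyncRepQueue has its waitLSN cleared and its
-- latch set, so its COMMIT returns without a standby acknowledgement.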
On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
One comment; what about introducing built-in function to wake up all the
waiting backends? When replication connection is closed, if we STONITH
the standby, we can safely (for not physical data loss but logical one)
switch the primary to standalone mode. But there is no way to wake up
the waiting backends for now. Setting synchronous_replication to OFF
and reloading the configuration file doesn't affect the existing waiting
backends. The attached patch introduces the "pg_wakeup_all_waiters"
(better name?) function which wakes up all the backends on the queue.
If unfortunately all connection slots are used by backends waiting for
replication, we cannot execute such a function. So it makes more sense
to introduce something like "pg_ctl standalone" command?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, 2011-03-06 at 14:27 +0900, Fujii Masao wrote:
On Sun, Mar 6, 2011 at 2:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Mar 5, 2011 at 11:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Even though postmaster dies, the waiting backend keeps waiting until
the timeout expires. Instead, the backends should periodically check
whether postmaster is alive, and then they should exit immediately
if it's not alive, as well as other process does? If the timeout is
disabled, such backends would get stuck infinitely.
Will wake them every 60 seconds
I don't really see why sync rep should be responsible for solving this
problem, which is an issue in many other situations as well, only for
itself. In fact I think I'd prefer that it didn't, and that we wait
for a more general solution that will actually fix this problem for
real.
I agree if such a general solution will be committed together with sync rep.
Otherwise, because of sync rep, the backend can easily get stuck *infinitely*.
When postmaster is not alive, all the existing walsenders exit immediately
and no new walsender can appear. So there is no way to release the
waiting backend. I think that some solutions for this issue which is likely to
happen are required.
Completely agree.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote:
On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
One comment; what about introducing built-in function to wake up all the
waiting backends? When replication connection is closed, if we STONITH
the standby, we can safely (for not physical data loss but logical one)
switch the primary to standalone mode. But there is no way to wake up
the waiting backends for now. Setting synchronous_replication to OFF
and reloading the configuration file doesn't affect the existing waiting
backends. The attached patch introduces the "pg_wakeup_all_waiters"
(better name?) function which wakes up all the backends on the queue.
If unfortunately all connection slots are used by backends waiting for
replication, we cannot execute such a function. So it makes more sense
to introduce something like "pg_ctl standalone" command?
Well, there is one way to end the wait: shutdown, or use
pg_terminate_backend().
If you simply end the wait you will get COMMIT messages.
What I would like to do is commit the "safe" patch now. We can then
discuss whether it is safe and desirable to relax some aspects of that
during beta.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sun, 2011-03-06 at 01:58 +0900, Fujii Masao wrote:
On Sun, Mar 6, 2011 at 12:42 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
New comments;
Another one;
+ long timeout = SyncRepGetWaitTimeout();
<snip>
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(wait_start, now, timeout))
+ {
The third argument of TimestampDifferenceExceeds is
expressed in milliseconds. But you wrongly give it in
microseconds.
Just for the record: that code section is now removed in v21
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sun, Mar 6, 2011 at 5:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote:
On Sun, Mar 6, 2011 at 4:51 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
One comment; what about introducing built-in function to wake up all the
waiting backends? When replication connection is closed, if we STONITH
the standby, we can safely (for not physical data loss but logical one)
switch the primary to standalone mode. But there is no way to wake up
the waiting backends for now. Setting synchronous_replication to OFF
and reloading the configuration file doesn't affect the existing waiting
backends. The attached patch introduces the "pg_wakeup_all_waiters"
(better name?) function which wakes up all the backends on the queue.
If unfortunately all connection slots are used by backends waiting for
replication, we cannot execute such a function. So it makes more sense
to introduce something like "pg_ctl standalone" command?
Well, there is one way to end the wait: shutdown, or use
pg_terminate_backend().
Immediate shutdown can release the wait. But smart and fast shutdown
cannot. Also pg_terminate_backend() cannot. Since a backend is waiting
on the latch and InterruptHoldoffCount is not zero, only SetLatch() or
SIGQUIT can cause it to end.
If you simply end the wait you will get COMMIT messages.
What I would like to do is commit the "safe" patch now. We can then
discuss whether it is safe and desirable to relax some aspects of that
during beta.
OK if changing some aspects is acceptable during beta.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:
On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
If unfortunately all connection slots are used by backends waiting for
replication, we cannot execute such a function. So it makes more sense
to introduce something like "pg_ctl standalone" command?
If it is only for shutdown, maybe pg_ctl stop -m standalone?
It's for not only shutdown but also running the primary in standalone mode.
So something like "pg_ctl standalone" is better.
For now I think that pg_ctl command is better than built-in function because
sometimes we might want to wake waiters up even during shutdown in
order to cause shutdown to end. During shutdown, the server doesn't
accept any new connection (even from the standby). So, without something
like "pg_ctl standalone", there is no way to cause shutdown to end.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
El 06/03/2011 03:26, "Simon Riggs" <simon@2ndquadrant.com> escribió:
On Sun, 2011-03-06 at 16:58 +0900, Fujii Masao wrote:
If unfortunately all connection slots are used by backends waiting for
replication, we cannot execute such a function. So it makes more sense
to introduce something like "pg_ctl standalone" command?
Well, there is one way to end the wait: shutdown, or use
pg_terminate_backend().
I disconnected all standbys so the master keeps waiting on commit. Then i
shutdown the master with immediate and got a crash. i wasn't able to trace
that though
--
Jaime Casanova www.2ndQuadrant.com
On Mar 6, 2011, at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:
On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
If unfortunately all connection slots are used by backends waiting for
replication, we cannot execute such a function. So it makes more sense
to introduce something like "pg_ctl standalone" command?
If it is only for shutdown, maybe pg_ctl stop -m standalone?
It's for not only shutdown but also running the primary in standalone mode.
So something like "pg_ctl standalone" is better.For now I think that pg_ctl command is better than built-in function because
sometimes we might want to wake waiters up even during shutdown in
order to cause shutdown to end. During shutdown, the server doesn't
accept any new connection (even from the standby). So, without something
like "pg_ctl standalone", there is no way to cause shutdown to end.
This sounds like an awful hack to work around a bad design. Surely once shutdown reaches a point where new replication connections can no longer be accepted, any standbys hung on commit need to close the connection without responding to the COMMIT, per previous discussion. It's completely unreasonable for sync rep to break the shutdown sequence.
...Robert
On Sun, 2011-03-06 at 16:51 +0900, Fujii Masao wrote:
One comment; what about introducing built-in function to wake up all the
waiting backends? When replication connection is closed, if we STONITH
the standby, we can safely (for not physical data loss but logical one)
switch the primary to standalone mode. But there is no way to wake up
the waiting backends for now. Setting synchronous_replication to OFF
and reloading the configuration file doesn't affect the existing waiting
backends. The attached patch introduces the "pg_wakeup_all_waiters"
(better name?) function which wakes up all the backends on the queue.
Will apply this as a separate commit.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote:
I also got a first > 1000 tps score
The committed version should be even faster. Would appreciate a retest.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 2011-03-07 01:37, Simon Riggs wrote:
On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote:
I also got a first > 1000 tps score
The committed version should be even faster. Would appreciate a retest.
pgbench 5 minute test pgbench -c 10 -M prepared -T 300 test
dbsize was -s 50, 1Gbit Ethernet
1 async standby
tps = 2475.285931 (excluding connections establishing)
2 async standbys
tps = 2333.670561 (excluding connections establishing)
1 sync standby
tps = 1277.082753 (excluding connections establishing)
1 sync, 1 async standby
tps = 1273.317386 (excluding connections establishing)
Hard for me to not revert to superlatives right now! :-)
regards,
Yeb Havinga
On Mon, 2011-03-07 at 14:20 +0100, Yeb Havinga wrote:
On 2011-03-07 01:37, Simon Riggs wrote:
On Sat, 2011-03-05 at 21:11 +0100, Yeb Havinga wrote:
I also got a first > 1000 tps score
The committed version should be even faster. Would appreciate a retest.
pgbench 5 minute test pgbench -c 10 -M prepared -T 300 test
dbsize was -s 50, 1Gbit Ethernet
1 async standby
tps = 2475.285931 (excluding connections establishing)
2 async standbys
tps = 2333.670561 (excluding connections establishing)
1 sync standby
tps = 1277.082753 (excluding connections establishing)
1 sync, 1 async standby
tps = 1273.317386 (excluding connections establishing)
Hard for me to not revert to superlatives right now! :-)
That looks like good news, thanks.
It shows that sync rep is "fairly fast", but it also shows clearly why
you'd want to mix sync and async replication within an application.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Mar 7, 2011 at 4:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mar 6, 2011, at 9:44 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sun, Mar 6, 2011 at 5:02 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:
On Sun, Mar 6, 2011 at 8:58 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
If unfortunately all connection slots are used by backends waiting for
replication, we cannot execute such a function. So it makes more sense
to introduce something like "pg_ctl standalone" command?
If it is only for shutdown, maybe pg_ctl stop -m standalone?
It's for not only shutdown but also running the primary in standalone mode.
So something like "pg_ctl standalone" is better.
For now I think that pg_ctl command is better than built-in function because
sometimes we might want to wake waiters up even during shutdown in
order to cause shutdown to end. During shutdown, the server doesn't
accept any new connection (even from the standby). So, without something
like "pg_ctl standalone", there is no way to cause shutdown to end.
This sounds like an awful hack to work around a bad design. Surely once shutdown reaches a point where new replication connections can no longer be accepted, any standbys hung on commit need to close the connection without responding to the COMMIT, per previous discussion. It's completely unreasonable for sync rep to break the shutdown sequence.
Yeah, let's think about how shutdown should work. I'd like to propose the
following. Thought?
* Smart shutdown
Smart shutdown should wait for all the waiting backends to be acked, and
should not cause them to forcibly exit. But this leads shutdown to get stuck
infinitely if there is no walsender at that time. To enable them to be acked
even in that situation, we need to change postmaster so that it accepts the
replication connection even during smart shutdown (until we reach
PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
connection to cancel backup during smart shutdown. So I don't think that
the idea to accept the replication connection during smart shutdown is so
ugly.
* Fast shutdown
I agree with you about fast shutdown. Fast shutdown should cause all the
backends including waiting ones to exit immediately. At that time, the
non-acked backend should not return the success, according to the
definition of sync rep. So we need to change a backend so that it gets rid
of itself from the waiting queue and exits before returning the success,
when it receives SIGTERM. This change leads the waiting backends to
do the same even when pg_terminate_backend is called. But since
they've not been acked yet, it seems to be reasonable to prevent them
from returning the COMMIT.
Comments? I'll create the patch barring objection.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Yeah, let's think about how shutdown should work. I'd like to propose the
following. Thought?
* Smart shutdown
Smart shutdown should wait for all the waiting backends to be acked, and
should not cause them to forcibly exit. But this leads shutdown to get stuck
infinitely if there is no walsender at that time. To enable them to be acked
even in that situation, we need to change postmaster so that it accepts the
replication connection even during smart shutdown (until we reach
PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
connection to cancel backup during smart shutdown. So I don't think that
the idea to accept the replication connection during smart shutdown is so
ugly.
* Fast shutdown
I agree with you about fast shutdown. Fast shutdown should cause all the
backends including waiting ones to exit immediately. At that time, the
non-acked backend should not return the success, according to the
definition of sync rep. So we need to change a backend so that it gets rid
of itself from the waiting queue and exits before returning the success,
when it receives SIGTERM. This change leads the waiting backends to
do the same even when pg_terminate_backend is called. But since
they've not been acked yet, it seems to be reasonable to prevent them
from returning the COMMIT.
The fast shutdown handling seems fine, but why not just handle smart
shutdown the same way? I don't really like the idea of allowing
replication connections for longer, and the idea that we don't want to
keep waiting for a commit ACK once we're past the point where it's
possible for one to occur seems to apply generically to any shutdown
sequence.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 8, 2011 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
The fast shutdown handling seems fine, but why not just handle smart
shutdown the same way?
Currently, smart shutdown means no new connections; wait until
existing ones close normally. For consistency, it should behave the
same for sync rep.
+1 for Fujii's proposal
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL
Simon Riggs wrote:
On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
postgres=# SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
application_name | state | sync_priority | sync_state
------------------+-----------+---------------+------------
one | STREAMING | 1 | POTENTIAL
two | STREAMING | 2 | SYNC
(2 rows)
Bug! Thanks.
Is there a reason these status are all upper-case?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote:
Simon Riggs wrote:
On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
postgres=# SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
application_name | state | sync_priority | sync_state
------------------+-----------+---------------+------------
one | STREAMING | 1 | POTENTIAL
two | streaming | 2 | sync
(2 rows)
Bug! Thanks.
Is there a reason these status are all upper-case?
NOT AS FAR AS I KNOW.
I'll add it to the list of changes for beta.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Mar 9, 2011 at 9:21 PM, Bruce Momjian <bruce@momjian.us> wrote:
Simon Riggs wrote:
On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
postgres=# SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
application_name | state | sync_priority | sync_state
------------------+-----------+---------------+------------
one | STREAMING | 1 | POTENTIAL
two | STREAMING | 2 | SYNC
(2 rows)
Bug! Thanks.
Is there a reason these status are all upper-case?
Not that I know of.
However, I think that some more fundamental rethinking of the "state"
mechanism may be in order. When Magnus first committed this, it would
say CATCHUP whenever you were behind (even if only momentarily) and
STREAMING if you were caught up. Simon then changed it so that it
says CATCHUP until you catch up the first time, and then STREAMING
afterward (even if you fall behind again). Neither behavior seems
completely adequate to me. I think we should have a way to know
whether we've ever been caught up, and if so when the most recent time
was. So you could then say things like "is the most recent time at
which the standby was caught up within the last 30 seconds?", which
would be a useful thing to monitor, and right now there's no way to do
it. There's also a BACKUP state, but I'm not sure it makes sense to
lump that in with the others. Some day it might be possible to stream
WAL and take a backup at the same time, over the same connection.
Maybe that should be a separate column or something.
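To make that concrete, if pg_stat_replication ever grew something like a
last_caught_up_time column (purely hypothetical; no such column exists
today), the check would be a one-liner:
SELECT application_name,
       now() - last_caught_up_time < interval '30 seconds' AS caught_up_recently
FROM pg_stat_replication;
As it stands there is nothing to put in such a column, which is the point.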
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
was. So you could then say things like "is the most recent time at
which the standby was caught up within the last 30 seconds?", which
would be a useful thing to monitor, and right now there's no way to do
Well in my experience with replication, that's not what I want to
monitor. If the standby is synchronous, then it's not catching up, it's
streaming. If it were not, it would not be a synchronous standby.
When a standby is asynchronous, what I want to monitor is its lag.
So the CATCHUP state is useful to see that a synchronous standby
candidate cannot yet be a synchronous standby. When it has just lost its
synchronous status (and hopefully another standby is now the sync one),
then it's just asynchronous and I want to know its lag.
it. There's also a BACKUP state, but I'm not sure it makes sense to
lump that in with the others. Some day it might be possible to stream
WAL and take a backup at the same time, over the same connection.
Maybe that should be a separate column or something.
BACKUP is still meaningful if you stream WAL at the same time, because
you're certainly *not* applying them while doing the base backup, are
you? So you're not yet a standby, that's what BACKUP means.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Mar 10, 2011 at 2:42 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
was. So you could then say things like "is the most recent time at
which the standby was caught up within the last 30 seconds?", which
would be a useful thing to monitor, and right now there's no way to do
Well in my experience with replication, that's not what I want to
monitor. If the standby is synchronous, then it's not catching up, it's
streaming. If it were not, it would not be a synchronous standby.
When a standby is asynchronous then what I want to monitor is its lag.
So the CATCHUP state is useful to see that a synchronous standby
candidate can not yet be a synchronous standby. When it just lost its
synchronous status (and hopefully another standby is now the sync one),
then it's just asynchronous and I want to know its lag.
Yeah, maybe. The trick is how to measure the lag. I proposed the
above scheme mostly as a way of giving the user some piece of
information that can be measured in seconds rather than WAL position,
but I'm open to better ideas. Monitoring is pretty hard to do at all
in 9.0; in 9.1, we'll be able to tell them how many *bytes* behind
they are, but there's no easy way to figure out what that means in
terms of wall-clock time, which I think would be useful.
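(For reference, the byte-level view we do get in 9.1 is just a comparison
of WAL locations, something like:
SELECT application_name,
       pg_current_xlog_location() AS master_location,
       replay_location
FROM pg_stat_replication;
and since the locations come back as hex strings, turning the difference
into an actual byte count is still a manual exercise; there is certainly
nothing that converts it into seconds.)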
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
they are, but there's no easy way to figure out what that means in
terms of wall-clock time, which I think would be useful.
Jan Wieck had a detailed proposal to make that happen at the last developer
meeting, but then ran out of time to implement it for 9.1, it seems. The
idea was basically to have a ticker in core, an SRF that would associate
txid_snapshot with wall clock time. Lots of good things would come from
that.
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php
Of course if you think that's important enough for you to implement it
between now and beta, that would be great :)
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
they are, but there's no easy way to figure out what that means in
terms of wall-clock time, which I think would be useful.
Jan Wieck had a detailed proposal to make that happen at the last developer
meeting, but then ran out of time to implement it for 9.1 it seems. The
idea was basically to have a ticker in core, an SRF that would associate
txid_snapshot with wall clock time. Lots of good things would come from
that.
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php
Of course if you think that's important enough for you to implement it
between now and beta, that would be great :)
I think that's actually something a little different, and more
complicated, but I do think it'd be useful. I was hoping there was a
simple way to get some kind of time-based information into
pg_stat_replication, but if there isn't, there isn't.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 11, 2011 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
they are, but there's no easy way to figure out what that means in
terms of wall-clock time, which I think would be useful.
Jan Wieck had a detailed proposal to make that happen at the last developer
meeting, but then ran out of time to implement it for 9.1 it seems. The
idea was basically to have a ticker in core, an SRF that would associate
txid_snapshot with wall clock time. Lots of good things would come from
that.
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php
Of course if you think that's important enough for you to implement it
between now and beta, that would be great :)
I think that's actually something a little different, and more
complicated, but I do think it'd be useful. I was hoping there was a
simple way to get some kind of time-based information into
pg_stat_replication, but if there isn't, there isn't.
How about sending the timestamp of the last applied transaction
(i.e., the return value of pg_last_xact_replay_timestamp)
from the standby to the master, and reporting it in
pg_stat_replication? Then you can see the lag by comparing
it with current_timestamp.
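(On the standby itself that comparison is already possible, something like:
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;
The proposal is essentially to make the same number visible from the
master side through pg_stat_replication.)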
But since the last replay timestamp doesn't advance (while the
current timestamp does) when there is no work on the master,
the calculated lag might be unexpectedly large. So, to
calculate the exact lag, I'm thinking that we should introduce a
new function which returns the timestamp of the last transaction
written on the master.
Thought?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Mar 11, 2011 at 7:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Mar 11, 2011 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 10, 2011 at 3:29 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
they are, but there's no easy way to figure out what that means in
terms of wall-clock time, which I think would be useful.
Jan Wieck had a detailed proposal to make that happen at the last developer
meeting, but then ran out of time to implement it for 9.1 it seems. The
idea was basically to have a ticker in core, an SRF that would associate
txid_snapshot with wall clock time. Lots of good things would come from
that.
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01209.php
Of course if you think that's important enough for you to implement it
between now and beta, that would be great :)
I think that's actually something a little different, and more
complicated, but I do think it'd be useful. I was hoping there was a
simple way to get some kind of time-based information into
pg_stat_replication, but if there isn't, there isn't.
How about sending the timestamp of last applied transaction
(i.e., this is the return value of pg_last_xact_replay_timestamp)
from the standby to the master, and reporting it in
pg_stat_replication? Then you can see the lag by comparing
it with current_timestamp.
But since the last replay timestamp doesn't advance (but
current timestamp advances) if there is no work on the master,
the calculated lag might be unexpectedly too large. So, to
calculate the exact lag, I'm thinking that we should introduce
new function which returns the timestamp of the last transaction
written in the master.
Thought?
Hmm... where would we get that value from? And what if no
transactions are running on the master?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 11, 2011 at 10:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
How about sending the timestamp of last applied transaction
(i.e., this is the return value of pg_last_xact_replay_timestamp)
from the standby to the master, and reporting it in
pg_stat_replication? Then you can see the lag by comparing
it with current_timestamp.
But since the last replay timestamp doesn't advance (but
current timestamp advances) if there is no work on the master,
the calculated lag might be unexpectedly too large. So, to
calculate the exact lag, I'm thinking that we should introduce
new function which returns the timestamp of the last transaction
written in the master.
Thought?
Hmm... where would we get that value from?
xl_xact_commit->xact_time (which is set in RecordTransactionCommit)
and xl_xact_abort->xact_time (which is set in RecordTransactionAbort).
And what if no
transactions are running on the master?
In that case, the last write WAL timestamp would become equal to the
last replay WAL timestamp. So we can see that there is no lag.
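To illustrate the arithmetic (both names below are hypothetical; neither
the function nor the column exists yet): with a pg_last_xact_write_timestamp()
on the master and the standby's replayed timestamp reported back into
pg_stat_replication, the lag would simply be
SELECT application_name,
       pg_last_xact_write_timestamp() - replay_timestamp AS lag
FROM pg_stat_replication;
which drops to zero when the master is idle, instead of growing the way a
comparison against current_timestamp would.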
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Mar 11, 2011 at 8:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Mar 11, 2011 at 10:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
How about sending the timestamp of last applied transaction
(i.e., this is the return value of pg_last_xact_replay_timestamp)
from the standby to the master, and reporting it in
pg_stat_replication? Then you can see the lag by comparing
it with current_timestamp.
But since the last replay timestamp doesn't advance (but
current timestamp advances) if there is no work on the master,
the calculated lag might be unexpectedly too large. So, to
calculate the exact lag, I'm thinking that we should introduce
new function which returns the timestamp of the last transaction
written in the master.
Thought?
Hmm... where would we get that value from?
xl_xact_commit->xact_time (which is set in RecordTransactionCommit)
and xl_xact_abort->xact_time (which is set in RecordTransactionAbort).
And what if no
transactions are running on the master?
In that case, the last write WAL timestamp would become equal to the
last replay WAL timestamp. So we can see that there is no lag.
Oh, I see (I think). You're talking about write/replay lag, but I was
thinking of master/slave transmission lag.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 11, 2011 at 09:03:33AM -0500, Robert Haas wrote:
On Fri, Mar 11, 2011 at 8:21 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
In that case, the last write WAL timestamp would become equal to the
last replay WAL timestamp. So we can see that there is no lag.
Oh, I see (I think). You're talking about write/replay lag, but I was
thinking of master/slave transmission lag.
Which are both useful numbers to know: the first tells you how "stale"
queries from a Hot Standby will be, the second tells you the maximum
data loss from a "meteor hits the master" scenario where that slave is
promoted, if I understand all the interactions correctly.
Ross
--
Ross Reedstrom, Ph.D. reedstrm@rice.edu
Systems Engineer & Admin, Research Scientist phone: 713-348-6166
Connexions http://cnx.org fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
* Smart shutdown
Smart shutdown should wait for all the waiting backends to be acked, and
should not cause them to forcibly exit. But this leads shutdown to get stuck
infinitely if there is no walsender at that time. To enable them to be acked
even in that situation, we need to change postmaster so that it accepts the
replication connection even during smart shutdown (until we reach
PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
connection to cancel backup during smart shutdown. So I don't think that
the idea to accept the replication connection during smart shutdown is so
ugly.
* Fast shutdown
I agree with you about fast shutdown. Fast shutdown should cause all the
backends including waiting ones to exit immediately. At that time, the
non-acked backend should not return the success, according to the
definition of sync rep. So we need to change a backend so that it gets rid
of itself from the waiting queue and exits before returning the success,
when it receives SIGTERM. This change leads the waiting backends to
do the same even when pg_terminate_backend is called. But since
they've not been acked yet, it seems to be reasonable to prevent them
from returning the COMMIT.
Comments? I'll create the patch barring objection.
The fast smart shutdown part of this problem has been addressed. The
smart shutdown case still needs work, and I think the consensus was
that your proposal above was the best way to go with it.
Do you still want to work up a patch for this? If so, I can review.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 18, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
* Smart shutdown
Smart shutdown should wait for all the waiting backends to be acked, and
should not cause them to forcibly exit. But this leads shutdown to get stuck
infinitely if there is no walsender at that time. To enable them to be acked
even in that situation, we need to change postmaster so that it accepts the
replication connection even during smart shutdown (until we reach
PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
connection to cancel backup during smart shutdown. So I don't think that
the idea to accept the replication connection during smart shutdown is so
ugly.
* Fast shutdown
I agree with you about fast shutdown. Fast shutdown should cause all the
backends including waiting ones to exit immediately. At that time, the
non-acked backend should not return the success, according to the
definition of sync rep. So we need to change a backend so that it gets rid
of itself from the waiting queue and exits before returning the success,
when it receives SIGTERM. This change leads the waiting backends to
do the same even when pg_terminate_backend is called. But since
they've not been acked yet, it seems to be reasonable to prevent them
from returning the COMMIT.
Comments? I'll create the patch barring objection.
The fast smart shutdown part of this problem has been addressed. The
Ugh. I mean "the fast shutdown", of course, not "the fast smart
shutdown". Anyway, point is:
fast shutdown now OK
smart shutdown still not OK
do you want to write a patch?
:-)
smart shutdown case still needs work, and I think the consensus was
that your proposal above was the best way to go with it.
Do you still want to work up a patch for this? If so, I can review.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 19, 2011 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 18, 2011 at 10:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
* Smart shutdown
Smart shutdown should wait for all the waiting backends to be acked, and
should not cause them to forcibly exit. But this leads shutdown to get stuck
infinitely if there is no walsender at that time. To enable them to be acked
even in that situation, we need to change postmaster so that it accepts the
replication connection even during smart shutdown (until we reach
PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
connection to cancel backup during smart shutdown. So I don't think that
the idea to accept the replication connection during smart shutdown is so
ugly.
* Fast shutdown
I agree with you about fast shutdown. Fast shutdown should cause all the
backends including waiting ones to exit immediately. At that time, the
non-acked backend should not return the success, according to the
definition of sync rep. So we need to change a backend so that it gets rid
of itself from the waiting queue and exits before returning the success,
when it receives SIGTERM. This change leads the waiting backends to
do the same even when pg_terminate_backend is called. But since
they've not been acked yet, it seems to be reasonable to prevent them
from returning the COMMIT.
Comments? I'll create the patch barring objection.
The fast smart shutdown part of this problem has been addressed. The
Ugh. I mean "the fast shutdown", of course, not "the fast smart
shutdown". Anyway, point is:fast shutdown now OK
smart shutdown still not OK
do you want to write a patch?
:-)
smart shutdown case still needs work, and I think the consensus was
that your proposal above was the best way to go with it.
Do you still want to work up a patch for this? If so, I can review.
Sure. Will do.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Do you still want to work up a patch for this? If so, I can review.
Sure. Will do.
The attached patch allows standby servers to connect during smart shutdown
in order to wake up backends waiting for sync rep.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
allow_standby_to_connect_during_smart_shutdown_v1.patchapplication/octet-stream; name=allow_standby_to_connect_during_smart_shutdown_v1.patchDownload
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 959,964 **** synchronous_replication = on
--- 959,970 ----
</para>
<para>
+ When a smart shutdown is requested, new replication connections are
+ allowed in order to transfer all outstanding WAL records to standby
+ servers and wake up backends waiting for synchronous replication.
+ </para>
+
+ <para>
Users will stop waiting if a fast shutdown is requested, though the
server does not fully shutdown until all outstanding WAL records are
transferred to standby servers.
*** a/doc/src/sgml/runtime.sgml
--- b/doc/src/sgml/runtime.sgml
***************
*** 1386,1392 **** echo -17 > /proc/self/oom_adj
until online backup mode is no longer active. While backup mode is
active, new connections will still be allowed, but only to superusers
(this exception allows a superuser to connect to terminate
! online backup mode). If the server is in recovery when a smart
shutdown is requested, recovery and streaming replication will be
stopped only after all regular sessions have terminated.
</para>
--- 1386,1396 ----
until online backup mode is no longer active. While backup mode is
active, new connections will still be allowed, but only to superusers
(this exception allows a superuser to connect to terminate
! online backup mode). While regular sessions are open, new replication
! connections will still be allowed (this exception allows WAL sender
! process to send all outstanding WAL records to standby servers and
! wake up backends waiting for synchronous replication).
! If the server is in recovery when a smart
shutdown is requested, recovery and streaming replication will be
stopped only after all regular sessions have terminated.
</para>
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 248,254 **** static bool RecoveryError = false; /* T if WAL recovery failed */
*
* Normal child backends can only be launched when we are in PM_RUN or
* PM_HOT_STANDBY state. (We also allow launch of normal
! * child backends in PM_WAIT_BACKUP state, but only for superusers.)
* In other states we handle connection requests by launching "dead_end"
* child processes, which will simply send the client an error message and
* quit. (We track these in the BackendList so that we can know when they
--- 248,255 ----
*
* Normal child backends can only be launched when we are in PM_RUN or
* PM_HOT_STANDBY state. (We also allow launch of normal
! * child backends in PM_WAIT_BACKUP_AND_SYNCREP state, but only for
! * superusers and walsenders.)
* In other states we handle connection requests by launching "dead_end"
* child processes, which will simply send the client an error message and
* quit. (We track these in the BackendList so that we can know when they
***************
*** 276,282 **** typedef enum
PM_RECOVERY, /* in archive recovery mode */
PM_HOT_STANDBY, /* in hot standby mode */
PM_RUN, /* normal "database is alive" state */
! PM_WAIT_BACKUP, /* waiting for online backup mode to end */
PM_WAIT_READONLY, /* waiting for read only backends to exit */
PM_WAIT_BACKENDS, /* waiting for live backends to exit */
PM_SHUTDOWN, /* waiting for bgwriter to do shutdown ckpt */
--- 277,284 ----
PM_RECOVERY, /* in archive recovery mode */
PM_HOT_STANDBY, /* in hot standby mode */
PM_RUN, /* normal "database is alive" state */
! PM_WAIT_BACKUP_AND_SYNCREP, /* waiting for online backup mode and regular
! * backends (waiting for sync rep) to end */
PM_WAIT_READONLY, /* waiting for read only backends to exit */
PM_WAIT_BACKENDS, /* waiting for live backends to exit */
PM_SHUTDOWN, /* waiting for bgwriter to do shutdown ckpt */
***************
*** 1850,1856 **** retry1:
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
errmsg("sorry, too many clients already")));
break;
! case CAC_WAITBACKUP:
/* OK for now, will check in InitPostgres */
break;
case CAC_OK:
--- 1852,1858 ----
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
errmsg("sorry, too many clients already")));
break;
! case CAC_WAIT_BACKUP_AND_SYNCREP:
/* OK for now, will check in InitPostgres */
break;
case CAC_OK:
***************
*** 1934,1949 **** canAcceptConnections(void)
* Can't start backends when in startup/shutdown/inconsistent recovery
* state.
*
! * In state PM_WAIT_BACKUP only superusers can connect (this must be
! * allowed so that a superuser can end online backup mode); we return
! * CAC_WAITBACKUP code to indicate that this must be checked later.
! * Note that neither CAC_OK nor CAC_WAITBACKUP can safely be returned
! * until we have checked for too many children.
*/
if (pmState != PM_RUN)
{
! if (pmState == PM_WAIT_BACKUP)
! result = CAC_WAITBACKUP; /* allow superusers only */
else if (Shutdown > NoShutdown)
return CAC_SHUTDOWN; /* shutdown is pending */
else if (!FatalError &&
--- 1936,1952 ----
* Can't start backends when in startup/shutdown/inconsistent recovery
* state.
*
! * In PM_WAIT_BACKUP_AND_SYNCREP state only superusers and standby servers
! * can connect (this must be allowed so that a superuser can end online
! * backup mode and walsender can wake up backends waiting for sync rep);
! * we return CAC_WAIT_BACKUP_AND_SYNCREP code to indicate that this must
! * be checked later. Note that neither CAC_OK nor CAC_WAIT_BACKUP_AND_SYNCREP
! * can safely be returned until we have checked for too many children.
*/
if (pmState != PM_RUN)
{
! if (pmState == PM_WAIT_BACKUP_AND_SYNCREP)
! result = CAC_WAIT_BACKUP_AND_SYNCREP; /* allow superusers and walsenders only */
else if (Shutdown > NoShutdown)
return CAC_SHUTDOWN; /* shutdown is pending */
else if (!FatalError &&
***************
*** 2214,2220 **** pmdie(SIGNAL_ARGS)
* and walreceiver processes.
*/
pmState = (pmState == PM_RUN) ?
! PM_WAIT_BACKUP : PM_WAIT_READONLY;
}
/*
--- 2217,2223 ----
* and walreceiver processes.
*/
pmState = (pmState == PM_RUN) ?
! PM_WAIT_BACKUP_AND_SYNCREP : PM_WAIT_READONLY;
}
/*
***************
*** 2249,2255 **** pmdie(SIGNAL_ARGS)
pmState = PM_WAIT_BACKENDS;
}
else if (pmState == PM_RUN ||
! pmState == PM_WAIT_BACKUP ||
pmState == PM_WAIT_READONLY ||
pmState == PM_WAIT_BACKENDS ||
pmState == PM_HOT_STANDBY)
--- 2252,2258 ----
pmState = PM_WAIT_BACKENDS;
}
else if (pmState == PM_RUN ||
! pmState == PM_WAIT_BACKUP_AND_SYNCREP ||
pmState == PM_WAIT_READONLY ||
pmState == PM_WAIT_BACKENDS ||
pmState == PM_HOT_STANDBY)
***************
*** 2828,2834 **** HandleChildCrash(int pid, int exitstatus, const char *procname)
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
! pmState == PM_WAIT_BACKUP ||
pmState == PM_WAIT_READONLY ||
pmState == PM_SHUTDOWN)
pmState = PM_WAIT_BACKENDS;
--- 2831,2837 ----
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
! pmState == PM_WAIT_BACKUP_AND_SYNCREP ||
pmState == PM_WAIT_READONLY ||
pmState == PM_SHUTDOWN)
pmState = PM_WAIT_BACKENDS;
***************
*** 2896,2907 **** LogChildExit(int lev, const char *procname, int pid, int exitstatus)
static void
PostmasterStateMachine(void)
{
! if (pmState == PM_WAIT_BACKUP)
{
/*
! * PM_WAIT_BACKUP state ends when online backup mode is not active.
*/
! if (!BackupInProgress())
pmState = PM_WAIT_BACKENDS;
}
--- 2899,2911 ----
static void
PostmasterStateMachine(void)
{
! if (pmState == PM_WAIT_BACKUP_AND_SYNCREP)
{
/*
! * PM_WAIT_BACKUP_AND_SYNCREP state ends when online backup mode is
! * not active and there is no regular backend waiting for sync rep.
*/
! if (!BackupInProgress() && CountChildren(BACKEND_TYPE_NORMAL) == 0)
pmState = PM_WAIT_BACKENDS;
}
***************
*** 3233,3239 **** BackendStartup(Port *port)
/* Pass down canAcceptConnections state */
port->canAcceptConnections = canAcceptConnections();
bn->dead_end = (port->canAcceptConnections != CAC_OK &&
! port->canAcceptConnections != CAC_WAITBACKUP);
/*
* Unless it's a dead_end child, assign it a child slot number
--- 3237,3243 ----
/* Pass down canAcceptConnections state */
port->canAcceptConnections = canAcceptConnections();
bn->dead_end = (port->canAcceptConnections != CAC_OK &&
! port->canAcceptConnections != CAC_WAIT_BACKUP_AND_SYNCREP);
/*
* Unless it's a dead_end child, assign it a child slot number
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***************
*** 608,628 **** InitPostgres(const char *in_dbname, Oid dboid, const char *username,
}
/*
! * If we're trying to shut down, only superusers can connect, and new
! * replication connections are not allowed.
*/
! if ((!am_superuser || am_walsender) &&
MyProcPort != NULL &&
! MyProcPort->canAcceptConnections == CAC_WAITBACKUP)
{
! if (am_walsender)
! ereport(FATAL,
! (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("new replication connections are not allowed during database shutdown")));
! else
! ereport(FATAL,
! (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("must be superuser to connect during database shutdown")));
}
/*
--- 608,624 ----
}
/*
! * If we're trying to shut down, only superusers and standby servers
! * can connect.
*/
! if (!am_superuser &&
! !am_walsender &&
MyProcPort != NULL &&
! MyProcPort->canAcceptConnections == CAC_WAIT_BACKUP_AND_SYNCREP)
{
! ereport(FATAL,
! (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("must be superuser or standby server to connect during database shutdown")));
}
/*
*** a/src/include/libpq/libpq-be.h
--- b/src/include/libpq/libpq-be.h
***************
*** 73,79 **** typedef struct
typedef enum CAC_state
{
CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
! CAC_WAITBACKUP
} CAC_state;
--- 73,79 ----
typedef enum CAC_state
{
CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
! CAC_WAIT_BACKUP_AND_SYNCREP
} CAC_state;
On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Do you still want to work up a patch for this? If so, I can review.
Sure. Will do.
The attached patch allows standby servers to connect during smart shutdown
in order to wake up backends waiting for sync rep.
I think that is possibly OK, but the big problem is the lack of a
clear set of comments about how the states are supposed to interact
that allow these changes to be validated.
That state isn't down to you, but I think we need that clear so this
change is more obviously correct than it is now.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 24, 2011 at 8:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Do you still want to work up a patch for this? If so, I can review.
Sure. Will do.
The attached patch allows standby servers to connect during smart shutdown
in order to wake up backends waiting for sync rep.
I think that is possibly OK, but the big problem is the lack of a
clear set of comments about how the states are supposed to interact
that allow these changes to be validated.
That state isn't down to you, but I think we need that clear so this
change is more obviously correct than it is now.
src/backend/replication/README needs to be updated to make that clear?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Mar 24, 2011 at 11:53 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Mar 24, 2011 at 8:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Mar 24, 2011 at 11:17 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Mar 23, 2011 at 5:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Do you still want to work up a patch for this? If so, I can review.
Sure. Will do.
The attached patch allows standby servers to connect during smart shutdown
in order to wake up backends waiting for sync rep.
I think that is possibly OK, but the big problem is the lack of a
clear set of comments about how the states are supposed to interact
that allow these changes to be validated.
That state isn't down to you, but I think we need that clear so this
change is more obviously correct than it is now.
src/backend/replication/README needs to be updated to make that clear?
Not sure where we'd put them.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote:
Simon Riggs wrote:
On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
postgres=# SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
application_name | state | sync_priority | sync_state
------------------+-----------+---------------+------------
one | STREAMING | 1 | POTENTIAL
two | STREAMING | 2 | SYNC
(2 rows)
Bug! Thanks.
Is there a reason these statuses are all upper-case?
NOT AS FAR AS I KNOW.
I'll add it to the list of changes for beta.
The attached patch lowercases the labels displayed in the view above.
(The example above was originally all upper-case.)
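For illustration, with the patch applied the view reports the labels in
lower case, e.g.:
postgres=# SELECT application_name, state, sync_state FROM pg_stat_replication;
application_name | state | sync_state
------------------+-----------+------------
one | streaming | sync
(1 row)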
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Attachments:
/rtmp/labeltext/x-diffDownload
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index af3c95a..470e6d1
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** WalSndGetStateString(WalSndState state)
*** 1350,1362 ****
switch (state)
{
case WALSNDSTATE_STARTUP:
! return "STARTUP";
case WALSNDSTATE_BACKUP:
! return "BACKUP";
case WALSNDSTATE_CATCHUP:
! return "CATCHUP";
case WALSNDSTATE_STREAMING:
! return "STREAMING";
}
return "UNKNOWN";
}
--- 1350,1362 ----
switch (state)
{
case WALSNDSTATE_STARTUP:
! return "startup";
case WALSNDSTATE_BACKUP:
! return "backup";
case WALSNDSTATE_CATCHUP:
! return "catchup";
case WALSNDSTATE_STREAMING:
! return "streaming";
}
return "UNKNOWN";
}
*************** pg_stat_get_wal_senders(PG_FUNCTION_ARGS
*** 1501,1511 ****
* informational, not different from priority.
*/
if (sync_priority[i] == 0)
! values[7] = CStringGetTextDatum("ASYNC");
else if (i == sync_standby)
! values[7] = CStringGetTextDatum("SYNC");
else
! values[7] = CStringGetTextDatum("POTENTIAL");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
--- 1501,1511 ----
* informational, not different from priority.
*/
if (sync_priority[i] == 0)
! values[7] = CStringGetTextDatum("async");
else if (i == sync_standby)
! values[7] = CStringGetTextDatum("sync");
else
! values[7] = CStringGetTextDatum("potential");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
Bruce Momjian wrote:
Simon Riggs wrote:
On Wed, 2011-03-09 at 21:21 -0500, Bruce Momjian wrote:
Simon Riggs wrote:
On Fri, 2011-03-04 at 23:15 +0900, Fujii Masao wrote:
postgres=# SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
application_name | state | sync_priority | sync_state
------------------+-----------+---------------+------------
one | STREAMING | 1 | POTENTIAL
two | STREAMING | 2 | SYNC
(2 rows)
Bug! Thanks.
Is there a reason these statuses are all upper-case?
NOT AS FAR AS I KNOW.
I'll add it to the list of changes for beta.
The attached patch lowercases the labels displayed in the view above.
(The example above was originally all upper-case.)
Applied.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +