Support for N synchronous standby servers
Hi all,
Please find attached a patch adding support for synchronous replication
with multiple standby servers. This is controlled by a new GUC parameter
called synchronous_standby_num, which makes the server wait for
transaction commit confirmation from the first N standbys defined in
synchronous_standby_names. The implementation is quite straightforward
and only needed a couple of modifications in walsender.c, for
pg_stat_get_wal_senders, and in syncrep.c.
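For instance, to wait for the first two standbys listed, the primary could
be configured like this (node_5433 and node_5434 being just example
application_name values):

    synchronous_standby_names = 'node_5433,node_5434'
    synchronous_standby_num = 2

With such a setup, a commit only returns once both of those standbys have
confirmed the commit record.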
When a commit is cancelled manually by the user, or when ProcDiePending
shows up, the message returned to the user does not list the walsenders
on which the commit has not been confirmed, even though it may have been
partially confirmed. I have not done anything about that, but let me know
if it would be useful. It would need a scan of the walsenders to get
their application_name.
Thanks,
--
Michael
Attachments:
0001-syncrep_multi_standbys.patch (text/x-diff)
From 3dfff90032c38daba43e1e0c4d3221053d6386ac Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Sat, 9 Aug 2014 14:49:24 +0900
Subject: [PATCH] Add parameter synchronous_standby_num
This makes it possible to support synchronous replication on a number of standby
nodes equal to the new parameter. The synchronous standbys are chosen in the
order they are listed in synchronous_standby_names.
---
doc/src/sgml/config.sgml | 32 ++++++++++++---
doc/src/sgml/high-availability.sgml | 18 ++++-----
src/backend/replication/syncrep.c | 81 ++++++++++++++++++++++++++++++-------
src/backend/replication/walsender.c | 74 ++++++++++++++++++++++++++++-----
src/backend/utils/misc/guc.c | 10 +++++
src/include/replication/syncrep.h | 1 +
6 files changed, 175 insertions(+), 41 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index be5c25b..c40de16 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2586,12 +2586,13 @@ include_dir 'conf.d'
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this list
- that is both currently connected and streaming data in real-time
- (as shown by a state of <literal>streaming</literal> in the
+ At any one time there will be a number of active synchronous standbys
+ defined by <varname>synchronous_standby_num</>; transactions waiting
+ for commit will be allowed to proceed after those standby servers
+ confirm receipt of their data. The synchronous standbys will be
+ the first entries named in this list that are both currently connected
+ and streaming data in real-time (as shown by a state of
+ <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
@@ -2627,6 +2628,25 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+ <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>, as described in
+ <xref linkend="synchronous-replication">, and listed as the first
+ elements of <xref linkend="guc-synchronous-standby-names">.
+ </para>
+ <para>
+ Default value is 1.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d249959..085d51b 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1081,12 +1081,12 @@ primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
- If the standby is the first matching standby, as specified in
- <varname>synchronous_standby_names</> on the primary, the reply
- messages from that standby will be used to wake users waiting for
- confirmation that the commit record has been received. These parameters
- allow the administrator to specify which standby servers should be
- synchronous standbys. Note that the configuration of synchronous
+ If the standby is among the first <varname>synchronous_standby_num</> matching
+ standbys, as specified in <varname>synchronous_standby_names</> on the
+ primary, the reply messages from those standbys will be used to wake users
+ waiting for confirmation that the commit record has been received. These
+ parameters allow the administrator to specify which standby servers should
+ be synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
@@ -1169,9 +1169,9 @@ primary_slot_name = 'node_a_slot'
The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
- The first named standby will be used as the synchronous standby. Standbys
- listed after this will take over the role of synchronous standby if the
- first one should fail.
+ The first <varname>synchronous_standby_num</> named standbys will be used as
+ the synchronous standbys. Standbys listed after this will take over the role
+ of synchronous standby if one of them should fail.
</para>
<para>
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index aa54bfb..524ff6c 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -59,6 +59,7 @@
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+int synchronous_standby_num = 1;
#define SyncStandbysDefined() \
(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
@@ -206,7 +207,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
@@ -223,7 +224,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
SyncRepCancelWait();
break;
}
@@ -368,11 +369,15 @@ void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
- volatile WalSnd *syncWalSnd = NULL;
+ volatile WalSnd *syncWalSnd[synchronous_standby_num];
int numwrite = 0;
int numflush = 0;
int priority = 0;
+ int num_sync = 0;
int i;
+ bool found = false;
+
+ syncWalSnd[0] = NULL;
/*
* If this WALSender is serving a standby that is not on the list of
@@ -388,7 +393,7 @@ SyncRepReleaseWaiters(void)
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
- * then we use the first mentioned standby. If you change this, also
+ * then we use the first mentioned standbys. If you change this, also
* change pg_stat_get_wal_senders().
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
@@ -398,33 +403,79 @@ SyncRepReleaseWaiters(void)
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &walsndctl->walsnds[i];
- if (walsnd->pid != 0 &&
- walsnd->state == WALSNDSTATE_STREAMING &&
- walsnd->sync_standby_priority > 0 &&
- (priority == 0 ||
- priority > walsnd->sync_standby_priority) &&
- !XLogRecPtrIsInvalid(walsnd->flush))
+ /* Leave if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Leave if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Leave if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ num_sync == synchronous_standby_num)
+ continue;
+
+ /* Leave if invalid flush position */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now.
+ */
+
+ if (num_sync == synchronous_standby_num)
+ {
+ int j;
+
+ for (j = 0; j < num_sync; j++)
+ {
+ if (syncWalSnd[j]->sync_standby_priority == priority)
+ {
+ syncWalSnd[j] = walsnd;
+ break;
+ }
+ }
+ }
+ else
{
- priority = walsnd->sync_standby_priority;
- syncWalSnd = walsnd;
+ syncWalSnd[num_sync] = walsnd;
+ num_sync++;
}
+
+ /* Update priority for next tracking */
+ priority = walsnd->sync_standby_priority;
}
/*
* We should have found ourselves at least.
*/
- Assert(syncWalSnd);
+ Assert(syncWalSnd[0]);
/*
- * If we aren't managing the highest priority standby then just leave.
+ * If we aren't managing one of the highest priority standbys then just leave.
*/
- if (syncWalSnd != MyWalSnd)
+ for (i = 0; i < num_sync; i++)
+ {
+ if (syncWalSnd[i] == MyWalSnd)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ /* We are definitely not one of the chosen... */
+ if (!found)
{
LWLockRelease(SyncRepLock);
announce_next_takeover = true;
return;
}
+
/*
* Set the lsn first so that when we wake backends they will release up to
* this location.
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3189793..8c74c86 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2734,9 +2734,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContext oldcontext;
int *sync_priority;
int priority = 0;
- int sync_standby = -1;
+ int sync_standbys[max_wal_senders];
+ int num_sync = 0;
int i;
+ sync_standbys[0] = -1;
+
/* check to see if caller supports us returning a tuplestore */
if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
ereport(ERROR,
@@ -2784,15 +2787,50 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
0 : walsnd->sync_standby_priority;
- if (walsnd->state == WALSNDSTATE_STREAMING &&
- walsnd->sync_standby_priority > 0 &&
- (priority == 0 ||
- priority > walsnd->sync_standby_priority) &&
- !XLogRecPtrIsInvalid(walsnd->flush))
+ /* Leave if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Leave if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Leave if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ num_sync == synchronous_standby_num)
+ continue;
+
+ /* Leave if invalid flush position */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now.
+ */
+ if (num_sync == synchronous_standby_num)
+ {
+ int j;
+ for (j = 0; j < num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[j]];
+ if (walsndloc->sync_standby_priority == priority)
+ {
+ sync_standbys[j] = i;
+ break;
+ }
+ }
+ }
+ else
{
- priority = walsnd->sync_standby_priority;
- sync_standby = i;
+ sync_standbys[num_sync] = i;
+ num_sync++;
}
+
+ /* Update priority for next tracking */
+ priority = walsnd->sync_standby_priority;
}
}
LWLockRelease(SyncRepLock);
@@ -2856,10 +2894,24 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
- else if (i == sync_standby)
- values[7] = CStringGetTextDatum("sync");
else
- values[7] = CStringGetTextDatum("potential");
+ {
+ int j;
+ bool found = false;
+
+ for (j = 0; j < num_sync; j++)
+ {
+ /* Found that this node is one in sync */
+ if (i == sync_standbys[j])
+ {
+ values[7] = CStringGetTextDatum("sync");
+ found = true;
+ break;
+ }
+ }
+ if (!found)
+ values[7] = CStringGetTextDatum("potential");
+ }
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6c52db8..73523db 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2551,6 +2551,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"synchronous_standby_num", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Number of potential synchronous standbys."),
+ NULL
+ },
+ &synchronous_standby_num,
+ 1, 1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 7eeaf3b..da1cf7c 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -33,6 +33,7 @@
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+extern int synchronous_standby_num;
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
--
2.0.4
On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Hi all,
Please find attached a patch adding support for synchronous replication
with multiple standby servers. This is controlled by a new GUC parameter
called synchronous_standby_num, which makes the server wait for
transaction commit confirmation from the first N standbys defined in
synchronous_standby_names. The implementation is quite straightforward
and only needed a couple of modifications in walsender.c, for
pg_stat_get_wal_senders, and in syncrep.c.
Great! This is really the feature that I want.
Though I forget why we missed this feature when we added the
synchronous replication feature, it may be worth reading the old
discussion, which may point out potential problems with N sync standbys.
I just tested this feature with synchronous_standby_num = 2.
I started up only one synchronous standby and ran a write
transaction. The transaction completed successfully, i.e., it
didn't wait for two standbys. This is probably a bug in the patch.
Also, you forgot to add a line for synchronous_standby_num
to postgresql.conf.sample.
Regards,
--
Fujii Masao
On Mon, Aug 11, 2014 at 1:31 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Great! This is really the feature that I want.
Though I forget why we missed this feature when we added the
synchronous replication feature, it may be worth reading the old
discussion, which may point out potential problems with N sync standbys.
Sure, I'll double check. Thanks for your comments.
I just tested this feature with synchronous_standby_num = 2.
I started up only one synchronous standby and ran a write
transaction. The transaction completed successfully, i.e., it
didn't wait for two standbys. This is probably a bug in the patch.
Oh OK, yes this is a bug in what I did. The number of standbys to wait
for should take precedence over the number of standbys found in the list
of active WAL senders. I changed the patch to take that behavior into
account. So, for example, if you have only one sync standby connected
and synchronous_standby_num = 2, the client waits indefinitely.
Also, you forgot to add a line for synchronous_standby_num
to postgresql.conf.sample.
Yep, right.
On top of that, I refactored the code so that pg_stat_get_wal_senders
and SyncRepReleaseWaiters rely on a single API to get the list of
synchronous standbys. This reduces code duplication, duplication that
already exists in HEAD...
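To give an idea, both callers now follow roughly this pattern (a
simplified sketch of what the attached patch does, not the exact code;
variable names differ slightly):

    int        *sync_standbys;
    int         num_sync = 0;
    bool        am_sync = false;
    int         i;

    LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
    sync_standbys = SyncRepGetSynchronousNodes(&num_sync);

    /* Check if this walsender is one of the standbys currently in sync */
    for (i = 0; i < num_sync; i++)
    {
        if (&WalSndCtl->walsnds[sync_standbys[i]] == MyWalSnd)
        {
            am_sync = true;
            break;
        }
    }

    /*
     * In SyncRepReleaseWaiters, leave without waking anybody if we are
     * not in sync, or if fewer standbys than synchronous_standby_num are
     * in sync, so that backends keep waiting.
     */
    if (!am_sync || num_sync < synchronous_standby_num)
    {
        LWLockRelease(SyncRepLock);
        pfree(sync_standbys);
        return;
    }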
Regards,
--
Michael
Attachments:
20140811_multi_syncrep_v2.patch (text/x-patch)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2586,2597 **** include_dir 'conf.d'
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
! (as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
--- 2586,2598 ----
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be a number of active synchronous standbys
! defined by <varname>synchronous_standby_num</>; transactions waiting
! for commit will be allowed to proceed after those standby servers
! confirm receipt of their data. The synchronous standbys will be
! the first entries named in this list that are both currently connected
! and streaming data in real-time (as shown by a state of
! <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
***************
*** 2627,2632 **** include_dir 'conf.d'
--- 2628,2652 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+ <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>, as described in
+ <xref linkend="synchronous-replication">, and listed as the first
+ elements of <xref linkend="guc-synchronous-standby-names">.
+ </para>
+ <para>
+ Default value is 1.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
<indexterm>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1081,1092 **** primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is the first matching standby, as specified in
! <varname>synchronous_standby_names</> on the primary, the reply
! messages from that standby will be used to wake users waiting for
! confirmation that the commit record has been received. These parameters
! allow the administrator to specify which standby servers should be
! synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
--- 1081,1092 ----
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is among the first <varname>synchronous_standby_num</> matching
! standbys, as specified in <varname>synchronous_standby_names</> on the
! primary, the reply messages from those standbys will be used to wake users
! waiting for confirmation that the commit record has been received. These
! parameters allow the administrator to specify which standby servers should
! be synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
***************
*** 1169,1177 **** primary_slot_name = 'node_a_slot'
The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first named standby will be used as the synchronous standby. Standbys
! listed after this will take over the role of synchronous standby if the
! first one should fail.
</para>
<para>
--- 1169,1177 ----
The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first <varname>synchronous_standby_num</> named standbys will be used as
! the synchronous standbys. Standbys listed after this will take over the role
! of synchronous standby if one of them should fail.
</para>
<para>
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 5,11 ****
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the sync standby.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
--- 5,11 ----
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the synchronous standbys.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
***************
*** 59,64 ****
--- 59,65 ----
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+ int synchronous_standby_num = 1;
#define SyncStandbysDefined() \
(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
***************
*** 206,212 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
--- 207,213 ----
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
***************
*** 223,229 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
SyncRepCancelWait();
break;
}
--- 224,230 ----
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
SyncRepCancelWait();
break;
}
***************
*** 357,365 **** SyncRepInitConfig(void)
}
}
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standby-releases-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
--- 358,442 ----
}
}
+
+ /*
+ * Obtain a palloc'd array containing positions of standbys currently
+ * considered as synchronous. Caller is responsible for freeing the
+ * data obtained.
+ * Callers of this function should also hold the necessary lock on
+ * SyncRepLock.
+ */
+ int *
+ SyncRepGetSynchronousNodes(int *num_sync)
+ {
+ int *sync_standbys;
+ int priority = 0;
+ int i;
+
+ /* Make enough room */
+ sync_standbys = (int *) palloc(synchronous_standby_num * sizeof(int));
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* Use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ /* Process to next if not active */
+ if (walsnd->pid == 0)
+ continue;
+
+ /* Process to next if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Process to next one if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Process to next one if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ *num_sync == synchronous_standby_num)
+ continue;
+
+ /* Process to next one if flush position is invalid */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now.
+ */
+ if (*num_sync == synchronous_standby_num)
+ {
+ int j;
+ for (j = 0; j < *num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[j]];
+ if (walsndloc->sync_standby_priority == priority)
+ {
+ sync_standbys[j] = i;
+ break;
+ }
+ }
+ }
+ else
+ {
+ sync_standbys[*num_sync] = i;
+ (*num_sync)++;
+ }
+
+ /* Update priority for next tracking */
+ priority = walsnd->sync_standby_priority;
+ }
+
+ return sync_standbys;
+ }
+
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standbys-release-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
***************
*** 368,378 **** void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
! int priority = 0;
int i;
/*
* If this WALSender is serving a standby that is not on the list of
--- 445,456 ----
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! int *sync_standbys;
int numwrite = 0;
int numflush = 0;
! int num_sync = 0;
int i;
+ bool found = false;
/*
* If this WALSender is serving a standby that is not on the list of
***************
*** 388,427 **** SyncRepReleaseWaiters(void)
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standby. If you change this, also
! * change pg_stat_get_wal_senders().
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! for (i = 0; i < max_wal_senders; i++)
{
! /* use volatile pointer to prevent code rearrangement */
! volatile WalSnd *walsnd = &walsndctl->walsnds[i];
!
! if (walsnd->pid != 0 &&
! walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
{
! priority = walsnd->sync_standby_priority;
! syncWalSnd = walsnd;
}
}
/*
! * We should have found ourselves at least.
*/
! Assert(syncWalSnd);
/*
! * If we aren't managing the highest priority standby then just leave.
*/
! if (syncWalSnd != MyWalSnd)
{
LWLockRelease(SyncRepLock);
! announce_next_takeover = true;
return;
}
--- 466,516 ----
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standbys.
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
! /*
! * We should have found ourselves at least.
! */
! Assert(num_sync > 0);
!
! /*
! * If we aren't managing one of the standbys with highest priority
! * then just leave.
! */
! for (i = 0; i < num_sync; i++)
{
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
! if (walsndloc == MyWalSnd)
{
! found = true;
! break;
}
}
/*
! * We are definitely not one of the chosen... but we could be at the
! * next takeover.
*/
! if (!found)
! {
! LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
! announce_next_takeover = true;
! return;
! }
/*
! * Even if we are one of the chosen standbys, leave if there
! * are less synchronous standbys in waiting state than what is
! * expected by the user.
*/
! if (num_sync < synchronous_standby_num)
{
LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
return;
}
***************
*** 448,454 **** SyncRepReleaseWaiters(void)
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now the sync standby.
*/
if (announce_next_takeover)
{
--- 537,543 ----
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now a sync standby.
*/
if (announce_next_takeover)
{
***************
*** 457,462 **** SyncRepReleaseWaiters(void)
--- 546,554 ----
(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
application_name, MyWalSnd->sync_standby_priority)));
}
+
+ /* Clean up */
+ pfree(sync_standbys);
}
/*
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 2733,2740 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int priority = 0;
! int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
--- 2733,2740 ----
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int *sync_standbys;
! int num_sync = 0;
int i;
/* check to see if caller supports us returning a tuplestore */
***************
*** 2765,2800 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby. This code must match the code in SyncRepReleaseWaiters().
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! if (walsnd->pid != 0)
! {
! /*
! * Treat a standby such as a pg_basebackup background process
! * which always returns an invalid flush location, as an
! * asynchronous standby.
! */
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
!
! if (walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
! {
! priority = walsnd->sync_standby_priority;
! sync_standby = i;
! }
! }
}
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
--- 2765,2787 ----
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby.
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
+
+ /* Get first the priorities on each standby as long as we hold a lock */
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
}
+
+ /* Obtain list of synchronous standbys */
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
***************
*** 2856,2870 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
- else if (i == sync_standby)
- values[7] = CStringGetTextDatum("sync");
else
! values[7] = CStringGetTextDatum("potential");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
--- 2843,2872 ----
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
else
! {
! int j;
! bool found = false;
!
! for (j = 0; j < num_sync; j++)
! {
! /* Found that this node is one in sync */
! if (i == sync_standbys[j])
! {
! values[7] = CStringGetTextDatum("sync");
! found = true;
! break;
! }
! }
! if (!found)
! values[7] = CStringGetTextDatum("potential");
! }
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
+ pfree(sync_standbys);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2551,2556 **** static struct config_int ConfigureNamesInt[] =
--- 2551,2566 ----
NULL, NULL, NULL
},
+ {
+ {"synchronous_standby_num", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Number of potential synchronous standbys."),
+ NULL
+ },
+ &synchronous_standby_num,
+ 1, 1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 235,240 ****
--- 235,241 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_standby_num = 1 # number of standby servers using sync rep
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 33,38 ****
--- 33,39 ----
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ extern int synchronous_standby_num;
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
***************
*** 49,54 **** extern void SyncRepUpdateSyncStandbysDefined(void);
--- 50,56 ----
/* called by various procs */
extern int SyncRepWakeQueue(bool all, int mode);
+ extern int *SyncRepGetSynchronousNodes(int *num_sync);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
On Mon, Aug 11, 2014 at 11:54 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Aug 11, 2014 at 1:31 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Great! This is really the feature that I want.
Though I forget why we missed this feature when we added the
synchronous replication feature, it may be worth reading the old
discussion, which may point out potential problems with N sync standbys.

Sure, I'll double check. Thanks for your comments.

I just tested this feature with synchronous_standby_num = 2.
I started up only one synchronous standby and ran a write
transaction. The transaction completed successfully, i.e., it
didn't wait for two standbys. This is probably a bug in the patch.

Oh OK, yes this is a bug in what I did. The number of standbys to wait
for should take precedence over the number of standbys found in the list
of active WAL senders. I changed the patch to take that behavior into
account. So, for example, if you have only one sync standby connected
and synchronous_standby_num = 2, the client waits indefinitely.
Thanks for updating the patch! Again I tested the feature and found
something wrong. I set synchronous_standby_num to 2 and started three
standbys. Two of them are included in synchronous_standby_names, i.e.,
they are synchronous standbys; the other standby is always asynchronous.
When I shut down one of the synchronous standbys and ran a write
transaction, the transaction completed successfully. So the transaction
doesn't wait for two sync standbys in that case. This is probably a bug.
Regards,
--
Fujii Masao
On Mon, Aug 11, 2014 at 1:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Thanks for updating the patch! Again I tested the feature and found
something wrong. I set synchronous_standby_num to 2 and started three
standbys. Two of them are included in synchronous_standby_names, i.e.,
they are synchronous standbys; the other standby is always asynchronous.
When I shut down one of the synchronous standbys and ran a write
transaction, the transaction completed successfully. So the transaction
doesn't wait for two sync standbys in that case. This is probably a bug.
Well, that's working in my case :)
Please see below with 4 nodes: 1 master and 3 standbys on the same host.
The master listens on 5432, the other nodes on 5433, 5434 and 5435. Each
standby's application_name is node_$PORT.
=# show synchronous_standby_names ;
synchronous_standby_names
---------------------------
node_5433,node_5434
(1 row)
=# show synchronous_standby_num ;
synchronous_standby_num
-------------------------
2
(1 row)
=# SELECT application_name,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority,
sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
node_5433 | 0 | 1 | sync
node_5434 | 0 | 2 | sync
node_5435 | 0 | 0 | async
(3 rows)
=# create table aa (a int);
CREATE TABLE
[...]
-- Stopped node with port 5433:
[...]
=# SELECT application_name,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority,
sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
node_5434 | 0 | 2 | sync
node_5435 | 0 | 0 | async
(2 rows)
=# create table ab (a int);
^CCancel request sent
WARNING: 01000: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby(s).
LOCATION: SyncRepWaitForLSN, syncrep.c:227
CREATE TABLE
Regards,
--
Michael
On Mon, Aug 11, 2014 at 2:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Aug 11, 2014 at 1:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Thanks for updating the patch! Again I tested the feature and found
something wrong. I set synchronous_standby_num to 2 and started three
standbys. Two of them are included in synchronous_standby_names, i.e.,
they are synchronous standbys; the other standby is always asynchronous.
When I shut down one of the synchronous standbys and ran a write
transaction, the transaction completed successfully. So the transaction
doesn't wait for two sync standbys in that case. This is probably a bug.

Well, that's working in my case :)

Oh, that worked on my machine, too, this time... I did something wrong.
Sorry for the noise.
Regards,
--
Fujii Masao
On Mon, Aug 11, 2014 at 4:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Aug 11, 2014 at 2:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Oh, that worked on my machine, too, this time... I did something wrong.
Sorry for the noise.
No problem, thanks for spending time testing.
--
Michael
On Mon, Aug 11, 2014 at 4:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Mon, Aug 11, 2014 at 4:26 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Aug 11, 2014 at 2:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Oh, that worked on my machine, too, this time... I did something wrong.
Sorry for the noise.

No problem, thanks for spending time testing.
I probably hit a similar but different problem. I set synchronous_standby_num
to 2 and started up two synchronous standbys. When I ran write transactions,
they completed successfully. That's OK.
I sent the SIGSTOP signal to the walreceiver process of one of the sync standbys,
and then ran write transactions again. In this case they must not complete,
because their WAL cannot be replicated to the standby whose walreceiver
was stopped. But they completed successfully.
Regards,
--
Fujii Masao
On Wed, Aug 13, 2014 at 2:10 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I sent the SIGSTOP signal to the walreceiver process of one of the sync standbys,
and then ran write transactions again. In this case they must not complete,
because their WAL cannot be replicated to the standby whose walreceiver
was stopped. But they completed successfully.
At the end of SyncRepReleaseWaiters, SYNC_REP_WAIT_WRITE and
SYNC_REP_WAIT_FLUSH in walsndctl could be updated based on a single WAL
sender being in sync, waking up backends even if the other sync standbys
had not caught up. We instead need to scan all the synchronous WAL
senders, find the minimum write and flush positions, and update walsndctl
with those values. Well, that's a code path I forgot to cover.
Attached is an updated patch fixing the problem you reported.
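In short, the release logic now does something like this before waking
backends (simplified from the attached patch):

    XLogRecPtr  min_write_pos = MyWalSnd->write;
    XLogRecPtr  min_flush_pos = MyWalSnd->flush;
    int         i;

    /* Find the oldest write and flush positions among the standbys in sync */
    for (i = 0; i < num_sync; i++)
    {
        volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];

        if (min_write_pos > walsndloc->write)
            min_write_pos = walsndloc->write;
        if (min_flush_pos > walsndloc->flush)
            min_flush_pos = walsndloc->flush;
    }

    /* Wake up backends only up to what all the sync standbys have reached */
    if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < min_write_pos)
    {
        walsndctl->lsn[SYNC_REP_WAIT_WRITE] = min_write_pos;
        numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
    }
    if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < min_flush_pos)
    {
        walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = min_flush_pos;
        numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
    }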
Regards,
--
Michael
Attachments:
20140813_multi_syncrep_v3.patch (text/x-patch)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2586,2597 **** include_dir 'conf.d'
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
! (as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
--- 2586,2598 ----
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be a number of active synchronous standbys
! defined by <varname>synchronous_standby_num</>; transactions waiting
! for commit will be allowed to proceed after those standby servers
! confirm receipt of their data. The synchronous standbys will be
! the first entries named in this list that are both currently connected
! and streaming data in real-time (as shown by a state of
! <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
***************
*** 2627,2632 **** include_dir 'conf.d'
--- 2628,2652 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+ <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>, as described in
+ <xref linkend="synchronous-replication">, and listed as the first
+ elements of <xref linkend="guc-synchronous-standby-names">.
+ </para>
+ <para>
+ Default value is 1.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
<indexterm>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1081,1092 **** primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is the first matching standby, as specified in
! <varname>synchronous_standby_names</> on the primary, the reply
! messages from that standby will be used to wake users waiting for
! confirmation that the commit record has been received. These parameters
! allow the administrator to specify which standby servers should be
! synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
--- 1081,1092 ----
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is among the first <varname>synchronous_standby_num</> matching
! standbys, as specified in <varname>synchronous_standby_names</> on the
! primary, the reply messages from those standbys will be used to wake users
! waiting for confirmation that the commit record has been received. These
! parameters allow the administrator to specify which standby servers should
! be synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
***************
*** 1169,1177 **** primary_slot_name = 'node_a_slot'
The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first named standby will be used as the synchronous standby. Standbys
! listed after this will take over the role of synchronous standby if the
! first one should fail.
</para>
<para>
--- 1169,1177 ----
The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first <varname>synchronous_standby_num</> named standbys will be used as
! the synchronous standbys. Standbys listed after this will take over the role
! of synchronous standby if one of them should fail.
</para>
<para>
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 5,11 ****
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the sync standby.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
--- 5,11 ----
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the synchronous standbys.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
***************
*** 59,64 ****
--- 59,65 ----
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+ int synchronous_standby_num = 1;
#define SyncStandbysDefined() \
(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
***************
*** 206,212 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
--- 207,213 ----
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
***************
*** 223,229 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
SyncRepCancelWait();
break;
}
--- 224,230 ----
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
SyncRepCancelWait();
break;
}
***************
*** 357,365 **** SyncRepInitConfig(void)
}
}
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standby-releases-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
--- 358,442 ----
}
}
+
+ /*
+ * Obtain a palloc'd array containing positions of standbys currently
+ * considered as synchronous. Caller is responsible for freeing the
+ * data obtained.
+ * Callers of this function should also hold the necessary lock on
+ * SyncRepLock.
+ */
+ int *
+ SyncRepGetSynchronousNodes(int *num_sync)
+ {
+ int *sync_standbys;
+ int priority = 0;
+ int i;
+
+ /* Make enough room */
+ sync_standbys = (int *) palloc(synchronous_standby_num * sizeof(int));
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* Use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ /* Process to next if not active */
+ if (walsnd->pid == 0)
+ continue;
+
+ /* Process to next if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Process to next one if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Process to next one if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ *num_sync == synchronous_standby_num)
+ continue;
+
+ /* Process to next one if flush position is invalid */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now.
+ */
+ if (*num_sync == synchronous_standby_num)
+ {
+ int j;
+ for (j = 0; j < *num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[j]];
+ if (walsndloc->sync_standby_priority == priority)
+ {
+ sync_standbys[j] = i;
+ break;
+ }
+ }
+ }
+ else
+ {
+ sync_standbys[*num_sync] = i;
+ (*num_sync)++;
+ }
+
+ /* Update priority for next tracking */
+ priority = walsnd->sync_standby_priority;
+ }
+
+ return sync_standbys;
+ }
+
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standbys-release-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
***************
*** 368,378 **** void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
! int priority = 0;
int i;
/*
* If this WALSender is serving a standby that is not on the list of
--- 445,458 ----
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! int *sync_standbys;
int numwrite = 0;
int numflush = 0;
! int num_sync = 0;
int i;
+ bool found = false;
+ XLogRecPtr min_write_pos;
+ XLogRecPtr min_flush_pos;
/*
* If this WALSender is serving a standby that is not on the list of
***************
*** 388,454 **** SyncRepReleaseWaiters(void)
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standby. If you change this, also
! * change pg_stat_get_wal_senders().
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! for (i = 0; i < max_wal_senders; i++)
{
! /* use volatile pointer to prevent code rearrangement */
! volatile WalSnd *walsnd = &walsndctl->walsnds[i];
!
! if (walsnd->pid != 0 &&
! walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
{
! priority = walsnd->sync_standby_priority;
! syncWalSnd = walsnd;
}
}
/*
! * We should have found ourselves at least.
*/
! Assert(syncWalSnd);
/*
! * If we aren't managing the highest priority standby then just leave.
*/
! if (syncWalSnd != MyWalSnd)
{
LWLockRelease(SyncRepLock);
! announce_next_takeover = true;
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now the sync standby.
*/
if (announce_next_takeover)
{
--- 468,562 ----
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standbys.
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
! /*
! * We should have found ourselves at least.
! */
! Assert(num_sync > 0);
!
! /*
! * If we aren't managing one of the standbys with highest priority
! * then just leave.
! */
! for (i = 0; i < num_sync; i++)
{
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
! if (walsndloc == MyWalSnd)
{
! found = true;
! break;
}
}
/*
! * We are definitely not one of the chosen... but we could be at the
! * next takeover.
*/
! if (!found)
! {
! LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
! announce_next_takeover = true;
! return;
! }
/*
! * Even if we are one of the chosen standbys, leave if there
! * are less synchronous standbys in waiting state than what is
! * expected by the user.
*/
! if (num_sync < synchronous_standby_num)
{
LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location, of course only if all the standbys found as synchronous
! * have already reached that point, so first find what are the oldest
! * write and flush positions of all the standbys considered in sync...
*/
! min_write_pos = MyWalSnd->write;
! min_flush_pos = MyWalSnd->flush;
! for (i = 0; i < num_sync; i++)
! {
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
!
! if (min_write_pos > walsndloc->write)
! min_write_pos = walsndloc->write;
! if (min_flush_pos > walsndloc->flush)
! min_flush_pos = walsndloc->flush;
! }
!
! /* ... And now update if necessary */
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < min_write_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = min_write_pos;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < min_flush_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = min_flush_pos;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_WRITE] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_WRITE],
! numflush, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_FLUSH]);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now a sync standby.
*/
if (announce_next_takeover)
{
***************
*** 457,462 **** SyncRepReleaseWaiters(void)
--- 565,573 ----
(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
application_name, MyWalSnd->sync_standby_priority)));
}
+
+ /* Clean up */
+ pfree(sync_standbys);
}
/*
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 2735,2742 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int priority = 0;
! int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
--- 2735,2742 ----
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int *sync_standbys;
! int num_sync = 0;
int i;
/* check to see if caller supports us returning a tuplestore */
***************
*** 2767,2802 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby. This code must match the code in SyncRepReleaseWaiters().
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! if (walsnd->pid != 0)
! {
! /*
! * Treat a standby such as a pg_basebackup background process
! * which always returns an invalid flush location, as an
! * asynchronous standby.
! */
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
!
! if (walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
! {
! priority = walsnd->sync_standby_priority;
! sync_standby = i;
! }
! }
}
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
--- 2767,2789 ----
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby.
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
+
+ /* Get first the priorities on each standby as long as we hold a lock */
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
}
+
+ /* Obtain list of synchronous standbys */
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
***************
*** 2858,2872 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
- else if (i == sync_standby)
- values[7] = CStringGetTextDatum("sync");
else
! values[7] = CStringGetTextDatum("potential");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
--- 2845,2874 ----
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
else
! {
! int j;
! bool found = false;
!
! for (j = 0; j < num_sync; j++)
! {
! /* Found that this node is one in sync */
! if (i == sync_standbys[j])
! {
! values[7] = CStringGetTextDatum("sync");
! found = true;
! break;
! }
! }
! if (!found)
! values[7] = CStringGetTextDatum("potential");
! }
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
+ pfree(sync_standbys);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2548,2553 **** static struct config_int ConfigureNamesInt[] =
--- 2548,2563 ----
NULL, NULL, NULL
},
+ {
+ {"synchronous_standby_num", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Number of potential synchronous standbys."),
+ NULL
+ },
+ &synchronous_standby_num,
+ 1, 1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 235,240 ****
--- 235,241 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_standby_num = 1 # number of standbys servers using sync rep
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 33,38 ****
--- 33,39 ----
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ extern int synchronous_standby_num;
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
***************
*** 49,54 **** extern void SyncRepUpdateSyncStandbysDefined(void);
--- 50,56 ----
/* called by various procs */
extern int SyncRepWakeQueue(bool all, int mode);
+ extern int *SyncRepGetSynchronousNodes(int *num_sync);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
On Wed, Aug 13, 2014 at 4:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Wed, Aug 13, 2014 at 2:10 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I sent the SIGSTOP signal to the walreceiver process in one of sync standbys,
and then ran write transactions again. In this case, they must not be completed
because their WAL cannot be replicated to the standby that its walreceiver
was stopped. But they were successfully completed.
At the end of SyncRepReleaseWaiters, SYNC_REP_WAIT_WRITE and
SYNC_REP_WAIT_FLUSH in walsndctl were able to update with only one wal
sender in sync, making backends wake up even if other standbys did not
catch up. But we need to scan all the synchronous wal senders and find
the minimum write and flush positions and update walsndctl with those
values. Well that's a code path I forgot to cover.
Attached is an updated patch fixing the problem you reported.
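Concretely, the scan described here amounts to something like the following
sketch over the standbys returned by SyncRepGetSynchronousNodes() (variable
and field names taken from the attached patch; locking of the shared values
is discussed later in the thread):

    /* Find the oldest write/flush positions among the chosen sync standbys. */
    min_write_pos = MyWalSnd->write;
    min_flush_pos = MyWalSnd->flush;
    for (i = 0; i < num_sync; i++)
    {
        volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];

        if (min_write_pos > walsndloc->write)
            min_write_pos = walsndloc->write;
        if (min_flush_pos > walsndloc->flush)
            min_flush_pos = walsndloc->flush;
    }
    /* Only then are walsndctl->lsn[] and the wait queues advanced. */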
+ At any one time there will be at a number of active
synchronous standbys
+ defined by <varname>synchronous_standby_num</>; transactions waiting
It's better to use <xref linkend="guc-synchronous-standby-num">, instead.
+ for commit will be allowed to proceed after those standby servers
+ confirms receipt of their data. The synchronous standbys will be
Typo: confirms -> confirm
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>, as described in
+ <xref linkend="synchronous-replication">, and listed as the first
+ elements of <xref linkend="guc-synchronous-standby-names">.
+ </para>
+ <para>
+ Default value is 1.
+ </para>
synchronous_standby_num is defined with PGC_SIGHUP. So the following
should be added into the document.
This parameter can only be set in the postgresql.conf file or on
the server command line.
The name of the parameter "synchronous_standby_num" sounds to me that
the transaction must wait for its WAL to be replicated to s_s_num standbys.
But that's not true in your patch. If s_s_names is empty, replication works
asynchronously whatever the value of s_s_num is. I'm afraid that it's confusing.
The description of s_s_num is not sufficient. I'm afraid that users can easily
misunderstand that they can use quorum commit feature by using s_s_names
and s_s_num. That is, the transaction waits for its WAL to be replicated to
any s_s_num standbys listed in s_s_names.
When s_s_num is set to larger value than max_wal_senders, we should warn that?
+ for (i = 0; i < num_sync; i++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
+
+ if (min_write_pos > walsndloc->write)
+ min_write_pos = walsndloc->write;
+ if (min_flush_pos > walsndloc->flush)
+ min_flush_pos = walsndloc->flush;
+ }
I don't think that it's safe to see those shared values without spinlock.
Regards,
--
Fujii Masao
On Thu, Aug 14, 2014 at 8:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
+ At any one time there will be at a number of active synchronous standbys
+ defined by <varname>synchronous_standby_num</>; transactions waiting
It's better to use <xref linkend="guc-synchronous-standby-num">, instead.
Fixed.
+ for commit will be allowed to proceed after those standby servers
+ confirms receipt of their data. The synchronous standbys will be
Typo: confirms -> confirm
Fixed.
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>, as described in
+ <xref linkend="synchronous-replication">, and listed as the first
+ elements of <xref linkend="guc-synchronous-standby-names">.
+ </para>
+ <para>
+ Default value is 1.
+ </para>
synchronous_standby_num is defined with PGC_SIGHUP. So the following
should be added into the document.
This parameter can only be set in the postgresql.conf file or on
the server command line.
Fixed.
The name of the parameter "synchronous_standby_num" sounds to me that
the transaction must wait for its WAL to be replicated to s_s_num standbys.
But that's not true in your patch. If s_s_names is empty, replication works
asynchronously whatever the value of s_s_num is. I'm afraid that it's confusing.
The description of s_s_num is not sufficient. I'm afraid that users can easily
misunderstand that they can use quorum commit feature by using s_s_names
and s_s_num. That is, the transaction waits for its WAL to be replicated to
any s_s_num standbys listed in s_s_names.
I reworked the docs to mention all that. Yes things are a bit
different than any quorum commit facility (how to parametrize that
simply without a parameter mapping one to one the items of
s_s_names?), as this facility relies on the order of the items of
s_s_names and the fact that standbys are connected at a given time.
When s_s_num is set to larger value than max_wal_senders, we should warn that?
Actually I have done a bit more than that by forbidding setting
s_s_num to a value higher than max_wal_senders. Thoughts?
Now that we discuss the interactions with other parameters. Another
thing that I am wondering about now is: what should we do if we
specify s_s_num to a number higher than the elements in s_s_names?
Currently, the patch gives the priority to s_s_num, in short if we set
s_s_num to 100, server will wait for 100 servers to confirm commit
even if there are less than 100 elements in s_s_names. I chose this
way because it looks saner particularly if s_s_names = '*'. Thoughts
once again?
+ for (i = 0; i < num_sync; i++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
+
+ if (min_write_pos > walsndloc->write)
+ min_write_pos = walsndloc->write;
+ if (min_flush_pos > walsndloc->flush)
+ min_flush_pos = walsndloc->flush;
+ }
I don't think that it's safe to see those shared values without spinlock.
Looking at walsender.c you are right. I have updated the code to use
the mutex lock of the walsender whose values are being read from.
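In other words, each read of a walsender's shared write/flush positions in
that loop is now wrapped in the per-walsender spinlock, roughly as in the
attached v4 patch:

    SpinLockAcquire(&walsndloc->mutex);
    if (min_write_pos > walsndloc->write)
        min_write_pos = walsndloc->write;
    if (min_flush_pos > walsndloc->flush)
        min_flush_pos = walsndloc->flush;
    SpinLockRelease(&walsndloc->mutex);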
Regards,
--
Michael
Attachments:
20140815_multi_syncrep_v4.patchtext/x-patch; charset=US-ASCII; name=20140815_multi_syncrep_v4.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2586,2597 **** include_dir 'conf.d'
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
! (as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
--- 2586,2598 ----
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be at a number of active synchronous standbys
! defined by <xref linkend="guc-synchronous-standby-num">, transactions
! waiting for commit will be allowed to proceed after those standby
! servers confirm receipt of their data. The synchronous standbys will be
! the first entries named in this list that are both currently connected
! and streaming data in real-time (as shown by a state of
! <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
***************
*** 2627,2632 **** include_dir 'conf.d'
--- 2628,2674 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+ <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>.
+ </para>
+ <para>
+ Default value is 1. This parameter value cannot be higher than
+ <xref linkend="guc-max-wal-senders">.
+ </para>
+ <para>
+ Are considered as synchronous the first elements of
+ <xref linkend="guc-synchronous-standby-names"> in number of
+ <xref linkend="guc-synchronous-standby-num"> that are
+ connected. If there are more elements than the number of stansbys
+ required, all the additional standbys are potential synchronous
+ candidates. If <xref linkend="guc-synchronous-standby-names"> is
+ empty, all the standbys are asynchronous. If it is set to the
+ special entry <literal>*</>, a number of standbys equal to
+ <xref linkend="guc-synchronous-standby-names"> with the highest
+ pritority are elected as being synchronous.
+ </para>
+ <para>
+ Server will wait for commit confirmation from
+ <xref linkend="guc-synchronous-standby-num"> standbys, meaning that
+ if <xref linkend="guc-synchronous-standby-names"> has less elements
+ than the number of standbys required, server will wait indefinitely
+ for a commit confirmation.
+ </para>
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
<indexterm>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1081,1092 **** primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is the first matching standby, as specified in
! <varname>synchronous_standby_names</> on the primary, the reply
! messages from that standby will be used to wake users waiting for
! confirmation that the commit record has been received. These parameters
! allow the administrator to specify which standby servers should be
! synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
--- 1081,1092 ----
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is the first <varname>synchronous_standby_num</> matching
! standbys, as specified in <varname>synchronous_standby_names</> on the
! primary, the reply messages from that standby will be used to wake users
! waiting for confirmation that the commit record has been received. These
! parameters allow the administrator to specify which standby servers should
! be synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
***************
*** 1169,1177 **** primary_slot_name = 'node_a_slot'
The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first named standby will be used as the synchronous standby. Standbys
! listed after this will take over the role of synchronous standby if the
! first one should fail.
</para>
<para>
--- 1169,1177 ----
The best solution for avoiding data loss is to ensure you don't lose
your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first <varname>synchronous_standby_num</> named standbys will be used as
! the synchronous standbys. Standbys listed after this will take over the role
! of synchronous standby if the first one should fail.
</para>
<para>
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 5,11 ****
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the sync standby.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
--- 5,11 ----
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the synchronous standbys.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
***************
*** 59,64 ****
--- 59,65 ----
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+ int synchronous_standby_num = 1;
#define SyncStandbysDefined() \
(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
***************
*** 206,212 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
--- 207,213 ----
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
***************
*** 223,229 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
SyncRepCancelWait();
break;
}
--- 224,230 ----
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
SyncRepCancelWait();
break;
}
***************
*** 357,365 **** SyncRepInitConfig(void)
}
}
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standby-releases-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
--- 358,442 ----
}
}
+
+ /*
+ * Obtain a palloc'd array containing positions of stanbys currently
+ * considered as synchronous. Caller is responsible for freeing the
+ * data obtained.
+ * Callers of this function should as well take a necessary lock on
+ * SyncRepLock.
+ */
+ int *
+ SyncRepGetSynchronousNodes(int *num_sync)
+ {
+ int *sync_standbys;
+ int priority = 0;
+ int i;
+
+ /* Make enough room */
+ sync_standbys = (int *) palloc(synchronous_standby_num * sizeof(int));
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* Use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ /* Process to next if not active */
+ if (walsnd->pid == 0)
+ continue;
+
+ /* Process to next if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Process to next one if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Process to next one if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ *num_sync == synchronous_standby_num)
+ continue;
+
+ /* Process to next one if flush position is invalid */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now.
+ */
+ if (*num_sync == synchronous_standby_num)
+ {
+ int j;
+ for (j = 0; j < *num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[j]];
+ if (walsndloc->sync_standby_priority == priority)
+ {
+ sync_standbys[j] = i;
+ break;
+ }
+ }
+ }
+ else
+ {
+ sync_standbys[*num_sync] = i;
+ (*num_sync)++;
+ }
+
+ /* Update priority for next tracking */
+ priority = walsnd->sync_standby_priority;
+ }
+
+ return sync_standbys;
+ }
+
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standbys-release-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
***************
*** 368,378 **** void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
! int priority = 0;
int i;
/*
* If this WALSender is serving a standby that is not on the list of
--- 445,458 ----
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! int *sync_standbys;
int numwrite = 0;
int numflush = 0;
! int num_sync = 0;
int i;
+ bool found = false;
+ XLogRecPtr min_write_pos;
+ XLogRecPtr min_flush_pos;
/*
* If this WALSender is serving a standby that is not on the list of
***************
*** 388,454 **** SyncRepReleaseWaiters(void)
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standby. If you change this, also
! * change pg_stat_get_wal_senders().
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! for (i = 0; i < max_wal_senders; i++)
{
! /* use volatile pointer to prevent code rearrangement */
! volatile WalSnd *walsnd = &walsndctl->walsnds[i];
!
! if (walsnd->pid != 0 &&
! walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
{
! priority = walsnd->sync_standby_priority;
! syncWalSnd = walsnd;
}
}
/*
! * We should have found ourselves at least.
*/
! Assert(syncWalSnd);
/*
! * If we aren't managing the highest priority standby then just leave.
*/
! if (syncWalSnd != MyWalSnd)
{
LWLockRelease(SyncRepLock);
! announce_next_takeover = true;
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now the sync standby.
*/
if (announce_next_takeover)
{
--- 468,564 ----
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standbys.
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
! /*
! * We should have found ourselves at least.
! */
! Assert(num_sync > 0);
!
! /*
! * If we aren't managing one of the standbys with highest priority
! * then just leave.
! */
! for (i = 0; i < num_sync; i++)
{
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
! if (walsndloc == MyWalSnd)
{
! found = true;
! break;
}
}
/*
! * We are definitely not one of the chosen... But we could by
! * taking the next takeover.
*/
! if (!found)
! {
! LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
! announce_next_takeover = true;
! return;
! }
/*
! * Even if we are one of the chosen standbys, leave if there
! * are less synchronous standbys in waiting state than what is
! * expected by the user.
*/
! if (num_sync < synchronous_standby_num)
{
LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location, of course only if all the standbys found as synchronous
! * have already reached that point, so first find what are the oldest
! * write and flush positions of all the standbys considered in sync...
*/
! min_write_pos = MyWalSnd->write;
! min_flush_pos = MyWalSnd->flush;
! for (i = 0; i < num_sync; i++)
! {
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
!
! SpinLockAcquire(&walsndloc->mutex);
! if (min_write_pos > walsndloc->write)
! min_write_pos = walsndloc->write;
! if (min_flush_pos > walsndloc->flush)
! min_flush_pos = walsndloc->flush;
! SpinLockRelease(&walsndloc->mutex);
! }
!
! /* ... And now update if necessary */
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < min_write_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = min_write_pos;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < min_flush_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = min_flush_pos;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_WRITE] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_WRITE],
! numflush, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_FLUSH]);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now a sync standby.
*/
if (announce_next_takeover)
{
***************
*** 457,462 **** SyncRepReleaseWaiters(void)
--- 567,575 ----
(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
application_name, MyWalSnd->sync_standby_priority)));
}
+
+ /* Clean up */
+ pfree(sync_standbys);
}
/*
***************
*** 694,699 **** check_synchronous_standby_names(char **newval, void **extra, GucSource source)
--- 807,836 ----
return true;
}
+ bool
+ check_synchronous_standby_num(int *newval, void **extra, GucSource source)
+ {
+ /*
+ * Default value is important for backward-compatibility, as well as
+ * for initialization.
+ */
+ if (*newval == 1)
+ return true;
+
+ /*
+ * If new value is higher than max_wal_senders, enforce it to the value of
+ * max_wal_senders.
+ */
+ if (*newval > max_wal_senders)
+ {
+ GUC_check_errdetail("synchronous_standby_num cannot be higher than max_wal_senders.");
+ *newval = max_wal_senders;
+ return false;
+ }
+
+ return true;
+ }
+
void
assign_synchronous_commit(int newval, void *extra)
{
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 2735,2742 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int priority = 0;
! int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
--- 2735,2742 ----
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int *sync_standbys;
! int num_sync = 0;
int i;
/* check to see if caller supports us returning a tuplestore */
***************
*** 2767,2802 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby. This code must match the code in SyncRepReleaseWaiters().
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! if (walsnd->pid != 0)
! {
! /*
! * Treat a standby such as a pg_basebackup background process
! * which always returns an invalid flush location, as an
! * asynchronous standby.
! */
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
!
! if (walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
! {
! priority = walsnd->sync_standby_priority;
! sync_standby = i;
! }
! }
}
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
--- 2767,2789 ----
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby.
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
+
+ /* Get first the priorities on each standby as long as we hold a lock */
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
}
+
+ /* Obtain list of synchronous standbys */
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
***************
*** 2858,2872 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
- else if (i == sync_standby)
- values[7] = CStringGetTextDatum("sync");
else
! values[7] = CStringGetTextDatum("potential");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
--- 2845,2874 ----
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
else
! {
! int j;
! bool found = false;
!
! for (j = 0; j < num_sync; j++)
! {
! /* Found that this node is one in sync */
! if (i == sync_standbys[j])
! {
! values[7] = CStringGetTextDatum("sync");
! found = true;
! break;
! }
! }
! if (!found)
! values[7] = CStringGetTextDatum("potential");
! }
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
+ pfree(sync_standbys);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2548,2553 **** static struct config_int ConfigureNamesInt[] =
--- 2548,2563 ----
NULL, NULL, NULL
},
+ {
+ {"synchronous_standby_num", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Number of potential synchronous standbys."),
+ NULL
+ },
+ &synchronous_standby_num,
+ 1, 1, INT_MAX,
+ check_synchronous_standby_num, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 235,240 ****
--- 235,241 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_standby_num = 1 # number of standbys servers using sync rep
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 33,38 ****
--- 33,39 ----
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ extern int synchronous_standby_num;
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
***************
*** 49,56 **** extern void SyncRepUpdateSyncStandbysDefined(void);
--- 50,59 ----
/* called by various procs */
extern int SyncRepWakeQueue(bool all, int mode);
+ extern int *SyncRepGetSynchronousNodes(int *num_sync);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+ extern bool check_synchronous_standby_num(int *newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
#endif /* _SYNCREP_H */
On Fri, Aug 15, 2014 at 4:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Thu, Aug 14, 2014 at 8:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
+ At any one time there will be at a number of active synchronous standbys
+ defined by <varname>synchronous_standby_num</>; transactions waiting
It's better to use <xref linkend="guc-synchronous-standby-num">, instead.
Fixed.
+ for commit will be allowed to proceed after those standby servers
+ confirms receipt of their data. The synchronous standbys will be
Typo: confirms -> confirm
Fixed.
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>, as described in
+ <xref linkend="synchronous-replication">, and listed as the first
+ elements of <xref linkend="guc-synchronous-standby-names">.
+ </para>
+ <para>
+ Default value is 1.
+ </para>
synchronous_standby_num is defined with PGC_SIGHUP. So the following
should be added into the document.
This parameter can only be set in the postgresql.conf file or on
the server command line.
Fixed.
The name of the parameter "synchronous_standby_num" sounds to me that
the transaction must wait for its WAL to be replicated to s_s_num standbys.
But that's not true in your patch. If s_s_names is empty, replication works
asynchronously whatever the value of s_s_num is. I'm afraid that it's confusing.
The description of s_s_num is not sufficient. I'm afraid that users can easily
The description of s_s_num is not sufficient. I'm afraid that users can easily
misunderstand that they can use quorum commit feature by using s_s_names
and s_s_num. That is, the transaction waits for its WAL to be replicated to
any s_s_num standbys listed in s_s_names.
I reworked the docs to mention all that. Yes things are a bit
different than any quorum commit facility (how to parametrize that
simply without a parameter mapping one to one the items of
s_s_names?), as this facility relies on the order of the items of
s_s_names and the fact that standbys are connected at a given time.
When s_s_num is set to larger value than max_wal_senders, we should warn that?
Actually I have done a bit more than that by forbidding setting
s_s_num to a value higher than max_wal_senders. Thoughts?
You added check_synchronous_standby_num() as the GUC check function for
synchronous_standby_num, and checked that there. But that seems to be wrong.
You can easily see the following error messages even if synchronous_standby_num
is smaller than max_wal_senders. The point is that synchronous_standby_num
should be located before max_wal_senders in postgresql.conf.
LOG: invalid value for parameter "synchronous_standby_num": 0
DETAIL: synchronous_standby_num cannot be higher than max_wal_senders.
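A GUC check hook only sees the parameter values that have been assigned so
far, so the comparison in the v4 patch can run against a stale value of
max_wal_senders. A simplified sketch of the hook, with a note on where it
goes wrong:

    bool
    check_synchronous_standby_num(int *newval, void **extra, GucSource source)
    {
        /*
         * If synchronous_standby_num is assigned before max_wal_senders while
         * postgresql.conf is processed, max_wal_senders still holds its old
         * (default) value here, so a perfectly valid setting gets rejected.
         */
        if (*newval > max_wal_senders)
        {
            GUC_check_errdetail("synchronous_standby_num cannot be higher than max_wal_senders.");
            return false;
        }

        return true;
    }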
Now that we discuss the interactions with other parameters. Another
thing that I am wondering about now is: what should we do if we
specify s_s_num to a number higher than the elements in s_s_names?
Currently, the patch gives the priority to s_s_num, in short if we set
s_s_num to 100, server will wait for 100 servers to confirm commit
even if there are less than 100 elements in s_s_names. I chose this
way because it looks saner particularly if s_s_names = '*'. Thoughts
once again?
I'm fine with this. As you gave an example, the number of entries in s_s_names
can be smaller than the number of actual active sync standbys. For example,
when s_s_names is set to 'hoge', more than one standby with the name 'hoge'
can connect to the server with sync mode.
I still think that it's strange that replication can be async even when
s_s_num is larger than zero. That is, I think that the transaction must
wait for s_s_num sync standbys whether s_s_names is empty or not.
OTOH, if s_s_num is zero, replication must be async whether s_s_names
is empty or not. At least for me, it's intuitive to use s_s_num primarily
to control the sync mode. Of course, other hackers may have different
thoughts, so we need to keep our ear open for them.
In the above design, one problem is that the number of parameters
that those who want to set up only one sync replication need to change is
incremented by one. That is, they need to change s_s_num additionally.
If we are really concerned about this, we can treat a value of -1 in
s_s_num as the special value, which allows us to control sync replication
only by s_s_names as we do now. That is, if s_s_names is empty,
replication would be async. Otherwise, only one standby with
high-priority in s_s_names becomes sync one. Probably the default of
s_s_num should be -1. Thought?
The source code comments at the top of syncrep.c need to be updated.
It's worth checking whether there are other comments to be updated.
Regards,
--
Fujii Masao
On Fri, Aug 15, 2014 at 8:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Now that we discuss the interactions with other parameters. Another
thing that I am wondering about now is: what should we do if we
specify s_s_num to a number higher than the elements in s_s_names?
Currently, the patch gives the priority to s_s_num, in short if we set
s_s_num to 100, server will wait for 100 servers to confirm commit
even if there are less than 100 elements in s_s_names. I chose this
way because it looks saner particularly if s_s_names = '*'. Thoughts
once again?
I'm fine with this. As you gave an example, the number of entries in s_s_names
can be smaller than the number of actual active sync standbys. For example,
when s_s_names is set to 'hoge', more than one standby with the name 'hoge'
can connect to the server with sync mode.
This is a bit tricky. Suppose there is one standby connected which
has reached the relevant WAL position. We then lose that connection,
and a new standby connects. When or if the second standby is known to
have reached the relevant WAL position, can we release waiters? It
depends. If the old and new connections are to two different standbys
that happen to have the same name, yes. But if it's the same standby
reconnecting, then no.
I still think that it's strange that replication can be async even when
s_s_num is larger than zero. That is, I think that the transaction must
wait for s_s_num sync standbys whether s_s_names is empty or not.
+1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 15, 2014 at 9:28 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
You added check_synchronous_standby_num() as the GUC check function for
synchronous_standby_num, and checked that there. But that seems to be
wrong.
You can easily see the following error messages even if
synchronous_standby_num
is smaller than max_wal_senders. The point is that synchronous_standby_num
should be located before max_wal_senders in postgresql.conf.
LOG: invalid value for parameter "synchronous_standby_num": 0
DETAIL: synchronous_standby_num cannot be higher than max_wal_senders.
I am not sure what I can do here, so I am removing this check in the code,
and simply adding a note in the docs that a value of _num higher than
max_wal_senders does not have much meaning.
I still think that it's strange that replication can be async even when
s_s_num is larger than zero. That is, I think that the transaction must
wait for s_s_num sync standbys whether s_s_names is empty or not.
OTOH, if s_s_num is zero, replication must be async whether s_s_names
is empty or not. At least for me, it's intuitive to use s_s_num primarily
to control the sync mode. Of course, other hackers may have different
thoughts, so we need to keep our ear open for them.
Sure, the compromise looks to be what you propose, and I am fine with that.
In the above design, one problem is that the number of parameters
that those who want to set up only one sync replication need to change is
incremented by one. That is, they need to change s_s_num additionally.
If we are really concerned about this, we can treat a value of -1 in
s_s_num as the special value, which allows us to control sync replication
only by s_s_names as we do now. That is, if s_s_names is empty,
replication would be async. Otherwise, only one standby with
high-priority in s_s_names becomes sync one. Probably the default of
s_s_num should be -1. Thought?
Taking into account those comments, attached is a patch doing the following
things depending on the values of _num and _names:
- If _num = -1 and _names is empty, all the standbys are considered as
async (same behavior as 9.1~, and default).
- If _num = -1 and _names has at least one item, wait for one standby, even
if it is not connected at the time of commit. If one node is found as sync,
other standbys listed in _names with higher priority than the sync one are
in potential state (same as existing behavior).
- If _num = 0, all the standbys are async, whatever the values in _names.
Priority is enforced to 0 for all the standbys. SyncStandbysDefined is set
to false in this case.
- If _num > 0, the server must wait for _num standbys whatever the values in _names.
The default value of _num is -1. Documentation has been updated in
consequence.
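That behavior matrix maps onto the SyncStandbysDefined() test in the
attached patch, roughly:

    /*
     * Waiting for synchronous standbys is enabled either when
     * synchronous_standby_num is explicitly positive, or when it is left at
     * its default of -1 and synchronous_standby_names is non-empty (the
     * existing single-standby behavior).
     */
    #define SyncStandbysDefined() \
        (synchronous_standby_num > 0 || \
         (synchronous_standby_num == -1 && \
          SyncRepStandbyNames != NULL && \
          SyncRepStandbyNames[0] != '\0'))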
The source code comments at the top of syncrep.c need to be updated.
It's worth checking whether there are other comments to be updated.
Done. I have updated some comments in other places than the header.
Regards,
--
Michael
Attachments:
20140821_multi_syncrep_v5.patchtext/x-patch; charset=US-ASCII; name=20140821_multi_syncrep_v5.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2586,2597 **** include_dir 'conf.d'
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
! (as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
--- 2586,2598 ----
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be at a number of active synchronous standbys
! defined by <xref linkend="guc-synchronous-standby-num">, transactions
! waiting for commit will be allowed to proceed after those standby
! servers confirm receipt of their data. The synchronous standbys will be
! the first entries named in this list that are both currently connected
! and streaming data in real-time (as shown by a state of
! <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
***************
*** 2627,2632 **** include_dir 'conf.d'
--- 2628,2685 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+ <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>.
+ </para>
+ <para>
+ Default value is <literal>-1</>. In this case, if
+ <xref linkend="guc-synchronous-standby-names"> is empty all the
+ standby nodes are considered asynchronous. If there is at least
+ one node name defined, process will wait for one synchronous
+ standby listed.
+ </para>
+ <para>
+ When this parameter is set to <literal>0</>, all the standby
+ nodes will be considered as asynchronous.
+ </para>
+ <para>
+ This parameter value cannot be higher than
+ <xref linkend="guc-max-wal-senders">.
+ </para>
+ <para>
+ Are considered as synchronous the first elements of
+ <xref linkend="guc-synchronous-standby-names"> in number of
+ <xref linkend="guc-synchronous-standby-num"> that are
+ connected. If there are more elements than the number of stansbys
+ required, all the additional standbys are potential synchronous
+ candidates. If <xref linkend="guc-synchronous-standby-names"> is
+ empty, all the standbys are asynchronous. If it is set to the
+ special entry <literal>*</>, a number of standbys equal to
+ <xref linkend="guc-synchronous-standby-names"> with the highest
+ pritority are elected as being synchronous.
+ </para>
+ <para>
+ Server will wait for commit confirmation from
+ <xref linkend="guc-synchronous-standby-num"> standbys, meaning that
+ if <xref linkend="guc-synchronous-standby-names"> has less elements
+ than the number of standbys required, server will wait indefinitely
+ for a commit confirmation.
+ </para>
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
<indexterm>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1081,1092 **** primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is the first matching standby, as specified in
! <varname>synchronous_standby_names</> on the primary, the reply
! messages from that standby will be used to wake users waiting for
! confirmation that the commit record has been received. These parameters
! allow the administrator to specify which standby servers should be
! synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
--- 1081,1092 ----
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is the first <varname>synchronous_standby_num</> matching
! standbys, as specified in <varname>synchronous_standby_names</> on the
! primary, the reply messages from that standby will be used to wake users
! waiting for confirmation that the commit record has been received. These
! parameters allow the administrator to specify which standby servers should
! be synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
***************
*** 1167,1177 **** primary_slot_name = 'node_a_slot'
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first named standby will be used as the synchronous standby. Standbys
! listed after this will take over the role of synchronous standby if the
! first one should fail.
</para>
<para>
--- 1167,1177 ----
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining synchronous standbys. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first <varname>synchronous_standby_num</> named standbys will be used as
! the synchronous standbys. Standbys listed after this will take over the role
! of synchronous standby if the first one should fail.
</para>
<para>
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 5,11 ****
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the sync standby.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
--- 5,11 ----
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the synchronous standbys.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
***************
*** 29,39 ****
* single ordered queue of waiting backends, so that we can avoid
* searching the through all waiters each time we receive a reply.
*
! * In 9.1 we support only a single synchronous standby, chosen from a
! * priority list of synchronous_standby_names. Before it can become the
! * synchronous standby it must have caught up with the primary; that may
! * take some time. Once caught up, the current highest priority standby
! * will release waiters from the queue.
*
* Portions Copyright (c) 2010-2014, PostgreSQL Global Development Group
*
--- 29,50 ----
* single ordered queue of waiting backends, so that we can avoid
* searching the through all waiters each time we receive a reply.
*
! * In 9.4 we support the possibility to have multiple synchronous standbys,
! * whose number is defined by synchronous_standby_num, chosen from a
! * priority list of synchronous_standby_names. Before one standby can
! * become a synchronous standby it must have caught up with the primary;
! * that may take some time.
! *
! * Waiters will be released from the queue once as many standbys as
! * defined by synchronous_standby_num have caught up with the primary.
! *
! * There are special cases though. If synchronous_standby_num is set to 0,
! * all the nodes are considered as asynchronous and a fast path is taken
! * to leave this portion of the code as soon as possible. If it is set to
! * -1, the process will wait for one node to catch up with the primary only
! * if synchronous_standby_names is non-empty. This is compatible with
! * the behavior defined in 9.1, as -1 is the default value of
! * synchronous_standby_num.
*
* Portions Copyright (c) 2010-2014, PostgreSQL Global Development Group
*
***************
*** 59,67 ****
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
#define SyncStandbysDefined() \
! (SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
static bool announce_next_takeover = true;
--- 70,87 ----
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+ int synchronous_standby_num = -1;
+ /*
+ * Synchronous standbys are defined if at least one synchronous
+ * standby is wanted. In the default case (-1), the list of
+ * standby names must additionally be non-empty.
+ */
#define SyncStandbysDefined() \
! (synchronous_standby_num > 0 || \
! (synchronous_standby_num == -1 && \
! SyncRepStandbyNames != NULL && \
! SyncRepStandbyNames[0] != '\0'))
static bool announce_next_takeover = true;
***************
*** 206,212 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
--- 226,232 ----
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
***************
*** 223,229 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
SyncRepCancelWait();
break;
}
--- 243,249 ----
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
SyncRepCancelWait();
break;
}
***************
*** 357,365 **** SyncRepInitConfig(void)
}
}
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standby-releases-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
--- 377,477 ----
}
}
+
+ /*
+ * Obtain a palloc'd array containing positions of standbys currently
+ * considered as synchronous. Caller is responsible for freeing the
+ * data obtained.
+ * Callers of this function should also hold the necessary lock on
+ * SyncRepLock.
+ */
+ int *
+ SyncRepGetSynchronousNodes(int *num_sync)
+ {
+ int *sync_nodes;
+ int priority = 0;
+ int i;
+ int allowed_sync_nodes = synchronous_standby_num;
+
+ /* Initialize */
+ *num_sync = 0;
+
+ /*
+ * Determine the number of nodes that can be synchronized.
+ * synchronous_standby_num can have the special value -1,
+ * meaning that only one node with the highest non-null priority
+ * can be considered as synchronous.
+ */
+ if (synchronous_standby_num == -1)
+ allowed_sync_nodes = 1;
+
+ /*
+ * Make enough room; there is a maximum of max_wal_senders synchronous
+ * nodes as we scan through the WAL senders here.
+ */
+ sync_nodes = (int *) palloc(max_wal_senders * sizeof(int));
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* Use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ /* Process to next if not active */
+ if (walsnd->pid == 0)
+ continue;
+
+ /* Process to next if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Process to next one if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Process to next one if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ *num_sync == allowed_sync_nodes)
+ continue;
+
+ /* Process to next one if flush position is invalid */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now.
+ */
+ if (*num_sync == allowed_sync_nodes)
+ {
+ int j;
+ for (j = 0; j < *num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_nodes[j]];
+ if (walsndloc->sync_standby_priority == priority)
+ {
+ sync_nodes[j] = i;
+ break;
+ }
+ }
+ }
+ else
+ {
+ sync_nodes[*num_sync] = i;
+ (*num_sync)++;
+ }
+
+ /* Update priority for next tracking */
+ priority = walsnd->sync_standby_priority;
+ }
+
+ return sync_nodes;
+ }
+
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standbys-release-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
***************
*** 368,378 **** void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
! int priority = 0;
int i;
/*
* If this WALSender is serving a standby that is not on the list of
--- 480,493 ----
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! int *sync_standbys;
int numwrite = 0;
int numflush = 0;
! int num_sync = 0;
int i;
+ bool found = false;
+ XLogRecPtr min_write_pos;
+ XLogRecPtr min_flush_pos;
/*
* If this WALSender is serving a standby that is not on the list of
***************
*** 388,454 **** SyncRepReleaseWaiters(void)
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standby. If you change this, also
! * change pg_stat_get_wal_senders().
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! for (i = 0; i < max_wal_senders; i++)
{
! /* use volatile pointer to prevent code rearrangement */
! volatile WalSnd *walsnd = &walsndctl->walsnds[i];
!
! if (walsnd->pid != 0 &&
! walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
{
! priority = walsnd->sync_standby_priority;
! syncWalSnd = walsnd;
}
}
/*
! * We should have found ourselves at least.
*/
! Assert(syncWalSnd);
/*
! * If we aren't managing the highest priority standby then just leave.
*/
! if (syncWalSnd != MyWalSnd)
{
LWLockRelease(SyncRepLock);
! announce_next_takeover = true;
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now the sync standby.
*/
if (announce_next_takeover)
{
--- 503,601 ----
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standbys.
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
! /*
! * We should have found ourselves at least, except if it is not expected
! * to find any synchronous nodes.
! */
! Assert(num_sync > 0);
!
! /*
! * If we aren't managing one of the standbys with highest priority
! * then just leave.
! */
! for (i = 0; i < num_sync; i++)
{
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
! if (walsndloc == MyWalSnd)
{
! found = true;
! break;
}
}
/*
! * We are definitely not one of the chosen standbys... but we could
! * be at the next takeover.
*/
! if (!found)
! {
! LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
! announce_next_takeover = true;
! return;
! }
/*
! * Even if we are one of the chosen standbys, leave if there
! * are fewer synchronous standbys in waiting state than what is
! * expected by the user.
*/
! if (num_sync < synchronous_standby_num &&
! synchronous_standby_num != -1)
{
LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location, of course only if all the standbys found as synchronous
! * have already reached that point, so first find what are the oldest
! * write and flush positions of all the standbys considered in sync...
*/
! min_write_pos = MyWalSnd->write;
! min_flush_pos = MyWalSnd->flush;
! for (i = 0; i < num_sync; i++)
! {
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
!
! SpinLockAcquire(&walsndloc->mutex);
! if (min_write_pos > walsndloc->write)
! min_write_pos = walsndloc->write;
! if (min_flush_pos > walsndloc->flush)
! min_flush_pos = walsndloc->flush;
! SpinLockRelease(&walsndloc->mutex);
! }
!
! /* ... And now update if necessary */
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < min_write_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = min_write_pos;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < min_flush_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = min_flush_pos;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_WRITE] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_WRITE],
! numflush, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_FLUSH]);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now a sync standby.
*/
if (announce_next_takeover)
{
***************
*** 457,462 **** SyncRepReleaseWaiters(void)
--- 604,612 ----
(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
application_name, MyWalSnd->sync_standby_priority)));
}
+
+ /* Clean up */
+ pfree(sync_standbys);
}
/*
***************
*** 483,488 **** SyncRepGetStandbyPriority(void)
--- 633,642 ----
if (am_cascading_walsender)
return 0;
+ /* If no synchronous nodes allowed, no cake for this WAL sender */
+ if (synchronous_standby_num == 0)
+ return 0;
+
/* Need a modifiable copy of string */
rawstring = pstrdup(SyncRepStandbyNames);
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 2735,2742 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int priority = 0;
! int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
--- 2735,2742 ----
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int *sync_standbys;
! int num_sync = 0;
int i;
/* check to see if caller supports us returning a tuplestore */
***************
*** 2767,2802 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby. This code must match the code in SyncRepReleaseWaiters().
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! if (walsnd->pid != 0)
! {
! /*
! * Treat a standby such as a pg_basebackup background process
! * which always returns an invalid flush location, as an
! * asynchronous standby.
! */
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
!
! if (walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
! {
! priority = walsnd->sync_standby_priority;
! sync_standby = i;
! }
! }
}
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
--- 2767,2789 ----
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby.
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
+
+ /* Get first the priorities on each standby as long as we hold a lock */
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
}
+
+ /* Obtain list of synchronous standbys */
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
***************
*** 2858,2872 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
- else if (i == sync_standby)
- values[7] = CStringGetTextDatum("sync");
else
! values[7] = CStringGetTextDatum("potential");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
--- 2845,2876 ----
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
else
! {
! int j;
! bool found = false;
!
! for (j = 0; j < num_sync; j++)
! {
! /* Found that this node is one in sync */
! if (i == sync_standbys[j])
! {
! values[7] = CStringGetTextDatum("sync");
! found = true;
! break;
! }
! }
! if (!found)
! values[7] = CStringGetTextDatum("potential");
! }
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
+
+ /* Cleanup */
pfree(sync_priority);
+ pfree(sync_standbys);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2548,2553 **** static struct config_int ConfigureNamesInt[] =
--- 2548,2563 ----
NULL, NULL, NULL
},
+ {
+ {"synchronous_standby_num", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Number of potential synchronous standbys."),
+ NULL
+ },
+ &synchronous_standby_num,
+ -1, -1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 235,240 ****
--- 235,241 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_standby_num = -1 # number of standby servers using sync rep
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 33,38 ****
--- 33,39 ----
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ extern int synchronous_standby_num;
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
***************
*** 49,54 **** extern void SyncRepUpdateSyncStandbysDefined(void);
--- 50,56 ----
/* called by various procs */
extern int SyncRepWakeQueue(bool all, int mode);
+ extern int *SyncRepGetSynchronousNodes(int *num_sync);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
On 09 August 2014 11:33, Michael Paquier Wrote:
Please find attached a patch to add support of synchronous replication
for multiple standby servers. This is controlled by the addition of a
new GUC parameter called synchronous_standby_num, that makes server
wait for transaction commit on the first N standbys defined in
synchronous_standby_names. The implementation is really straight-
forward, and has just needed a couple of modifications in walsender.c
for pg_stat_get_wal_senders and syncrep.c.
I have just started looking into this patch.
Please find below my first round of observations on the patch:
1. The memory allocated for sync_nodes in the function SyncRepGetSynchronousNodes should be sized by allowed_sync_nodes instead of max_wal_senders, since we will never store more than allowed_sync_nodes sync standbys:
sync_nodes = (int *) palloc(allowed_sync_nodes * sizeof(int));
2. The logic for deciding the highest-priority standbys seems to be incorrect.
Assume s_s_num = 3 and s_s_names = '3,4,2,1', with standby nodes connected in the order 1,2,3,4,5,6,7.
As per the logic in the patch, node 4 with priority 2 will not be added to the list, whereas nodes 1, 2 and 3 will be.
The problem is that the priority tracked for the next iteration is not the highest priority value seen so far, it is just the priority of the last node added to the list. So a node with a larger priority value may still be sitting in the list while we compare against some other, smaller priority. (A standalone sketch of this scenario follows below, after point 3.)
3. Can we optimize the function SyncRepGetSynchronousNodes so that it gets the number of standby nodes from s_s_names itself? With this it would be possible to stop scanning the moment we find the first s_s_num potential standbys.
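To make the scenario in point 2 concrete, here is a minimal standalone sketch of the selection loop (my own illustration, not code from the patch; the priorities are the ones implied by s_s_names = '3,4,2,1' above, everything else is hypothetical). It shows node 4 being skipped even though its priority of 2 beats node 1's priority of 4:

/*
 * Standalone sketch of observation 2 (illustration only, not the patch code).
 * Priorities follow s_s_names = '3,4,2,1' for standbys 1..4; standbys 5..7
 * are asynchronous.  Everything else here is hypothetical.
 */
#include <stdio.h>

#define NUM_STANDBYS    7
#define ALLOWED_SYNC    3       /* s_s_num = 3 */

int
main(void)
{
    int     prio[NUM_STANDBYS] = {4, 3, 1, 2, 0, 0, 0};
    int     sync_nodes[ALLOWED_SYNC];
    int     num_sync = 0;
    int     priority = 0;   /* priority of the last node added, as in the patch */
    int     i, j;

    for (i = 0; i < NUM_STANDBYS; i++)
    {
        /* skip asynchronous standbys */
        if (prio[i] == 0)
            continue;

        /* same skip condition as in SyncRepGetSynchronousNodes */
        if (priority != 0 && priority <= prio[i] && num_sync == ALLOWED_SYNC)
            continue;

        if (num_sync == ALLOWED_SYNC)
        {
            /* list full: evict the node carrying the tracked priority */
            for (j = 0; j < num_sync; j++)
            {
                if (prio[sync_nodes[j]] == priority)
                {
                    sync_nodes[j] = i;
                    break;
                }
            }
        }
        else
            sync_nodes[num_sync++] = i;

        priority = prio[i];     /* bug: not necessarily the current maximum */
    }

    /* Prints standbys 1, 2 and 3; standby 4 (priority 2) never evicts 1. */
    for (j = 0; j < num_sync; j++)
        printf("sync standby %d with priority %d\n",
               sync_nodes[j] + 1, prio[sync_nodes[j]]);
    return 0;
}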
Thanks and Regards,
Kumar Rajeev Rastogi
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Aug 22, 2014 at 7:14 PM, Rajeev rastogi <rajeev.rastogi@huawei.com>
wrote:
I have just started looking into this patch.
Please find below my first round of observations on the patch:
Thanks! Updated patch attached.
1. The memory allocated for sync_nodes in the function
SyncRepGetSynchronousNodes should be sized by allowed_sync_nodes
instead of max_wal_senders, since we will never store more than
allowed_sync_nodes sync standbys:
sync_nodes = (int *) palloc(allowed_sync_nodes *
sizeof(int));
Fixed.
2. The logic for deciding the highest-priority standbys seems to be incorrect.
Assume s_s_num = 3 and s_s_names = '3,4,2,1', with standby nodes connected
in the order 1,2,3,4,5,6,7. As per the logic in the patch, node 4 with
priority 2 will not be added to the list, whereas nodes 1, 2 and 3 will be.
The problem is that the priority tracked for the next iteration is not the
highest priority value seen so far, it is just the priority of the last node
added to the list. So a node with a larger priority value may still be
sitting in the list while we compare against some other, smaller priority.
Fixed. Nice catch!
3. Can we optimize the function SyncRepGetSynchronousNodes so that it gets
the number of standby nodes from s_s_names itself? With this it would be
possible to stop scanning the moment we find the first s_s_num potential
standbys.
By doing so, we would need to scan the WAL sender array more than once (or
only once if we could find N sync nodes with a name matching the first entry,
something unlikely to happen). We would also need to recalculate, for a given
item in the s_s_names list, what its priority is and compare it with the
existing entries in the WAL sender list. So this is not worth it.
Also, using the priority instead of s_s_names is more solid, as s_s_names is
now used only in SyncRepGetStandbyPriority to calculate the priority for a
given WAL sender, a function that is only called by a WAL sender itself when
it initializes.
Regards,
--
Michael
Attachments:
20140822_multi_syncrep_v6.patch (text/x-patch; charset=US-ASCII)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2586,2597 **** include_dir 'conf.d'
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
! (as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
--- 2586,2598 ----
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
! At any one time there will be at most a number of active synchronous
! standbys defined by <xref linkend="guc-synchronous-standby-num">;
! transactions waiting for commit will be allowed to proceed after those standby
! servers confirm receipt of their data. The synchronous standbys will be
! the first entries named in this list that are both currently connected
! and streaming data in real-time (as shown by a state of
! <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
***************
*** 2627,2632 **** include_dir 'conf.d'
--- 2628,2685 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+ <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>.
+ </para>
+ <para>
+ The default value is <literal>-1</>. In this case, if
+ <xref linkend="guc-synchronous-standby-names"> is empty, all the
+ standby nodes are considered asynchronous. If at least one node
+ name is defined, the server will wait for one synchronous standby
+ among those listed.
+ </para>
+ <para>
+ When this parameter is set to <literal>0</>, all the standby
+ nodes will be considered as asynchronous.
+ </para>
+ <para>
+ This parameter value cannot be higher than
+ <xref linkend="guc-max-wal-senders">.
+ </para>
+ <para>
+ The first elements of
+ <xref linkend="guc-synchronous-standby-names">, up to
+ <xref linkend="guc-synchronous-standby-num"> of them, that are
+ currently connected are considered as synchronous. If there are more
+ elements than the number of standbys required, all the additional
+ standbys are potential synchronous candidates. If
+ <xref linkend="guc-synchronous-standby-names"> is empty, all the
+ standbys are asynchronous. If it is set to the special entry
+ <literal>*</>, a number of standbys equal to
+ <xref linkend="guc-synchronous-standby-num"> with the highest
+ priority are elected as being synchronous.
+ </para>
+ <para>
+ The server will wait for commit confirmation from
+ <xref linkend="guc-synchronous-standby-num"> standbys, meaning that
+ if <xref linkend="guc-synchronous-standby-names"> has fewer elements
+ than the number of standbys required, the server will wait
+ indefinitely for a commit confirmation.
+ </para>
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
<indexterm>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1081,1092 **** primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is the first matching standby, as specified in
! <varname>synchronous_standby_names</> on the primary, the reply
! messages from that standby will be used to wake users waiting for
! confirmation that the commit record has been received. These parameters
! allow the administrator to specify which standby servers should be
! synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
--- 1081,1092 ----
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is among the first <varname>synchronous_standby_num</> matching
! standbys, as specified in <varname>synchronous_standby_names</> on the
! primary, the reply messages from that standby will be used to wake users
! waiting for confirmation that the commit record has been received. These
! parameters allow the administrator to specify which standby servers should
! be synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
***************
*** 1167,1177 **** primary_slot_name = 'node_a_slot'
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first named standby will be used as the synchronous standby. Standbys
! listed after this will take over the role of synchronous standby if the
! first one should fail.
</para>
<para>
--- 1167,1177 ----
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining synchronous standbys. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first <varname>synchronous_standby_num</> named standbys will be used as
! the synchronous standbys. Standbys listed after these will take over the
! role of synchronous standby if one of them should fail.
</para>
<para>
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 5,11 ****
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the sync standby.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
--- 5,11 ----
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the synchronous standbys.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
***************
*** 29,39 ****
* single ordered queue of waiting backends, so that we can avoid
* searching the through all waiters each time we receive a reply.
*
! * In 9.1 we support only a single synchronous standby, chosen from a
! * priority list of synchronous_standby_names. Before it can become the
! * synchronous standby it must have caught up with the primary; that may
! * take some time. Once caught up, the current highest priority standby
! * will release waiters from the queue.
*
* Portions Copyright (c) 2010-2014, PostgreSQL Global Development Group
*
--- 29,50 ----
* single ordered queue of waiting backends, so that we can avoid
* searching the through all waiters each time we receive a reply.
*
! * In 9.4 we support the possibility to have multiple synchronous standbys,
! * whose number is defined by synchronous_standby_num, chosen from a
! * priority list of synchronous_standby_names. Before one standby can
! * become a synchronous standby it must have caught up with the primary;
! * that may take some time.
! *
! * Waiters will be released from the queue once as many standbys as
! * defined by synchronous_standby_num have caught up with the primary.
! *
! * There are special cases though. If synchronous_standby_num is set to 0,
! * all the nodes are considered as asynchronous and a fast path is taken
! * to leave this portion of the code as soon as possible. If it is set to
! * -1, the process will wait for one node to catch up with the primary only
! * if synchronous_standby_names is non-empty. This is compatible with
! * the behavior defined in 9.1, as -1 is the default value of
! * synchronous_standby_num.
*
* Portions Copyright (c) 2010-2014, PostgreSQL Global Development Group
*
***************
*** 59,67 ****
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
#define SyncStandbysDefined() \
! (SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
static bool announce_next_takeover = true;
--- 70,87 ----
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+ int synchronous_standby_num = -1;
+ /*
+ * Synchronous standbys are defined if at least one synchronous
+ * standby is wanted. In the default case (-1), the list of
+ * standby names must additionally be non-empty.
+ */
#define SyncStandbysDefined() \
! (synchronous_standby_num > 0 || \
! (synchronous_standby_num == -1 && \
! SyncRepStandbyNames != NULL && \
! SyncRepStandbyNames[0] != '\0'))
static bool announce_next_takeover = true;
***************
*** 206,212 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
--- 226,232 ----
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
***************
*** 223,229 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
SyncRepCancelWait();
break;
}
--- 243,249 ----
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
SyncRepCancelWait();
break;
}
***************
*** 357,365 **** SyncRepInitConfig(void)
}
}
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standby-releases-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
--- 377,483 ----
}
}
+
+ /*
+ * Obtain a palloc'd array containing positions of standbys currently
+ * considered as synchronous. Caller is responsible for freeing the
+ * data obtained and should also hold the necessary lock on SyncRepLock.
+ */
+ int *
+ SyncRepGetSynchronousNodes(int *num_sync)
+ {
+ int *sync_nodes;
+ int priority = 0;
+ int i;
+ int allowed_sync_nodes = synchronous_standby_num;
+
+ /* Initialize */
+ *num_sync = 0;
+
+ /* Leave if no synchronous nodes allowed */
+ if (synchronous_standby_num == 0)
+ return NULL;
+
+ /*
+ * Determine the number of nodes that can be synchronized.
+ * synchronous_standby_num can have the special value -1,
+ * meaning that only one node with the highest non-null priority
+ * can be considered as synchronous.
+ */
+ if (synchronous_standby_num == -1)
+ allowed_sync_nodes = 1;
+
+ /*
+ * Make enough room; at most allowed_sync_nodes synchronous nodes
+ * can be chosen as we scan through the WAL senders here.
+ */
+ sync_nodes = (int *) palloc(allowed_sync_nodes * sizeof(int));
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* Use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+ /* Process to next if not active */
+ if (walsnd->pid == 0)
+ continue;
+
+ /* Process to next if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Process to next one if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Process to next one if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ *num_sync == allowed_sync_nodes)
+ continue;
+
+ /* Process to next one if flush position is invalid */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now.
+ */
+ if (*num_sync == allowed_sync_nodes)
+ {
+ int j;
+ for (j = 0; j < *num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_nodes[j]];
+ if (walsndloc->sync_standby_priority == priority)
+ {
+ sync_nodes[j] = i;
+ break;
+ }
+ }
+ }
+ else
+ {
+ sync_nodes[*num_sync] = i;
+ (*num_sync)++;
+ }
+
+ /*
+ * Update priority for next tracking. This needs to be the highest
+ * priority value in all the existing items.
+ */
+ if (priority < walsnd->sync_standby_priority)
+ priority = walsnd->sync_standby_priority;
+ }
+
+ return sync_nodes;
+ }
+
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standbys-release-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
***************
*** 368,378 **** void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
! int priority = 0;
int i;
/*
* If this WALSender is serving a standby that is not on the list of
--- 486,499 ----
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! int *sync_standbys;
int numwrite = 0;
int numflush = 0;
! int num_sync = 0;
int i;
+ bool found = false;
+ XLogRecPtr min_write_pos;
+ XLogRecPtr min_flush_pos;
/*
* If this WALSender is serving a standby that is not on the list of
***************
*** 388,454 **** SyncRepReleaseWaiters(void)
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standby. If you change this, also
! * change pg_stat_get_wal_senders().
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! for (i = 0; i < max_wal_senders; i++)
{
! /* use volatile pointer to prevent code rearrangement */
! volatile WalSnd *walsnd = &walsndctl->walsnds[i];
!
! if (walsnd->pid != 0 &&
! walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
{
! priority = walsnd->sync_standby_priority;
! syncWalSnd = walsnd;
}
}
/*
! * We should have found ourselves at least.
*/
! Assert(syncWalSnd);
/*
! * If we aren't managing the highest priority standby then just leave.
*/
! if (syncWalSnd != MyWalSnd)
{
LWLockRelease(SyncRepLock);
! announce_next_takeover = true;
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now the sync standby.
*/
if (announce_next_takeover)
{
--- 509,607 ----
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standbys.
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
! /*
! * We should have found ourselves at least, except if it is not expected
! * to find any synchronous nodes.
! */
! Assert(num_sync > 0);
!
! /*
! * If we aren't managing one of the standbys with highest priority
! * then just leave.
! */
! for (i = 0; i < num_sync; i++)
{
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
! if (walsndloc == MyWalSnd)
{
! found = true;
! break;
}
}
/*
! * We are definitely not one of the chosen standbys... but we could
! * be at the next takeover.
*/
! if (!found)
! {
! LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
! announce_next_takeover = true;
! return;
! }
/*
! * Even if we are one of the chosen standbys, leave if there
! * are fewer synchronous standbys in waiting state than what is
! * expected by the user.
*/
! if (num_sync < synchronous_standby_num &&
! synchronous_standby_num != -1)
{
LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location, of course only if all the standbys found as synchronous
! * have already reached that point, so first find what are the oldest
! * write and flush positions of all the standbys considered in sync...
*/
! min_write_pos = MyWalSnd->write;
! min_flush_pos = MyWalSnd->flush;
! for (i = 0; i < num_sync; i++)
! {
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
!
! SpinLockAcquire(&walsndloc->mutex);
! if (min_write_pos > walsndloc->write)
! min_write_pos = walsndloc->write;
! if (min_flush_pos > walsndloc->flush)
! min_flush_pos = walsndloc->flush;
! SpinLockRelease(&walsndloc->mutex);
! }
!
! /* ... And now update if necessary */
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < min_write_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = min_write_pos;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < min_flush_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = min_flush_pos;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_WRITE] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_WRITE],
! numflush, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_FLUSH]);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now a sync standby.
*/
if (announce_next_takeover)
{
***************
*** 457,462 **** SyncRepReleaseWaiters(void)
--- 610,618 ----
(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
application_name, MyWalSnd->sync_standby_priority)));
}
+
+ /* Clean up */
+ pfree(sync_standbys);
}
/*
***************
*** 483,488 **** SyncRepGetStandbyPriority(void)
--- 639,648 ----
if (am_cascading_walsender)
return 0;
+ /* If no synchronous nodes allowed, no cake for this WAL sender */
+ if (synchronous_standby_num == 0)
+ return 0;
+
/* Need a modifiable copy of string */
rawstring = pstrdup(SyncRepStandbyNames);
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 2735,2742 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int priority = 0;
! int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
--- 2735,2742 ----
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int *sync_standbys;
! int num_sync = 0;
int i;
/* check to see if caller supports us returning a tuplestore */
***************
*** 2767,2802 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby. This code must match the code in SyncRepReleaseWaiters().
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! if (walsnd->pid != 0)
! {
! /*
! * Treat a standby such as a pg_basebackup background process
! * which always returns an invalid flush location, as an
! * asynchronous standby.
! */
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
!
! if (walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
! {
! priority = walsnd->sync_standby_priority;
! sync_standby = i;
! }
! }
}
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
--- 2767,2789 ----
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby.
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
+
+ /* Get first the priorities on each standby as long as we hold a lock */
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
}
+
+ /* Obtain list of synchronous standbys */
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
***************
*** 2858,2872 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
- else if (i == sync_standby)
- values[7] = CStringGetTextDatum("sync");
else
! values[7] = CStringGetTextDatum("potential");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
--- 2845,2876 ----
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
else
! {
! int j;
! bool found = false;
!
! for (j = 0; j < num_sync; j++)
! {
! /* Found that this node is one in sync */
! if (i == sync_standbys[j])
! {
! values[7] = CStringGetTextDatum("sync");
! found = true;
! break;
! }
! }
! if (!found)
! values[7] = CStringGetTextDatum("potential");
! }
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
+
+ /* Cleanup */
pfree(sync_priority);
+ pfree(sync_standbys);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2548,2553 **** static struct config_int ConfigureNamesInt[] =
--- 2548,2563 ----
NULL, NULL, NULL
},
+ {
+ {"synchronous_standby_num", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Number of potential synchronous standbys."),
+ NULL
+ },
+ &synchronous_standby_num,
+ -1, -1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 235,240 ****
--- 235,241 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+#synchronous_standby_num = -1 # number of standby servers using sync rep
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 33,38 ****
--- 33,39 ----
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ extern int synchronous_standby_num;
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
***************
*** 49,54 **** extern void SyncRepUpdateSyncStandbysDefined(void);
--- 50,56 ----
/* called by various procs */
extern int SyncRepWakeQueue(bool all, int mode);
+ extern int *SyncRepGetSynchronousNodes(int *num_sync);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
On Fri, Aug 22, 2014 at 11:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
2. The logic for deciding the highest-priority standbys seems to be incorrect.
Assume s_s_num = 3 and s_s_names = '3,4,2,1', with standby nodes connected in
the order 1,2,3,4,5,6,7. As per the logic in the patch, node 4 with priority 2
will not be added to the list, whereas nodes 1, 2 and 3 will be. The problem
is that the priority tracked for the next iteration is not the highest
priority value seen so far, it is just the priority of the last node added to
the list. So a node with a larger priority value may still be sitting in the
list while we compare against some other, smaller priority.
Fixed. Nice catch!
Actually by re-reading the code I wrote yesterday I found that the fix
in v6 for that is not correct. That's really fixed with v7 attached.
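To see why (illustrative numbers only, not taken from this thread): with
s_s_num = 2 and WAL sender slots carrying priorities 3, 4, 2, 1 in scan order,
the v6 loop first selects the nodes with priorities 3 and 4 and tracks
priority = 4. The node with priority 2 then correctly evicts the one with
priority 4, but the tracked value is a running maximum and stays at 4; when
the node with priority 1 is scanned, the eviction loop looks for a list entry
carrying priority 4, finds none, and the better candidate is silently dropped,
leaving {3, 2} instead of {2, 1}. v7 therefore recomputes the highest priority
value over the current list after each addition or eviction.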
Regards,
--
Michael
Attachments:
20140823_multi_syncrep_v7.patch (text/x-patch; charset=US-ASCII)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f23e5dc..d085f48 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2586,12 +2586,13 @@ include_dir 'conf.d'
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
<xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this list
- that is both currently connected and streaming data in real-time
- (as shown by a state of <literal>streaming</literal> in the
+ At any one time there will be at most a number of active synchronous
+ standbys defined by <xref linkend="guc-synchronous-standby-num">;
+ transactions waiting for commit will be allowed to proceed after those standby
+ servers confirm receipt of their data. The synchronous standbys will be
+ the first entries named in this list that are both currently connected
+ and streaming data in real-time (as shown by a state of
+ <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
@@ -2627,6 +2628,58 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+ <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>.
+ </para>
+ <para>
+ The default value is <literal>-1</>. In this case, if
+ <xref linkend="guc-synchronous-standby-names"> is empty, all the
+ standby nodes are considered asynchronous. If at least one node
+ name is defined, the server will wait for one synchronous standby
+ among those listed.
+ </para>
+ <para>
+ When this parameter is set to <literal>0</>, all the standby
+ nodes will be considered as asynchronous.
+ </para>
+ <para>
+ This parameter value cannot be higher than
+ <xref linkend="guc-max-wal-senders">.
+ </para>
+ <para>
+ The first elements of
+ <xref linkend="guc-synchronous-standby-names">, up to
+ <xref linkend="guc-synchronous-standby-num"> of them, that are
+ currently connected are considered as synchronous. If there are more
+ elements than the number of standbys required, all the additional
+ standbys are potential synchronous candidates. If
+ <xref linkend="guc-synchronous-standby-names"> is empty, all the
+ standbys are asynchronous. If it is set to the special entry
+ <literal>*</>, a number of standbys equal to
+ <xref linkend="guc-synchronous-standby-num"> with the highest
+ priority are elected as being synchronous.
+ </para>
+ <para>
+ The server will wait for commit confirmation from
+ <xref linkend="guc-synchronous-standby-num"> standbys, meaning that
+ if <xref linkend="guc-synchronous-standby-names"> has fewer elements
+ than the number of standbys required, the server will wait
+ indefinitely for a commit confirmation.
+ </para>
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d249959..ec0ea70 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1081,12 +1081,12 @@ primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
- If the standby is the first matching standby, as specified in
- <varname>synchronous_standby_names</> on the primary, the reply
- messages from that standby will be used to wake users waiting for
- confirmation that the commit record has been received. These parameters
- allow the administrator to specify which standby servers should be
- synchronous standbys. Note that the configuration of synchronous
+ If the standby is among the first <varname>synchronous_standby_num</> matching
+ standbys, as specified in <varname>synchronous_standby_names</> on the
+ primary, the reply messages from that standby will be used to wake users
+ waiting for confirmation that the commit record has been received. These
+ parameters allow the administrator to specify which standby servers should
+ be synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
@@ -1167,11 +1167,11 @@ primary_slot_name = 'node_a_slot'
<para>
The best solution for avoiding data loss is to ensure you don't lose
- your last remaining synchronous standby. This can be achieved by naming multiple
+ your last remaining synchronous standbys. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
- The first named standby will be used as the synchronous standby. Standbys
- listed after this will take over the role of synchronous standby if the
- first one should fail.
+ The first <varname>synchronous_standby_num</> named standbys will be used as
+ the synchronous standbys. Standbys listed after these will take over the
+ role of synchronous standby if one of them should fail.
</para>
<para>
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index aa54bfb..ddfd36b 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -5,7 +5,7 @@
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
- * acknowledged by the sync standby.
+ * acknowledged by the synchronous standbys.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
@@ -29,11 +29,22 @@
* single ordered queue of waiting backends, so that we can avoid
* searching the through all waiters each time we receive a reply.
*
- * In 9.1 we support only a single synchronous standby, chosen from a
- * priority list of synchronous_standby_names. Before it can become the
- * synchronous standby it must have caught up with the primary; that may
- * take some time. Once caught up, the current highest priority standby
- * will release waiters from the queue.
+ * In 9.4 we support the possibility to have multiple synchronous standbys,
+ * whose number is defined by synchronous_standby_num, chosen from a
+ * priority list of synchronous_standby_names. Before one standby can
+ * become a synchronous standby it must have caught up with the primary;
+ * that may take some time.
+ *
+ * Waiters will be released from the queue once as many standbys as
+ * defined by synchronous_standby_num have caught up with the primary.
+ *
+ * There are special cases though. If synchronous_standby_num is set to 0,
+ * all the nodes are considered as asynchronous and a fast path is taken
+ * to leave this portion of the code as soon as possible. If it is set to
+ * -1, the process will wait for one node to catch up with the primary only
+ * if synchronous_standby_names is non-empty. This is compatible with
+ * the behavior defined in 9.1, as -1 is the default value of
+ * synchronous_standby_num.
*
* Portions Copyright (c) 2010-2014, PostgreSQL Global Development Group
*
@@ -59,9 +70,18 @@
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+int synchronous_standby_num = -1;
+/*
+ * Synchronous standbys are defined if at least one synchronous
+ * standby is wanted. In the default case (-1), the list of
+ * standby names must additionally be non-empty.
+ */
#define SyncStandbysDefined() \
- (SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
+ (synchronous_standby_num > 0 || \
+ (synchronous_standby_num == -1 && \
+ SyncRepStandbyNames != NULL && \
+ SyncRepStandbyNames[0] != '\0'))
static bool announce_next_takeover = true;
@@ -206,7 +226,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
@@ -223,7 +243,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
- errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+ errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
SyncRepCancelWait();
break;
}
@@ -357,9 +377,117 @@ SyncRepInitConfig(void)
}
}
+
+/*
+ * Obtain a palloc'd array containing positions of standbys currently
+ * considered as synchronous. Caller is responsible for freeing the
+ * data obtained and should also hold the necessary lock on SyncRepLock.
+ */
+int *
+SyncRepGetSynchronousNodes(int *num_sync)
+{
+ int *sync_nodes;
+ int priority = 0;
+ int i;
+ int allowed_sync_nodes = synchronous_standby_num;
+
+ /* Initialize */
+ *num_sync = 0;
+
+ /* Leave if no synchronous nodes allowed */
+ if (synchronous_standby_num == 0)
+ return NULL;
+
+ /*
+ * Determine the number of nodes that can be synchronized.
+ * synchronous_standby_num can have the special value -1,
+ * meaning that only one node with the highest non-null priority
+ * can be considered as synchronous.
+ */
+ if (synchronous_standby_num == -1)
+ allowed_sync_nodes = 1;
+
+ /*
+ * Make enough room, there is a maximum of max_wal_senders synchronous
+ * nodes as we scan through WAL senders here.
+ */
+ sync_nodes = (int *) palloc(allowed_sync_nodes * sizeof(int));
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* Use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+ int j;
+
+ /* Process to next if not active */
+ if (walsnd->pid == 0)
+ continue;
+
+ /* Process to next if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Process to next one if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Process to next one if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ *num_sync == allowed_sync_nodes)
+ continue;
+
+ /* Process to next one if flush position is invalid */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now.
+ */
+ if (*num_sync == allowed_sync_nodes)
+ {
+ for (j = 0; j < *num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_nodes[j]];
+ if (walsndloc->sync_standby_priority == priority)
+ {
+ sync_nodes[j] = i;
+ break;
+ }
+ }
+ }
+ else
+ {
+ sync_nodes[*num_sync] = i;
+ (*num_sync)++;
+ }
+
+ /*
+ * Update priority for next tracking. This needs to be the highest
+ * priority value in all the existing items.
+ */
+ if (*num_sync == 1)
+ priority = walsnd->sync_standby_priority;
+ else
+ {
+ priority = 0;
+ for (j = 0; j < *num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_nodes[j]];
+ if (priority < walsndloc->sync_standby_priority)
+ priority = walsndloc->sync_standby_priority;
+ }
+ }
+ }
+
+ return sync_nodes;
+}
+
/*
* Update the LSNs on each queue based upon our latest state. This
- * implements a simple policy of first-valid-standby-releases-waiter.
+ * implements a simple policy of first-valid-standbys-release-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
@@ -368,11 +496,14 @@ void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
- volatile WalSnd *syncWalSnd = NULL;
+ int *sync_standbys;
int numwrite = 0;
int numflush = 0;
- int priority = 0;
+ int num_sync = 0;
int i;
+ bool found = false;
+ XLogRecPtr min_write_pos;
+ XLogRecPtr min_flush_pos;
/*
* If this WALSender is serving a standby that is not on the list of
@@ -388,67 +519,99 @@ SyncRepReleaseWaiters(void)
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
- * then we use the first mentioned standby. If you change this, also
- * change pg_stat_get_wal_senders().
+ * then we use the first mentioned standbys.
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
- for (i = 0; i < max_wal_senders; i++)
+ /*
+ * We should have found ourselves at least, except if it is not expected
+ * to find any synchronous nodes.
+ */
+ Assert(num_sync > 0);
+
+ /*
+ * If we aren't managing one of the standbys with highest priority
+ * then just leave.
+ */
+ for (i = 0; i < num_sync; i++)
{
- /* use volatile pointer to prevent code rearrangement */
- volatile WalSnd *walsnd = &walsndctl->walsnds[i];
-
- if (walsnd->pid != 0 &&
- walsnd->state == WALSNDSTATE_STREAMING &&
- walsnd->sync_standby_priority > 0 &&
- (priority == 0 ||
- priority > walsnd->sync_standby_priority) &&
- !XLogRecPtrIsInvalid(walsnd->flush))
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
+ if (walsndloc == MyWalSnd)
{
- priority = walsnd->sync_standby_priority;
- syncWalSnd = walsnd;
+ found = true;
+ break;
}
}
/*
- * We should have found ourselves at least.
+ * We are definitely not one of the chosen... But we could be
+ * at the next takeover.
*/
- Assert(syncWalSnd);
+ if (!found)
+ {
+ LWLockRelease(SyncRepLock);
+ pfree(sync_standbys);
+ announce_next_takeover = true;
+ return;
+ }
/*
- * If we aren't managing the highest priority standby then just leave.
+ * Even if we are one of the chosen standbys, leave if there
+ * are less synchronous standbys in waiting state than what is
+ * expected by the user.
*/
- if (syncWalSnd != MyWalSnd)
+ if (num_sync < synchronous_standby_num &&
+ synchronous_standby_num != -1)
{
LWLockRelease(SyncRepLock);
- announce_next_takeover = true;
+ pfree(sync_standbys);
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
- * this location.
+ * this location, of course only if all the standbys found as synchronous
+ * have already reached that point, so first find what are the oldest
+ * write and flush positions of all the standbys considered in sync...
*/
- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
+ min_write_pos = MyWalSnd->write;
+ min_flush_pos = MyWalSnd->flush;
+ for (i = 0; i < num_sync; i++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
+
+ SpinLockAcquire(&walsndloc->mutex);
+ if (min_write_pos > walsndloc->write)
+ min_write_pos = walsndloc->write;
+ if (min_flush_pos > walsndloc->flush)
+ min_flush_pos = walsndloc->flush;
+ SpinLockRelease(&walsndloc->mutex);
+ }
+
+ /* ... And now update if necessary */
+ if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < min_write_pos)
{
- walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
+ walsndctl->lsn[SYNC_REP_WAIT_WRITE] = min_write_pos;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < min_flush_pos)
{
- walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
+ walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = min_flush_pos;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
- numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
- numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+ numwrite, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_WRITE] >> 32),
+ (uint32) walsndctl->lsn[SYNC_REP_WAIT_WRITE],
+ numflush, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] >> 32),
+ (uint32) walsndctl->lsn[SYNC_REP_WAIT_FLUSH]);
/*
* If we are managing the highest priority standby, though we weren't
- * prior to this, then announce we are now the sync standby.
+ * prior to this, then announce we are now a sync standby.
*/
if (announce_next_takeover)
{
@@ -457,6 +620,9 @@ SyncRepReleaseWaiters(void)
(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
application_name, MyWalSnd->sync_standby_priority)));
}
+
+ /* Clean up */
+ pfree(sync_standbys);
}
/*
@@ -483,6 +649,10 @@ SyncRepGetStandbyPriority(void)
if (am_cascading_walsender)
return 0;
+ /* If no synchronous nodes allowed, no cake for this WAL sender */
+ if (synchronous_standby_num == 0)
+ return 0;
+
/* Need a modifiable copy of string */
rawstring = pstrdup(SyncRepStandbyNames);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 844a5de..0a918c7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2735,8 +2735,8 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
- int priority = 0;
- int sync_standby = -1;
+ int *sync_standbys;
+ int num_sync = 0;
int i;
/* check to see if caller supports us returning a tuplestore */
@@ -2767,36 +2767,23 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
- * standby. This code must match the code in SyncRepReleaseWaiters().
+ * standby.
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
+
+ /* Get first the priorities on each standby as long as we hold a lock */
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
- if (walsnd->pid != 0)
- {
- /*
- * Treat a standby such as a pg_basebackup background process
- * which always returns an invalid flush location, as an
- * asynchronous standby.
- */
- sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
- 0 : walsnd->sync_standby_priority;
-
- if (walsnd->state == WALSNDSTATE_STREAMING &&
- walsnd->sync_standby_priority > 0 &&
- (priority == 0 ||
- priority > walsnd->sync_standby_priority) &&
- !XLogRecPtrIsInvalid(walsnd->flush))
- {
- priority = walsnd->sync_standby_priority;
- sync_standby = i;
- }
- }
+ sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
+ 0 : walsnd->sync_standby_priority;
}
+
+ /* Obtain list of synchronous standbys */
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
@@ -2858,15 +2845,32 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
- else if (i == sync_standby)
- values[7] = CStringGetTextDatum("sync");
else
- values[7] = CStringGetTextDatum("potential");
+ {
+ int j;
+ bool found = false;
+
+ for (j = 0; j < num_sync; j++)
+ {
+ /* Found that this node is one in sync */
+ if (i == sync_standbys[j])
+ {
+ values[7] = CStringGetTextDatum("sync");
+ found = true;
+ break;
+ }
+ }
+ if (!found)
+ values[7] = CStringGetTextDatum("potential");
+ }
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
+
+ /* Cleanup */
pfree(sync_priority);
+ pfree(sync_standbys);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a8a17c2..307cb68 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2548,6 +2548,16 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"synchronous_standby_num", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Number of potential synchronous standbys."),
+ NULL
+ },
+ &synchronous_standby_num,
+ -1, -1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index df98b02..5c1e27c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -235,6 +235,7 @@
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+#synchronous_standby_num = -1 # number of standbys servers using sync rep
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 7eeaf3b..9f05ba9 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -33,6 +33,7 @@
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+extern int synchronous_standby_num;
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
@@ -49,6 +50,7 @@ extern void SyncRepUpdateSyncStandbysDefined(void);
/* called by various procs */
extern int SyncRepWakeQueue(bool all, int mode);
+extern int *SyncRepGetSynchronousNodes(int *num_sync);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
On 23 August 2014 11:22, Michael Paquier wrote:
2. Logic of deciding the highest priority one seems to be incorrect.
Assume s_s_num = 3, s_s_names = 3,4,2,1, and
standby nodes are in order as: 1,2,3,4,5,6,7.
As per the logic in the patch, node 4 with priority 2 will not
be added to the list whereas 1,2,3 will be added.
The problem is that the priority updated for the next tracking is
not the highest priority as of that iteration, it is just the priority of
the last node added to the list. So it may happen that a node with
higher priority is still in the list but we are comparing with some
other smaller priority.
Fixed. Nice catch!
Actually by re-reading the code I wrote yesterday I found that the fix
in v6 for that is not correct. That's really fixed with v7 attached.
I have done some more review, below are my comments:
1. There are currently two loops on *num_sync. Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop?
if (*num_sync == allowed_sync_nodes)
{
for (j = 0; j < *num_sync; j++)
{
Anyway we require priority only if *num_sync == allowed_sync_nodes condition matches.
So in this loop itself, we can calculate the priority as well as the assignment of new standbys with lower priority.
Let me know if you see any issue with this.
2. Comment inside the function SyncRepReleaseWaiters,
/*
* We should have found ourselves at least, except if it is not expected
* to find any synchronous nodes.
*/
Assert(num_sync > 0);
I think "except if it is not expected to find any synchronous nodes" is not required.
If it has come to this point, it means at least this node is synchronous.
3. The document says that s_s_num should be less than max_wal_senders but code-wise there is no protection for the same.
IMHO, s_s_num should be less than or equal to max_wal_senders, otherwise COMMIT will never return to the console, without
any knowledge of the user.
I see that some discussion has happened regarding this but I think just adding documentation for this is not enough.
I am not sure what issue is observed in adding a check during GUC initialization, but if there is an unavoidable issue during GUC initialization
then can't we try to add the check at later points?
4. Similarly, the interaction between the parameters s_s_names and s_s_num. I see some discussion has happened regarding this and it is acceptable
to have s_s_num more than s_s_names. But I was thinking we should give at least some notice message to the user for such a case along with
some documentation.
config.sgml
5. "At any one time there will be at a number of active synchronous standbys": this sentence is not proper.
6. When this parameter is set to <literal>0</>, all the standby
nodes will be considered as asynchronous.
Can we make this as
When this parameter is set to <literal>0</>, all the standby
nodes will be considered as asynchronous irrespective of value of synchronous_standby_names.
7. Are considered as synchronous the first elements of
<xref linkend="guc-synchronous-standby-names"> in number of
<xref linkend="guc-synchronous-standby-num"> that are
connected.
Starting of this sentence looks to be incomplete.
high-availability.sgml
8. Standbys listed after this will take over the role
of synchronous standby if the first one should fail.
Should not we make it as:
Standbys listed after this will take over the role
of synchronous standby if any of the first synchronous-standby-num standby fails.
Let me know in case something is not clear.
Thanks and Regards,
Kumar Rajeev Rastogi.
On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi
<rajeev.rastogi@huawei.com> wrote:
I have done some more review, below are my comments:
Thanks!
1. There are currently two loops on *num_sync, Can we simplify the function SyncRepGetSynchronousNodes by moving the priority calculation inside the upper loop
if (*num_sync == allowed_sync_nodes)
{
for (j = 0; j < *num_sync; j++)
{
Anyway we require priority only if *num_sync == allowed_sync_nodes condition matches.
So in this loop itself, we can calculate the priority as well as the assignment of new standbys with lower priority.
Let me know if you see any issue with this.
OK, I see, yes this can minimize the processing a bit so I refactored the
code by integrating the second loop into the first. This needed the
removal of the break portion as we need to find the highest priority
value among the nodes already determined as synchronous.
2. Comment inside the function SyncRepReleaseWaiters,
/*
* We should have found ourselves at least, except if it is not expected
* to find any synchronous nodes.
*/
Assert(num_sync > 0);
I think "except if it is not expected to find any synchronous nodes" is not required.
As if it has come till this point means at least this node is synchronous.
Yes, removed.
3. Document says that s_s_num should be lesser than max_wal_senders but code wise there is no protection for the same.
IMHO, s_s_num should be lesser than or equal to max_wal_senders otherwise COMMIT will never return back the console without
any knowledge of user.
I see that some discussion has happened regarding this but I think just adding documentation for this is not enough.
I am not sure what issue is observed in adding check during GUC initialization but if there is unavoidable issue during GUC initialization then can't we try to add check at later points.
The trick here is that you cannot really return a warning at GUC
loading level to the user as a warning could be easily triggered if
for example s_s_num is present before max_wal_senders in
postgresql.conf. I am open to any solutions if there are any (like an
error when initializing WAL senders?!). Documentation seems enough for
me to warn the user.
4. Similary interaction between parameters s_s_names and s_s_num. I see some discussion has happened regarding this and it is acceptable
to have s_s_num more than s_s_names. But I was thinking should not give atleast some notice message to user for such case along with
some documentation.
Done. I added the following in the paragraph "Server will wait":
Hence it is recommended to not set <varname>synchronous_standby_num</>
to a value higher than the number of elements in
<varname>synchronous_standby_names</>.
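For example, with something like the following (node names made up for the
example), a commit waits for the two highest-priority connected standbys and
the third entry only acts as a potential candidate:
synchronous_standby_names = 'node_a,node_b,node_c'   # priorities 1, 2 and 3
synchronous_standby_num = 2
With one standby connected per name, setting synchronous_standby_num to 4
here would leave COMMIT waiting indefinitely, hence the recommendation above.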
5. "At any one time there will be at a number of active synchronous standbys": this sentence is not proper.
What about that:
"At any one time there can be a number of active synchronous standbys
up to the number defined by <xref
linkend="guc-synchronous-standby-num">"
6. When this parameter is set to <literal>0</>, all the standby
nodes will be considered as asynchronous.
Can we make this as
When this parameter is set to <literal>0</>, all the standby
nodes will be considered as asynchronous irrespective of value of synchronous_standby_names.
Done. This seems proper for the user as we do not care at all about
s_s_names if _num = 0.
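In other words, a configuration like the following (made-up names) behaves
exactly as if synchronous_standby_names were empty:
synchronous_standby_num = 0
synchronous_standby_names = 'node_a,node_b'   # ignored, everything is asynchronous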
7. Are considered as synchronous the first elements of
<xref linkend="guc-synchronous-standby-names"> in number of
<xref linkend="guc-synchronous-standby-num"> that are
connected.
Starting of this sentence looks to be incomplete.
OK, I reworked this part as well. I hope it is clearer.
8. Standbys listed after this will take over the role
of synchronous standby if the first one should fail.
Should not we make it as:
Standbys listed after this will take over the role
of synchronous standby if any of the first synchronous-standby-num standby fails.
Fixed as proposed.
At the same time I found a bug with pg_stat_get_wal_senders caused by a
NULL pointer that was freed when s_s_num = 0. Updated patch addressing
comments is attached. On top of that the documentation has been
reworked a bit by replacing the too-high amount of <xref> blocks by
<varname>, having a link to a given variable specified only once.
Regards,
--
Michael
Attachments:
20140828_multi_syncrep_v8.patchtext/x-diff; charset=US-ASCII; name=20140828_multi_syncrep_v8.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2585,2597 **** include_dir 'conf.d'
<para>
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
! <xref linkend="synchronous-replication">.
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
! (as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
--- 2585,2598 ----
<para>
Specifies a comma-separated list of standby names that can support
<firstterm>synchronous replication</>, as described in
! <xref linkend="synchronous-replication">. At any time there can be
! a number of active synchronous standbys up to the number
! defined by <xref linkend="guc-synchronous-standby-num">, transactions
! waiting for commit will be allowed to proceed after those standby
! servers confirm receipt of their data. The synchronous standbys will be
! the first entries named in this list that are both currently connected
! and streaming data in real-time (as shown by a state of
! <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
Other standby servers appearing later in this list represent potential
***************
*** 2627,2632 **** include_dir 'conf.d'
--- 2628,2688 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-standby-num" xreflabel="synchronous_standby_num">
+ <term><varname>synchronous_standby_num</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>synchronous_standby_num</> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies the number of standbys that support
+ <firstterm>synchronous replication</>.
+ </para>
+ <para>
+ Default value is <literal>-1</>. In this case, if
+ <xref linkend="guc-synchronous-standby-names"> is empty all the
+ standby nodes are considered asynchronous. If there is at least
+ one node name defined, the server will wait for one synchronous
+ standby listed.
+ </para>
+ <para>
+ When this parameter is set to <literal>0</>, all the standby
+ nodes will be considered as asynchronous irrespective of value
+ of <varname>synchronous_standby_names</>.
+ </para>
+ <para>
+ This parameter value cannot be higher than
+ <xref linkend="guc-max-wal-senders">.
+ </para>
+ <para>
+ Up to the first <varname>synchronous_standby_num</>
+ standbys listed in <varname>synchronous_standby_names</>
+ that are connected to a root node at the same time can be
+ synchronous. If there are more elements than the number of standbys
+ required, all the additional standbys are potential synchronous
+ candidates. If <varname>synchronous_standby_names</> is
+ empty, all the standbys are asynchronous. If it is set to the
+ special entry <literal>*</>, a number of standbys up to
+ <varname>synchronous_standby_num</> with the highest
+ priority are elected as being synchronous.
+ </para>
+ <para>
+ Server will wait for commit confirmation from
+ <varname>synchronous_standby_num</> standbys, meaning that
+ if <varname>synchronous_standby_names</> has fewer elements
+ than the number of standbys required, the server will wait indefinitely
+ for a commit confirmation. Hence it is recommended to not set
+ <varname>synchronous_standby_num</> to a value higher than the
+ number of elements in <varname>synchronous_standby_names</>.
+ </para>
+ <para>
+ This parameter can only be set in the <filename>postgresql.conf</>
+ file or on the server command line.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)
<indexterm>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1081,1092 **** primary_slot_name = 'node_a_slot'
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is the first matching standby, as specified in
! <varname>synchronous_standby_names</> on the primary, the reply
! messages from that standby will be used to wake users waiting for
! confirmation that the commit record has been received. These parameters
! allow the administrator to specify which standby servers should be
! synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
--- 1081,1092 ----
WAL record is then sent to the standby. The standby sends reply
messages each time a new batch of WAL data is written to disk, unless
<varname>wal_receiver_status_interval</> is set to zero on the standby.
! If the standby is among the first <xref linkend="guc-synchronous-standby-num">
! matching standbys, as specified in <varname>synchronous_standby_names</>
! on the primary, the reply messages from that standby will be used to wake
! users waiting for confirmation that the commit record has been received.
! These parameters allow the administrator to specify which standby servers
! should be synchronous standbys. Note that the configuration of synchronous
replication is mainly on the master. Named standbys must be directly
connected to the master; the master knows nothing about downstream
standby servers using cascaded replication.
***************
*** 1167,1177 **** primary_slot_name = 'node_a_slot'
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first named standby will be used as the synchronous standby. Standbys
! listed after this will take over the role of synchronous standby if the
! first one should fail.
</para>
<para>
--- 1167,1178 ----
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining synchronous standbys. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
! The first <varname>synchronous_standby_num</> named standbys will be used as
! the synchronous standbys. Standbys listed after this will take over the role
! of synchronous standby if any of the first <varname>synchronous_standby_num</>
! standbys fails.
</para>
<para>
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 5,11 ****
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the sync standby.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
--- 5,11 ----
* Synchronous replication is new as of PostgreSQL 9.1.
*
* If requested, transaction commits wait until their commit LSN is
! * acknowledged by the synchronous standbys.
*
* This module contains the code for waiting and release of backends.
* All code in this module executes on the primary. The core streaming
***************
*** 29,39 ****
* single ordered queue of waiting backends, so that we can avoid
* searching the through all waiters each time we receive a reply.
*
! * In 9.1 we support only a single synchronous standby, chosen from a
! * priority list of synchronous_standby_names. Before it can become the
! * synchronous standby it must have caught up with the primary; that may
! * take some time. Once caught up, the current highest priority standby
! * will release waiters from the queue.
*
* Portions Copyright (c) 2010-2014, PostgreSQL Global Development Group
*
--- 29,50 ----
* single ordered queue of waiting backends, so that we can avoid
* searching the through all waiters each time we receive a reply.
*
! * In 9.4 we support the possibility to have multiple synchronous standbys,
! * whose number is defined by synchronous_standby_num, chosen from a
! * priority list of synchronous_standby_names. Before one standby can
! * become a synchronous standby it must have caught up with the primary;
! * that may take some time.
! *
! * Waiters will be released from the queue once the number of standbys
! * defined by synchronous_standby_num have caught up.
! *
! * There are special cases though. If synchronous_standby_num is set to 0,
! * all the nodes are considered asynchronous and a fast path is taken to
! * leave this portion of the code as soon as possible. If it is set to
! * -1, the process will wait for one node to catch up with the primary only
! * if synchronous_standby_names is non-empty. This is compatible with the
! * behavior defined in 9.1, as -1 is the default value of
! * synchronous_standby_num.
*
* Portions Copyright (c) 2010-2014, PostgreSQL Global Development Group
*
***************
*** 59,67 ****
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
#define SyncStandbysDefined() \
! (SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
static bool announce_next_takeover = true;
--- 70,87 ----
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+ int synchronous_standby_num = -1;
+ /*
+ * Synchronous standbys are defined if there is more than
+ * one synchronous standby wanted. In default case, the list
+ * of standbys defined needs to be not empty.
+ */
#define SyncStandbysDefined() \
! (synchronous_standby_num > 0 || \
! (synchronous_standby_num == -1 && \
! SyncRepStandbyNames != NULL && \
! SyncRepStandbyNames[0] != '\0'))
static bool announce_next_takeover = true;
***************
*** 206,212 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
--- 226,232 ----
ereport(WARNING,
(errcode(ERRCODE_ADMIN_SHUTDOWN),
errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
whereToSendOutput = DestNone;
SyncRepCancelWait();
break;
***************
*** 223,229 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
SyncRepCancelWait();
break;
}
--- 243,249 ----
QueryCancelPending = false;
ereport(WARNING,
(errmsg("canceling wait for synchronous replication due to user request"),
! errdetail("The transaction has already committed locally, but might not have been replicated to the standby(s).")));
SyncRepCancelWait();
break;
}
***************
*** 357,365 **** SyncRepInitConfig(void)
}
}
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standby-releases-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
--- 377,493 ----
}
}
+
+ /*
+ * Obtain a palloc'd array containing positions of standbys currently
+ * considered as synchronous. Caller is responsible for freeing the
+ * data obtained and should as well take a necessary lock on SyncRepLock.
+ */
+ int *
+ SyncRepGetSynchronousNodes(int *num_sync)
+ {
+ int *sync_nodes;
+ int priority = 0;
+ int i;
+ int allowed_sync_nodes = synchronous_standby_num;
+
+ /* Initialize */
+ *num_sync = 0;
+
+ /* Leave if no synchronous nodes allowed */
+ if (synchronous_standby_num == 0)
+ return NULL;
+
+ /*
+ * Determine the number of nodes that can be synchronized.
+ * synchronous_standby_num can have the special value -1,
+ * meaning that only one node with the highest non-null priority
+ * can be considered as synchronous.
+ */
+ if (synchronous_standby_num == -1)
+ allowed_sync_nodes = 1;
+
+ /*
+ * Make enough room, there is a maximum of max_wal_senders synchronous
+ * nodes as we scan through WAL senders here.
+ */
+ sync_nodes = (int *) palloc(allowed_sync_nodes * sizeof(int));
+
+ for (i = 0; i < max_wal_senders; i++)
+ {
+ /* Use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+ int j;
+
+ /* Process to next if not active */
+ if (walsnd->pid == 0)
+ continue;
+
+ /* Process to next if not streaming */
+ if (walsnd->state != WALSNDSTATE_STREAMING)
+ continue;
+
+ /* Process to next one if asynchronous */
+ if (walsnd->sync_standby_priority == 0)
+ continue;
+
+ /* Process to next one if priority conditions not satisfied */
+ if (priority != 0 &&
+ priority <= walsnd->sync_standby_priority &&
+ *num_sync == allowed_sync_nodes)
+ continue;
+
+ /* Process to next one if flush position is invalid */
+ if (XLogRecPtrIsInvalid(walsnd->flush))
+ continue;
+
+ /*
+ * We have a potential synchronous candidate, add it to the
+ * list of nodes already present or evict the node with highest
+ * priority found until now. Track as well the highest priority
+ * value in all the existing items, this helps in determining
+ * what would be a standby to evict from the result array.
+ */
+ if (*num_sync == allowed_sync_nodes)
+ {
+ int new_priority = 0;
+
+ for (j = 0; j < *num_sync; j++)
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_nodes[j]];
+
+ /*
+ * Note that we cannot leave now as we need to still
+ * find what is the highest priority in the set of
+ * synchronous standbys.
+ */
+ if (walsndloc->sync_standby_priority == priority)
+ sync_nodes[j] = i;
+
+ /* Update priority to highest value available */
+ if (new_priority < walsndloc->sync_standby_priority)
+ new_priority = walsndloc->sync_standby_priority;
+ }
+ priority = new_priority;
+ }
+ else
+ {
+ volatile WalSnd *walsndloc = &WalSndCtl->walsnds[i];
+ sync_nodes[*num_sync] = i;
+ (*num_sync)++;
+
+ /* Update priority to highest value available */
+ if (priority < walsndloc->sync_standby_priority)
+ priority = walsndloc->sync_standby_priority;
+ }
+ }
+
+ return sync_nodes;
+ }
+
/*
* Update the LSNs on each queue based upon our latest state. This
! * implements a simple policy of first-valid-standbys-release-waiter.
*
* Other policies are possible, which would change what we do here and what
* perhaps also which information we store as well.
***************
*** 368,378 **** void
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
! int priority = 0;
int i;
/*
* If this WALSender is serving a standby that is not on the list of
--- 496,509 ----
SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
! int *sync_standbys;
int numwrite = 0;
int numflush = 0;
! int num_sync = 0;
int i;
+ bool found = false;
+ XLogRecPtr min_write_pos;
+ XLogRecPtr min_flush_pos;
/*
* If this WALSender is serving a standby that is not on the list of
***************
*** 388,454 **** SyncRepReleaseWaiters(void)
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standby. If you change this, also
! * change pg_stat_get_wal_senders().
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! for (i = 0; i < max_wal_senders; i++)
{
! /* use volatile pointer to prevent code rearrangement */
! volatile WalSnd *walsnd = &walsndctl->walsnds[i];
!
! if (walsnd->pid != 0 &&
! walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
{
! priority = walsnd->sync_standby_priority;
! syncWalSnd = walsnd;
}
}
/*
! * We should have found ourselves at least.
*/
! Assert(syncWalSnd);
/*
! * If we aren't managing the highest priority standby then just leave.
*/
! if (syncWalSnd != MyWalSnd)
{
LWLockRelease(SyncRepLock);
! announce_next_takeover = true;
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now the sync standby.
*/
if (announce_next_takeover)
{
--- 519,614 ----
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
! * then we use the first mentioned standbys.
*/
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
! /* We should have found ourselves at least */
! Assert(num_sync > 0);
!
! /*
! * If we aren't managing one of the standbys with highest priority
! * then just leave.
! */
! for (i = 0; i < num_sync; i++)
{
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
! if (walsndloc == MyWalSnd)
{
! found = true;
! break;
}
}
/*
! * We are definitely not one of the chosen... But we could be
! * at the next takeover.
*/
! if (!found)
! {
! LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
! announce_next_takeover = true;
! return;
! }
/*
! * Even if we are one of the chosen standbys, leave if there
! * are less synchronous standbys in waiting state than what is
! * expected by the user.
*/
! if (num_sync < synchronous_standby_num &&
! synchronous_standby_num != -1)
{
LWLockRelease(SyncRepLock);
! pfree(sync_standbys);
return;
}
/*
* Set the lsn first so that when we wake backends they will release up to
! * this location, of course only if all the standbys found as synchronous
! * have already reached that point, so first find what are the oldest
! * write and flush positions of all the standbys considered in sync...
*/
! min_write_pos = MyWalSnd->write;
! min_flush_pos = MyWalSnd->flush;
! for (i = 0; i < num_sync; i++)
! {
! volatile WalSnd *walsndloc = &WalSndCtl->walsnds[sync_standbys[i]];
!
! SpinLockAcquire(&walsndloc->mutex);
! if (min_write_pos > walsndloc->write)
! min_write_pos = walsndloc->write;
! if (min_flush_pos > walsndloc->flush)
! min_flush_pos = walsndloc->flush;
! SpinLockRelease(&walsndloc->mutex);
! }
!
! /* ... And now update if necessary */
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < min_write_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = min_write_pos;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < min_flush_pos)
{
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = min_flush_pos;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_WRITE] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_WRITE],
! numflush, (uint32) (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] >> 32),
! (uint32) walsndctl->lsn[SYNC_REP_WAIT_FLUSH]);
/*
* If we are managing the highest priority standby, though we weren't
! * prior to this, then announce we are now a sync standby.
*/
if (announce_next_takeover)
{
***************
*** 457,462 **** SyncRepReleaseWaiters(void)
--- 617,625 ----
(errmsg("standby \"%s\" is now the synchronous standby with priority %u",
application_name, MyWalSnd->sync_standby_priority)));
}
+
+ /* Clean up */
+ pfree(sync_standbys);
}
/*
***************
*** 483,488 **** SyncRepGetStandbyPriority(void)
--- 646,655 ----
if (am_cascading_walsender)
return 0;
+ /* If no synchronous nodes allowed, no cake for this WAL sender */
+ if (synchronous_standby_num == 0)
+ return 0;
+
/* Need a modifiable copy of string */
rawstring = pstrdup(SyncRepStandbyNames);
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 2735,2742 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int priority = 0;
! int sync_standby = -1;
int i;
/* check to see if caller supports us returning a tuplestore */
--- 2735,2742 ----
MemoryContext per_query_ctx;
MemoryContext oldcontext;
int *sync_priority;
! int *sync_standbys;
! int num_sync = 0;
int i;
/* check to see if caller supports us returning a tuplestore */
***************
*** 2767,2802 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby. This code must match the code in SyncRepReleaseWaiters().
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! if (walsnd->pid != 0)
! {
! /*
! * Treat a standby such as a pg_basebackup background process
! * which always returns an invalid flush location, as an
! * asynchronous standby.
! */
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
!
! if (walsnd->state == WALSNDSTATE_STREAMING &&
! walsnd->sync_standby_priority > 0 &&
! (priority == 0 ||
! priority > walsnd->sync_standby_priority) &&
! !XLogRecPtrIsInvalid(walsnd->flush))
! {
! priority = walsnd->sync_standby_priority;
! sync_standby = i;
! }
! }
}
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
--- 2767,2789 ----
/*
* Get the priorities of sync standbys all in one go, to minimise lock
* acquisitions and to allow us to evaluate who is the current sync
! * standby.
*/
sync_priority = palloc(sizeof(int) * max_wal_senders);
LWLockAcquire(SyncRepLock, LW_SHARED);
+
+ /* Get first the priorities on each standby as long as we hold a lock */
for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
! sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
! 0 : walsnd->sync_standby_priority;
}
+
+ /* Obtain list of synchronous standbys */
+ sync_standbys = SyncRepGetSynchronousNodes(&num_sync);
LWLockRelease(SyncRepLock);
for (i = 0; i < max_wal_senders; i++)
***************
*** 2858,2872 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
- else if (i == sync_standby)
- values[7] = CStringGetTextDatum("sync");
else
! values[7] = CStringGetTextDatum("potential");
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
pfree(sync_priority);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
--- 2845,2877 ----
*/
if (sync_priority[i] == 0)
values[7] = CStringGetTextDatum("async");
else
! {
! int j;
! bool found = false;
!
! for (j = 0; j < num_sync; j++)
! {
! /* Found that this node is one in sync */
! if (i == sync_standbys[j])
! {
! values[7] = CStringGetTextDatum("sync");
! found = true;
! break;
! }
! }
! if (!found)
! values[7] = CStringGetTextDatum("potential");
! }
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
+
+ /* Cleanup */
pfree(sync_priority);
+ if (sync_standbys)
+ pfree(sync_standbys);
/* clean up and return the tuplestore */
tuplestore_donestoring(tupstore);
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2548,2553 **** static struct config_int ConfigureNamesInt[] =
--- 2548,2563 ----
NULL, NULL, NULL
},
+ {
+ {"synchronous_standby_num", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Number of potential synchronous standbys."),
+ NULL
+ },
+ &synchronous_standby_num,
+ -1, -1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 235,240 ****
--- 235,241 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_standby_num = -1 # number of standbys servers using sync rep
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 33,38 ****
--- 33,39 ----
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ extern int synchronous_standby_num;
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
***************
*** 49,54 **** extern void SyncRepUpdateSyncStandbysDefined(void);
--- 50,56 ----
/* called by various procs */
extern int SyncRepWakeQueue(bool all, int mode);
+ extern int *SyncRepGetSynchronousNodes(int *num_sync);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
On 08/28/2014 10:10 AM, Michael Paquier wrote:
+ #synchronous_standby_num = -1 # number of standbys servers using sync rep
To be honest, that's a horrible name for the GUC. Back when synchronous
replication was implemented, we had looong discussions on this feature.
It was called "quorum commit" back then. I'd suggest using the "quorum"
term in this patch, too, that's a fairly well-known term in distributed
computing for this.
When synchronous replication was added, quorum was left out to keep
things simple; the current feature set was the most we could all agree
on to be useful. If you search the archives for "quorum commit" you'll
see what I mean. There was a lot of confusion on what is possible and
what is useful, but regarding this particular patch: people wanted to be
able to describe more complicated scenarios. For example, imagine that
you have a master and two standbys in the primary data center, and
two more standbys in a different data center. It should be possible to
specify that you must get acknowledgment from at least one standby in
both data centers. Maybe you could hack that by giving the standbys in
the same data center the same name, but it gets ugly, and it still won't
scale to even more complex scenarios.
Maybe that's OK - we don't necessarily need to solve all scenarios at
once. But it's worth considering.
BTW, how does this patch behave if there are multiple standbys connected
with the same name?
- Heikki
On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 08/28/2014 10:10 AM, Michael Paquier wrote:
+ #synchronous_standby_num = -1 # number of standbys servers using sync rep
To be honest, that's a horrible name for the GUC. Back when synchronous
replication was implemented, we had looong discussions on this feature. It
was called "quorum commit" back then. I'd suggest using the "quorum" term in
this patch, too, that's a fairly well-known term in distributed computing
for this.
I am open to any suggestions. Then what about the following parameter names?
- synchronous_quorum_num
- synchronous_standby_quorum
- synchronous_standby_quorum_num
- synchronous_quorum_commit
When synchronous replication was added, quorum was left out to keep things
simple; the current feature set was the most we could all agree on to be
useful. If you search the archives for "quorum commit" you'll see what I
mean. There was a lot of confusion on what is possible and what is useful,
but regarding this particular patch: people wanted to be able to describe
more complicated scenarios. For example, imagine that you have a master and
two standbys in the primary data center, and two more standbys in a
different data center. It should be possible to specify that you must get
acknowledgment from at least one standby in both data centers. Maybe you
could hack that by giving the standbys in the same data center the same
name, but it gets ugly, and it still won't scale to even more complex
scenarios.
Currently two nodes can only have the same priority if they have the
same application_name, so we could for example add a new connstring
parameter called, let's say application_group, to define groups of
nodes that will have the same priority (if a node does not define
application_group, it defaults to application_name, if app_name is
NULL, well we don't care much it cannot be a sync candidate). That's a
first idea that we could use to control groups of nodes. And we could
switch syncrep.c to use application_group in s_s_names instead of
application_name. That would be backward-compatible, and could open
the door for more improvements for quorum commits as we could control
groups of nodes. Well this is a super-set of what application_name
can already do, but there is no problem to identify single nodes of
the same data center and how much they could be late in replication,
so I think that this would be really user-friendly. An idea similar to
that would be a base work for the next thing... See below.
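Just to sketch the idea (application_group being a purely hypothetical
parameter at this point), a standby of the second data center would set
something like this in its recovery.conf:
primary_conninfo = 'host=primary port=5432 application_name=node_b1 application_group=node_center1'
And on the primary, synchronous_standby_names would then list group names
instead of application names, for example 'node_local,node_center1,node_center2'.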
Now, in your case the two nodes on the second data center need to have
the same priority either way. With this patch you can achieve that
with the same node name. Where things are not that cool with this
patch is something like that though:
- 5 slaves: 1 with master (node_local), 2 on a 2nd data center
(node_center1), 2 last on a 3rd data center (node_center2)
- s_s_num = 3
- s_s_names = 'node_local,node_center1,node_center2'
In this case the nodes have the following priority:
- node_local => 1
- the 2 nodes with node_center1 => 2
- the 2 nodes with node_center2 => 3
In this {1,2,2,3,3} schema, the patch makes the system wait for
node_local, and the two nodes in node_center1 without caring about the
ones in node_center2 as it will pick up only the nodes with the
highest priority. If user expects the system to wait for a node in
node_center2 he'll be disappointed. That's perhaps where we could
improve things, by adding an extra parameter able to control the
priority ranks, say synchronous_priority_check:
- [absolute|individual], wait for the first s_s_num nodes having the
lowest priority, in this case we'll wait for {1,2,2}
- group: for only one node in the lowest s_s_num priorities, here
we'll wait for {1,2,3}
Note that we may not even need this parameter if we assume by default
that we wait for only one node in a given group that has the same
priority.
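In terms of configuration that would give roughly the following on the
primary, synchronous_priority_check being only an idea at this point:
synchronous_standby_names = 'node_local,node_center1,node_center2'
synchronous_standby_num = 3
#synchronous_priority_check = absolute   # wait for {1,2,2}
#synchronous_priority_check = group      # wait for {1,2,3}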
Maybe that's OK - we don't necessarily need to solve all scenarios at once.
But it's worth considering.
Parametrizing and coverage of the user expectations are tricky. Either
way not everybody can be happy :) There are even people unhappy now
because we can only define one single sync node.
BTW, how does this patch behave if there are multiple standbys connected
with the same name?
All the nodes have the same priority. For example in the case of a
cluster with 5 slaves having the same application name and s_s_num =3,
the first three nodes when scanning the WAL sender array are expected
to return a COMMIT before committing locally:
=# show synchronous_standby_num ;
synchronous_standby_num
-------------------------
3
(1 row)
=# show synchronous_standby_names ;
synchronous_standby_names
---------------------------
node
(1 row)
=# SELECT application_name, client_port,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority, sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, appl
application_name | client_port | replay_delta | sync_priority | sync_state
------------------+-------------+--------------+---------------+------------
node | 50251 | 0 | 1 | sync
node | 50252 | 0 | 1 | sync
node | 50253 | 0 | 1 | sync
node | 50254 | 0 | 1 | potential
node | 50255 | 0 | 1 | potential
(5 rows)
After writing this long message, and thinking more about that, I kind
of like the group approach. Thoughts welcome.
Regards,
--
Michael
On Thu, Sep 11, 2014 at 9:10 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Thu, Sep 11, 2014 at 5:21 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 08/28/2014 10:10 AM, Michael Paquier wrote:
+ #synchronous_standby_num = -1 # number of standbys servers using sync rep
To be honest, that's a horrible name for the GUC. Back when synchronous
replication was implemented, we had looong discussions on this feature.
It
was called "quorum commit" back then. I'd suggest using the "quorum"
term in
this patch, too, that's a fairly well-known term in distributed
computing
for this.
I am open to any suggestions. Then what about the following parameter
names?
- synchronous_quorum_num
- synchronous_standby_quorum
- synchronous_standby_quorum_num
- synchronous_quorum_commit
or simply synchronous_standbys
When synchronous replication was added, quorum was left out to keep
things
simple; the current feature set was the most we could all agree on to be
useful. If you search the archives for "quorum commit" you'll see what I
mean. There was a lot of confusion on what is possible and what is
useful,
but regarding this particular patch: people wanted to be able to
describe
more complicated scenarios. For example, imagine that you have a master
and two standbys in the primary data center, and two more standbys in a
different data center. It should be possible to specify that you must
get acknowledgment from at least one standby in both data centers. Maybe you
could hack that by giving the standbys in the same data center the same
name, but it gets ugly, and it still won't scale to even more complex
scenarios.
Won't this problem be handled if synchronous mode is supported in
cascading replication?
I am not sure about the feasibility of the same, but I think the basic problem
mentioned above can be addressed, and maybe others as well.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 28, 2014 at 12:40 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:
On Wed, Aug 27, 2014 at 2:46 PM, Rajeev rastogi
<rajeev.rastogi@huawei.com> wrote:
I have done some more review, below are my comments:
Thanks!
1. There are currently two loops on *num_sync, Can we simplify the
function SyncRepGetSynchronousNodes by moving the priority calculation
inside the upper loop
if (*num_sync == allowed_sync_nodes)
{
for (j = 0; j < *num_sync; j++)
{
Anyway we require priority only if *num_sync ==
allowed_sync_nodes condition matches.
So in this loop itself, we can calculate the priority as well
as assignment of new standbys with lower priority.
Let me know if you see any issue with this.
OK, I see, yes this can minimize process a bit so I refactored the
code by integrating the second loop to the first. This has needed the
removal of the break portion as we need to find the highest priority
value among the nodes already determined as synchronous.
2. Comment inside the function SyncRepReleaseWaiters,
/*
* We should have found ourselves at least, except if it is not
expected
* to find any synchronous nodes.
*/
Assert(num_sync > 0);I think "except if it is not expected to find any synchronous
nodes" is not required.
As if it has come till this point means at least this node is
synchronous.
Yes, removed.
3. The documentation says that s_s_num should be less than
max_wal_senders, but code-wise there is no protection for that.
IMHO, s_s_num should be less than or equal to max_wal_senders,
otherwise COMMIT will never return control to the console and the
user will not know why.
I see that some discussion has happened regarding this, but I
think just adding documentation for this is not enough.
I am not sure what issue is observed in adding a check during GUC
initialization, but if there is an unavoidable issue during GUC
initialization then can't we try to add the check at a later point?
The trick here is that you cannot really return a warning to the user
at GUC load time, as a warning could easily be triggered if, for
example, s_s_num appears before max_wal_senders in postgresql.conf. I
am open to any solutions if there are any (like an error when
initializing WAL senders?!). Documentation seems enough to me to warn
the user.
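If we did go the error route, I am thinking of something like the
following sketch at WAL sender startup (assuming, and this is only my
assumption, that the GUC's backing C variable is simply called
synchronous_standby_num):
if (synchronous_standby_num > max_wal_senders)
    ereport(ERROR,
            (errmsg("synchronous_standby_num must not be larger than max_wal_senders")));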
How about making it a PGC_POSTMASTER parameter and then
having a check similar to the below in PostmasterMain()?
/*
 * Check for invalid combinations of GUC settings.
 */
if (ReservedBackends >= MaxConnections)
{
    write_stderr("%s: superuser_reserved_connections must be less than max_connections\n",
                 progname);
    ExitPostmaster(1);
}
if (max_wal_senders >= MaxConnections)
{
    write_stderr("%s: max_wal_senders must be less than max_connections\n",
                 progname);
    ExitPostmaster(1);
}
if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
    ereport(ERROR,
            (errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
    ereport(ERROR,
            (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Currently two nodes can only have the same priority if they have the
same application_name, so we could for example add a new connstring
parameter called, let's say application_group, to define groups of
nodes that will have the same priority (if a node does not define
application_group, it defaults to application_name, if app_name is
NULL, well we don't care much it cannot be a sync candidate). That's a
first idea that we could use to control groups of nodes. And we could
switch syncrep.c to use application_group in s_s_names instead of
application_name. That would be backward-compatible, and could open
the door for more improvements for quorum commits as we could control
groups node nodes. Well this is a super-set of what application_name
can already do, but there is no problem to identify single nodes of
the same data center and how much they could be late in replication,
so I think that this would be really user-friendly. An idea similar to
that would be a base work for the next thing... See below.
In general, I think the user's requirement for which synchronous
standbys need to acknowledge a commit could be an arbitrary
Boolean expression - well, probably no NOT, but any amount of AND and
OR that you want to use. Can someone want A OR (((B AND C) OR (D AND
E)) AND F)? Maybe! Based on previous discussions, it seems not
unlikely that as soon as we decide we don't want to support that,
someone will tell us they can't live without it. In general, though,
I'd expect the two common patterns to be more or less what you've set
forth above: any K servers from set X plus any L servers from set Y
plus any M servers from set Z, etc. However, I'm not confident it's
right to control this by adding more configuration on the client side.
I think it would be better to stick with the idea that each client
specifies an application_name, and then the master specifies the
policy in some way. One advantage of that is that you can change the
rules in ONE place - the master - rather than potentially having to
update every client.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 12, 2014 at 12:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Sep 10, 2014 at 11:40 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Currently two nodes can only have the
same application_name, so we could for example add a new connstring
parameter called, let's say application_group, to define groups of
nodes that will have the same priority (if a node does not define
application_group, it defaults to application_name, if app_name is
NULL, well we don't care much it cannot be a sync candidate). That's a
first idea that we could use to control groups of nodes. And we could
switch syncrep.c to use application_group in s_s_names instead of
application_name. That would be backward-compatible, and could open
the door for more improvements for quorum commits as we could control
groups of nodes. Well this is a super-set of what application_name
can already do, but there is no problem to identify single nodes of
the same data center and how much they could be late in replication,
so I think that this would be really user-friendly. An idea similar to
that would be a base work for the next thing... See below.
In general, I think the user's requirement for which synchronous
standbys need to acknowledge a commit could be an arbitrary
Boolean expression - well, probably no NOT, but any amount of AND and
OR that you want to use. Can someone want A OR (((B AND C) OR (D AND
E)) AND F)? Maybe! Based on previous discussions, it seems not
unlikely that as soon as we decide we don't want to support that,
someone will tell us they can't live without it. In general, though,
I'd expect the two common patterns to be more or less what you've set
forth above: any K servers from set X plus any L servers from set Y
plus any M servers from set Z, etc. However, I'm not confident it's
right to control this by adding more configuration on the client side.
I think it would be better to stick with the idea that each client
specifies an application_name, and then the master specifies the
policy in some way. One advantage of that is that you can change the
rules in ONE place - the master - rather than potentially having to
update every client.
OK. I see your point.
Now, what about the following assumptions (somewhat restrictive, but
they facilitate the user experience for setting up syncrep and the
parametrization of this feature):
- Nodes are defined within the same set (or group) if they have the
same priority, aka the same application_name.
- One node cannot be a part of two sets. That's obvious...
The current patch has its own merit, but it fails in the case you and
Heikki are describing: wait for k nodes in set 1 (nodes with the lowest
priority value), l nodes in set 2 (nodes with the 2nd-lowest priority
value), etc.
What it does is, if for example we have a set of nodes with priorities
{0,1,1,2,2,3,3}, backends will wait for flush_position from the first
s_s_num nodes. By setting s_s_num to 3, we'll wait for {0,1,1}; to 4,
{0,1,1,2}, etc.
Now what about that: instead of waiting for the nodes in "absolute"
order the way the current patch does, let's do it in a "relative"
way. By that I mean that a backend waits for flush_position
confirmation only from *1* node among a set of nodes having the same
priority. So by using s_s_num = 3, we'll wait for {0, "one node with
1", "one node with 2"}, and you can guess the rest.
The point is as well that we can keep s_s_num behavior as it is now:
- if set at -1, we rely on the default behavior of s_s_names
(empty means all nodes are async, at least one entry meaning that we
need to wait for a node)
- if set at 0, all nodes are forced to be async'd
- if set at n > 1, we have to wait for one node in each set of the
N-lowest priority values.
I'd see enough users happy with those improvements, and that would
help improve the coverage of test cases that Heikki and you
envisioned.
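To make the "relative" rule easier to picture, here is a small
standalone sketch (plain C with invented structures, nothing taken from
the patch); it assumes consecutive priority values starting at 0, as in
the example above:
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    int  priority;  /* priority derived from synchronous_standby_names */
    bool acked;     /* has this walsender confirmed the flush position? */
} Standby;

/*
 * Return true once at least one standby in each of the s_s_num lowest
 * priority groups has acknowledged.  Assumes priorities are consecutive
 * integers starting at 0.
 */
static bool
quorum_reached(const Standby *nodes, int n, int s_s_num)
{
    int satisfied = 0;
    int prio;

    for (prio = 0; satisfied < s_s_num; prio++)
    {
        bool group_exists = false;
        bool group_acked = false;
        int  i;

        for (i = 0; i < n; i++)
        {
            if (nodes[i].priority != prio)
                continue;
            group_exists = true;
            if (nodes[i].acked)
                group_acked = true;
        }
        if (!group_exists || !group_acked)
            return false;   /* missing group, or no ack from it yet */
        satisfied++;
    }
    return true;
}

int
main(void)
{
    /* priorities {0,1,1,2,2,3,3}; one node in each of groups 0..2 has acked */
    Standby nodes[] = {
        {0, true}, {1, false}, {1, true}, {2, true},
        {2, false}, {3, false}, {3, false}
    };

    printf("s_s_num = 3 -> %d\n", quorum_reached(nodes, 7, 3));    /* 1 */
    printf("s_s_num = 4 -> %d\n", quorum_reached(nodes, 7, 4));    /* 0 */
    return 0;
}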
By the way, as the CF is running short on time, I am going to mark this
patch as "Returned with Feedback", as I have received enough feedback.
I am still planning to work on that for the next CF, so it would be
great if there is an agreement on what can be done for this feature to
avoid blind progress. Particularly I see some merit in the last idea,
which we could still extend by allowing values of the type "k,l,m" in
s_s_num to let the user decide: wait for 3 sets, k nodes in set 1, l
nodes in set 2 and m nodes in set 3. A GUC parameter with such integer
values is not that user-friendly though, so I think I'd stick with
waiting for only one node per set.
Thoughts?
--
Michael
On Fri, Sep 12, 2014 at 1:13 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
OK. I see your point.
Now, what about the following assumptions (somewhat restrictive, but
they facilitate the user experience for setting up syncrep and the
parametrization of this feature):
- Nodes are defined within the same set (or group) if they have the
same priority, aka the same application_name.
- One node cannot be a part of two sets. That's obvious...
I feel pretty strongly that we should encourage people to use a
different application_name for every server. The fact that a server
is interchangeable for one purpose does not mean that it's
interchangeable for all purposes; let's try to keep application_name
as the identifier for a server, and design the other facilities we
need around that.
The current patch has its own merit, but it fails in the case you and
Heikki are describing: wait for k nodes in set 1 (nodes with the lowest
priority value), l nodes in set 2 (nodes with the 2nd-lowest priority
value), etc.
What it does is, if for example we have a set of nodes with priorities
{0,1,1,2,2,3,3}, backends will wait for flush_position from the first
s_s_num nodes. By setting s_s_num to 3, we'll wait for {0,1,1}; to 4,
{0,1,1,2}, etc.
Now what about that: instead of waiting for the nodes in "absolute"
order the way the current patch does, let's do it in a "relative"
way. By that I mean that a backend waits for flush_position
confirmation only from *1* node among a set of nodes having the same
priority. So by using s_s_num = 3, we'll wait for {0, "one node with
1", "one node with 2"}, and you can guess the rest.
The point is as well that we can keep s_s_num behavior as it is now:
- if set at -1, we rely on the default behavior of s_s_names
(empty means all nodes are async, at least one entry meaning that we
need to wait for a node)
- if set at 0, all nodes are forced to be async'd
- if set at n > 1, we have to wait for one node in each set of the
N-lowest priority values.
I'd see enough users happy with those improvements, and that would
help improve the coverage of test cases that Heikki and you
envisioned.
Sounds confusing. I hate to be the guy always suggesting a
mini-language (cf. recent discussion of an expression syntax for
pgbench), but we could do much more powerful and flexible things here
if we had one. For example, suppose we let each element of
synchronous_standby_names use the constructs (X,Y,Z,...) [meaning one
of the parenthesized servers] and N(X,Y,Z,...) [meaning N of the
parenthesized servers]. Then if you want to consider a commit
acknowledged when you have any two of foo, bar, and baz you can write:
synchronous_standby_names = 2(foo,bar,baz)
And if you want to acknowledge when you've got either foo or both bar
and baz, you can write:
synchronous_standby_names = (foo,2(bar,baz))
And if you want one of foo and bar and one of baz and bletch, you can write:
synchronous_standby_names = 2((foo,bar),(baz,bletch))
The crazy-complicated policy I mentioned upthread would be:
synchronous_standby_names = (a,2((2(b,c),2(d,e)),f))
or (equivalently and more simply)
synchronous_standby_names = (a,3(b,c,f),3(d,e,f))
We could have a rule that we fall back to the next rule in
synchronous_standby_names when the first rule can never be satisfied
by the connected standbys. For example, if you have foo, bar, and
baz, and you want any two of the three, but wish to prefer waiting for
foo over the others when it's connected, then you could write:
synchronous_standby_names = 2(foo,(bar,baz)), 2(bar, baz)
If foo disconnects, the first rule can never be met, so we use the
second rule. It's still 2 out of 3, just as if we'd written
2(foo,bar,baz) but we won't accept an ack from bar and baz as
sufficient unless foo is dead.
The exact syntax here is of course debatable; maybe somebody can come up
with something better. But it doesn't seem like it would be
incredibly painful to implement, and it would give us a lot of
flexibility.
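Just to make the semantics concrete, here is a tiny standalone sketch
(plain C; every structure and name is invented for illustration, and it
is not meant to be how syncrep.c would actually implement this) that
evaluates 2(foo,2(bar,baz)) against the set of standbys that have
acked:
#include <stdio.h>
#include <string.h>

typedef struct Node
{
    const char  *name;          /* leaf: standby name; NULL for a group */
    int          quorum;        /* group: how many children must be satisfied */
    struct Node *children[8];
    int          nchildren;
} Node;

/* 1 if the policy node is satisfied by the standbys that have acked */
static int
satisfied(const Node *node, const char **acked, int nacked)
{
    int i, ok;

    if (node->name != NULL)
    {
        for (i = 0; i < nacked; i++)
            if (strcmp(acked[i], node->name) == 0)
                return 1;
        return 0;
    }
    ok = 0;
    for (i = 0; i < node->nchildren; i++)
        ok += satisfied(node->children[i], acked, nacked);
    return ok >= node->quorum;
}

int
main(void)
{
    Node foo = { .name = "foo" };
    Node bar = { .name = "bar" };
    Node baz = { .name = "baz" };
    /* 2(bar,baz): both bar and baz */
    Node both = { .quorum = 2, .children = { &bar, &baz }, .nchildren = 2 };
    /* 2(foo,2(bar,baz)): foo plus both of bar and baz */
    Node policy = { .quorum = 2, .children = { &foo, &both }, .nchildren = 2 };

    const char *acked_all[] = { "foo", "bar", "baz" };
    const char *acked_two[] = { "foo", "bar" };

    printf("foo,bar,baz -> %d\n", satisfied(&policy, acked_all, 3)); /* 1 */
    printf("foo,bar     -> %d\n", satisfied(&policy, acked_two, 2)); /* 0 */
    return 0;
}
A real implementation would of course parse the string form into such a
tree once, when the GUC is loaded, and track acknowledgments
incrementally.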
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9/12/14, 2:28 PM, Robert Haas wrote:
I hate to be the guy always suggesting a mini-language (cf. recent
discussion of an expression syntax for pgbench), but we could do much
more powerful and flexible things here if we had one. For example,
suppose we let each element of synchronous_standby_names use the
constructs (X,Y,Z,...)
While I have my old list history hat on this afternoon, when the 9.1
deadline was approaching I said that some people were not going to be
happy until "is it safe to commit?" calls an arbitrary function that is
passed the names of all the active servers, and then they could plug
whatever consensus rule they wanted into there. And then I said that if
we actually wanted to ship something, it should be some stupid simple
thing like just putting a list of servers in synchronous_standby_names
and proceeding if one is active. One of those two ideas worked out...
Can you make a case for why it needs to be a mini-language instead of a
function?
--
Greg Smith greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
On Fri, Sep 12, 2014 at 2:44 PM, Gregory Smith <gregsmithpgsql@gmail.com> wrote:
On 9/12/14, 2:28 PM, Robert Haas wrote:
I hate to be the guy always suggesting a mini-language (cf. recent
discussion of an expression syntax for pgbench), but we could do much more
powerful and flexible things here if we had one. For example, suppose we let
each element of synchronous_standby_names use the constructs (X,Y,Z,...)
While I have my old list history hat on this afternoon, when the 9.1
deadline was approaching I said that some people were not going to be happy
until "is it safe to commit?" calls an arbitrary function that is passed the
names of all the active servers, and then they could plug whatever consensus
rule they wanted into there. And then I said that if we actually wanted to
ship something, it should be some stupid simple thing like just putting a
list of servers in synchronous_standby_names and proceeding if one is
active. One of those two ideas worked out...
Can you make a case for why it needs to be a mini-language instead of a
function?
I think so. If we make it a function, then it's either the kind of
function that you access via pg_proc, or it's the kind that's written
in C and installed by storing a function pointer in a hook variable
from _PG_init(). The first approach is a non-starter because it would
require walsender to be connected to the database where that function
lives, which is a non-starter at least for logical replication where
we need walsender to be connected to the database being replicated.
Even if we found some way around that problem, and I'm not sure there
is one, I suspect the overhead would be pretty high. The second
approach - a hook that can be accessed directly by loadable modules -
seems like it would work fine; the only problem is that you've
got to write your policy function in C. But I have no issue with
exposing it that way if someone wants to write a patch. There is no
joy in getting between the advanced user and his nutty-complicated
sync rep configuration. However, I suspect that most users will
prefer a more user-friendly interface.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Sep 13, 2014 at 3:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I think so. If we make it a function, then it's either the kind of
function that you access via pg_proc, or it's the kind that's written
in C and installed by storing a function pointer in a hook variable
from _PG_init(). The first approach is a non-starter because it would
require walsender to be connected to the database where that function
lives, which is a non-starter at least for logical replication where
we need walsender to be connected to the database being replicated.
Even if we found some way around that problem, and I'm not sure there
is one, I suspect the overhead would be pretty high. The second
approach - a hook that can be accessed directly by loadable modules -
seems like it would work fine; the only problem is that you've
got to write your policy function in C. But I have no issue with
exposing it that way if someone wants to write a patch. There is no
joy in getting between the advanced user and his nutty-complicated
sync rep configuration. However, I suspect that most users will
prefer a more user-friendly interface.
Reading both your answers, I'd tend to think that having a set of
hooks to satisfy all the potential user requests would be enough. We
could let the server code decide what is the priority of the standbys
using the information in synchronous_standby_names, then have the
hooks interact with SyncRepReleaseWaiters and pg_stat_get_wal_senders.
We would need two hooks:
- one able to get an array of WAL sender positions defining all the
nodes considered as sync nodes. This is enough for
pg_stat_get_wal_senders. SyncRepReleaseWaiters would need it as
well...
- a second able to define the update policy of the write and flush
positions in walsndctl (SYNC_REP_WAIT_FLUSH and SYNC_REP_WAIT_WRITE),
as well as the next failover policy. This would be needed when a WAL
sender calls SyncRepReleaseWaiters.
Perhaps that's overthinking, but I am getting the impression that
whatever decision is taken, it would involve modifications of the
sync-standby parametrization at the GUC level, and whatever the path
chosen (dedicated language, set of integer params), there will be
complaints about what things can or cannot be done.
At least a set of hooks has the merit to say: do what you like with
your synchronous node policy.
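As a very rough standalone sketch of the mechanism only (plain C; the
hook name, signature and default policy below are all invented, this is
not a proposal of the actual API):
#include <stdio.h>

/*
 * Hypothetical hook: fills "slots" with the walsender slot numbers
 * considered synchronous and returns how many there are.
 */
typedef int (*sync_nodes_hook_type) (int *slots, int max_slots);

/* default policy: only the standby with the lowest priority is synchronous */
static int
default_sync_nodes(int *slots, int max_slots)
{
    if (max_slots <= 0)
        return 0;
    slots[0] = 0;               /* pretend slot 0 holds the lowest priority */
    return 1;
}

/* the variable a loadable module would overwrite from its _PG_init() */
static sync_nodes_hook_type sync_nodes_hook = default_sync_nodes;

int
main(void)
{
    int slots[4];
    int n = sync_nodes_hook(slots, 4);

    printf("number of sync standbys: %d\n", n);
    return 0;
}
A module loaded via shared_preload_libraries would then simply assign
its own function to the hook variable from its _PG_init().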
--
Michael
On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
At least a set of hooks has the merit to say: do what you like with
your synchronous node policy.
Sure. I dunno if people will find that terribly user-friendly, so we
might not want that to be the ONLY thing we offer.
But even if it is, it is certainly better than a poke in the eye with
a sharp stick.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
At least a set of hooks has the merit to say: do what you like with
your synchronous node policy.
Sure. I dunno if people will find that terribly user-friendly, so we
might not want that to be the ONLY thing we offer.
Well, a user-friendly interface is actually the reason why a simple GUC
integer was used in the first series of patches present on this thread
to set as sync the N-nodes with the lowest priority. I could not come
up with something more simple. Hence what about doing the following:
- A patch refactoring code for pg_stat_get_wal_senders and
SyncRepReleaseWaiters as there is in either case duplicated code in
this area to select the synchronous node as the one connected with
lowest priority
- A patch defining the hooks necessary, I suspect that two of them are
necessary as mentioned upthread.
- A patch for a contrib module implementing an example of simple
policy. It can be a fancy thing with a custom language or even a more
simple thing.
Thoughts? Patch 1 refactoring the code is a win in all cases.
Regards,
--
Michael
On Tue, Sep 16, 2014 at 2:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
At least a set of hooks has the merit to say: do what you like with
your synchronous node policy.
Sure. I dunno if people will find that terribly user-friendly, so we
might not want that to be the ONLY thing we offer.
Well, a user-friendly interface is actually the reason why a simple GUC
integer was used in the first series of patches present on this thread
to set as sync the N-nodes with the lowest priority. I could not come
up with something more simple. Hence what about doing the following:
- A patch refactoring code for pg_stat_get_wal_senders and
SyncRepReleaseWaiters as there is in either case duplicated code in
this area to select the synchronous node as the one connected with
lowest priority
A strong +1 for this idea. I have never liked that, and cleaning it
up seems eminently sensible.
- A patch defining the hooks necessary, I suspect that two of them are
necessary as mentioned upthread.
- A patch for a contrib module implementing an example of simple
policy. It can be a fancy thing with a custom language or even a more
simple thing.
I'm less convinced about this part. There's a big usability gap
between a GUC and a hook, and I think Heikki's comments upthread were
meant to suggest that even in GUC-land we can probably satisfy more
use cases than what this patch does now. I think that's right.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 19, 2014 at 12:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Sep 16, 2014 at 2:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
On Tue, Sep 16, 2014 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Sep 15, 2014 at 3:00 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
At least a set of hooks has the merit to say: do what you like with
your synchronous node policy.
Sure. I dunno if people will find that terribly user-friendly, so we
might not want that to be the ONLY thing we offer.
Well, a user-friendly interface is actually the reason why a simple GUC
integer was used in the first series of patches present on this thread
to set as sync the N-nodes with the lowest priority. I could not come
up with something more simple. Hence what about doing the following:
- A patch refactoring code for pg_stat_get_wal_senders and
SyncRepReleaseWaiters as there is in either case duplicated code in
this area to select the synchronous node as the one connected with
lowest priority
A strong +1 for this idea. I have never liked that, and cleaning it
up seems eminently sensible.
Interestingly, the syncrep code has in some of its code paths the idea
that a synchronous node is unique, while other code paths assume that
there can be multiple synchronous nodes. If that's fine, I think it
would be better to just make the code multiple-sync-node aware, by
having a single function call in walsender.c and syncrep.c that
returns an integer array of WAL sender positions (WalSndCtl), as that
seems more extensible long-term. Well, for now the array would have a
single element, being the WAL sender with the lowest priority > 0. Feel
free to protest about that approach though :)
- A patch defining the hooks necessary, I suspect that two of them are
necessary as mentioned upthread.
- A patch for a contrib module implementing an example of simple
policy. It can be a fancy thing with a custom language or even a more
simple thing.
I'm less convinced about this part. There's a big usability gap
between a GUC and a hook, and I think Heikki's comments upthread were
meant to suggest that even in GUC-land we can probably satisfy more
use cases than what this patch does now. I think that's right.
Hehe. OK, then let's see how something with a GUC would go. There
is no parameter now using a custom language as its format, but I
guess that it is fine to have a text parameter with a validation
callback. No?
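For instance, as a rough sketch of what I mean by a validation
callback, written as if it were an extension GUC (all names are
invented; a core parameter would instead get its check hook wired into
guc.c, and a real check would parse the full grammar rather than just
counting parentheses):
#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static char *sync_policy_string;

/* toy check hook: only verifies that parentheses are balanced */
static bool
check_sync_policy(char **newval, void **extra, GucSource source)
{
    const char *p;
    int         depth = 0;

    if (*newval == NULL)
        return true;
    for (p = *newval; *p; p++)
    {
        if (*p == '(')
            depth++;
        else if (*p == ')')
            depth--;
        if (depth < 0)
            break;
    }
    if (depth != 0)
    {
        GUC_check_errdetail("unbalanced parentheses in policy string");
        return false;
    }
    return true;
}

void
_PG_init(void)
{
    DefineCustomStringVariable("my_syncrep.policy",
                               "Synchronous standby policy expression.",
                               NULL,
                               &sync_policy_string,
                               "",
                               PGC_SIGHUP,
                               0,
                               check_sync_policy,
                               NULL,
                               NULL);
}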
--
Michael