Quorum commit for multiple synchronous replication.

Started by Masahiko Sawada over 9 years ago, 128 messages

sawada.mshk@gmail.com

over 9 years ago

2 attachment(s)

Hi all,

During the 9.6 development cycle, we discussed the configuration
syntax for a long time with future expansion in mind.
As a result, we got a new dedicated language for multiple synchronous
replication, but it supports only the priority method.
We know that quorum commit is very useful for many users, and the
dedicated language can easily be extended to support it.
So I'd like to propose quorum commit for multiple synchronous replication here.

The attached patches make the following changes.
- Add new syntax 'Any N ( node1, node2, ... )' to
synchronous_standby_names for quorum commit.
- In quorum commit, the master can return commit to the client after
receiving an ACK from *at least* any N of the listed standbys.
- The sync_priority of all listed servers is the same, 1.
- Add a regression test for quorum commit.
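
To make the "at least any N" rule concrete, here is a minimal standalone sketch. It is not code from the patch: it assumes LSNs can be modeled as plain 64-bit integers, and `Lsn`, `cmp_lsn_desc` and `quorum_lsn` are names invented for this illustration. The idea is the same one the patch uses: sort the reported positions in descending order and take the N-th entry, which is the newest position that N standbys have all reached.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr (assumption) */

/* qsort comparator: descending order, newest (largest) LSN first */
static int
cmp_lsn_desc(const void *a, const void *b)
{
    Lsn x = *(const Lsn *) a;
    Lsn y = *(const Lsn *) b;

    if (x > y)
        return -1;
    if (x < y)
        return 1;
    return 0;
}

/*
 * Return the newest LSN that at least n of the nstandbys standbys have
 * reached, or 0 if there are fewer than n standbys (or on allocation
 * failure). The input array is copied, so the caller's data is untouched.
 */
static Lsn
quorum_lsn(const Lsn *acked, int nstandbys, int n)
{
    Lsn *sorted;
    Lsn  result;

    if (n < 1 || nstandbys < n)
        return 0;

    sorted = malloc(sizeof(Lsn) * (size_t) nstandbys);
    if (sorted == NULL)
        return 0;
    memcpy(sorted, acked, sizeof(Lsn) * (size_t) nstandbys);

    qsort(sorted, (size_t) nstandbys, sizeof(Lsn), cmp_lsn_desc);
    result = sorted[n - 1];     /* n-th newest position */

    free(sorted);
    return result;
}
```

For example, with reported positions 300, 100 and 200 and N = 2, the function returns 200, because two standbys have reached at least that position.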

I was thinking that the syntax for the quorum method could use '[ ... ]',
but that would be confused with the '( ... )' used by the priority method.
The 000 patch adds an 'Any N ( ... )' style syntax, but we might still
need to discuss a better syntax; discussion is very welcome.
The draft patches are attached; please give me feedback.

Regards,

--
Masahiko Sawada

Attachments:

000_quorum_commit_v1.patch (text/x-patch; charset=US-ASCII)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 67249d8..0ce5399 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -76,9 +76,9 @@ char	   *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+SyncRepConfigData *SyncRepConfig = NULL;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
+static bool announce_next_takeover = true;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
@@ -89,7 +89,12 @@ static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 						   XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr,
 						   bool *am_sync);
+static bool SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr,
+						   XLogRecPtr *flushPtr,
+						   XLogRecPtr *applyPtr,
+						   int pos, bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -391,7 +396,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -418,11 +423,16 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using the configured method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		got_recptr = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
+											 &applyPtr, &am_sync);
+	else /* SYNC_REP_QUORUM */
+		got_recptr = SyncRepGetNNewestSyncRecPtr(&writePtr, &flushPtr,
+											  &applyPtr, SyncRepConfig->num_sync,
+											  &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -440,7 +450,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -476,6 +486,88 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
+ * Calculate the 'pos' newest Write, Flush and Apply positions among sync standbys.
+ *
+ * Return false if the number of sync standbys is less than
+ * synchronous_standby_names specifies. Otherwise return true and
+ * store the 'pos' newest positions into *writePtr, *flushPtr, *applyPtr.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static bool
+SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+						XLogRecPtr *applyPtr, int pos, bool *am_sync)
+{
+	XLogRecPtr	*write_array;
+	XLogRecPtr	*flush_array;
+	XLogRecPtr	*apply_array;
+	List	   *sync_standbys;
+	ListCell   *cell;
+	int			len;
+	int			i = 0;
+
+	*writePtr = InvalidXLogRecPtr;
+	*flushPtr = InvalidXLogRecPtr;
+	*applyPtr = InvalidXLogRecPtr;
+	*am_sync = false;
+
+	/* Get standbys that are considered as synchronous at this moment */
+	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+
+	/*
+	 * Quick exit if we are not managing a sync standby or there are not
+	 * enough synchronous standbys.
+	 */
+	if (!(*am_sync) ||
+		SyncRepConfig == NULL ||
+		list_length(sync_standbys) < SyncRepConfig->num_sync)
+	{
+		list_free(sync_standbys);
+		return false;
+	}
+
+	len = list_length(sync_standbys);
+	write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+	flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+	apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+	/*
+	 * Scan through all sync standbys and calculate 'pos' Newest
+	 * Write, Flush and Apply positions.
+	 */
+	foreach (cell, sync_standbys)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+		SpinLockAcquire(&walsnd->mutex);
+		write_array[i] = walsnd->write;
+		flush_array[i] = walsnd->flush;
+		apply_array[i] = walsnd->apply;
+		SpinLockRelease(&walsnd->mutex);
+
+		i++;
+	}
+
+	/* Sort each array in descending order to get 'pos' newest element */
+	qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+	qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+	qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+	/* Get 'pos' newest Write, Flush, Apply positions */
+	*writePtr = write_array[pos - 1];
+	*flushPtr = flush_array[pos - 1];
+	*applyPtr = apply_array[pos - 1];
+
+	pfree(write_array);
+	pfree(flush_array);
+	pfree(apply_array);
+	list_free(sync_standbys);
+
+	return true;
+}
+
+/*
  * Calculate the oldest Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
@@ -513,12 +605,12 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	}
 
 	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
+	 * Scan through all sync standbys and calculate the oldest
+	 * Write, Flush and Apply positions.
 	 */
-	foreach(cell, sync_standbys)
+	foreach (cell, sync_standbys)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
@@ -542,17 +634,88 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys according to the synchronous method,
+ * or NIL if no sync standby is connected. The caller must hold SyncRepLock.
  *
- * If there are multiple standbys with the same priority,
- * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else /* SYNC_REP_QUORUM */
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+}
+
+/*
+ * Return the list of sync standbys using quorum method, or
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
+ * the first one found is selected preferentially.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 List *
-SyncRepGetSyncStandbys(bool *am_sync)
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -565,14 +728,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
-
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
 
@@ -754,6 +909,10 @@ SyncRepGetStandbyPriority(void)
 		standby_name += strlen(standby_name) + 1;
 	}
 
+	/* In quorum method, all sync standby priorities are always 1 */
+	if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		priority = 1;
+
 	return (found ? priority : 0);
 }
 
@@ -897,6 +1056,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..7026a96 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,9 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list					{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'		{ $$ = create_syncrep_config($1, $3, SYNC_REP_PRIORITY); }
+		| ANY NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
 	;
 
 standby_list:
@@ -77,7 +78,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +99,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..e229663 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,7 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any|ANY|Any
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +65,10 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a0dba19..16ad2f8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2861,7 +2861,8 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..4ec1e47 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronous method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
@@ -68,6 +75,8 @@ extern void SyncRepReleaseWaiters(void);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
+extern List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+extern List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);

001_add_regression_test_v1.patch (text/x-patch; charset=US-ASCII)

diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index baf4477..6fa5522 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 10;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -172,3 +172,25 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check that all standbys listed as voters have the same priority
+# when synchronous_standby_names uses the quorum method.
+test_sync_state(
+	$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby4|0|async),
+	'2 quorum and 1 async',
+	'Any 2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that a setting of 'Any 2(*)' chooses all standbys as
+# voters.
+test_sync_state(
+	$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+	'all standbys are considered as voter for quorum commit',
+	'Any 2(*)');

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Masahiko Sawada (#1)

Re: Quorum commit for multiple synchronous replication.

On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I was thinking that the syntax for the quorum method could use '[ ... ]',
> but that would be confused with the '( ... )' used by the priority method.
> The 000 patch adds an 'Any N ( ... )' style syntax, but we might still
> need to discuss a better syntax; discussion is very welcome.
> The draft patches are attached; please give me feedback.

I am +1 for using either "{}" or "[]" to define a quorum set, and -1
for adding a keyword in front of the integer that defines how many
nodes the server needs to wait for.

-    foreach(cell, sync_standbys)
+    foreach (cell, sync_standbys)
     {
-        WalSnd       *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+        WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
This patch has some noise.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Michael Paquier (#2)

Re: Quorum commit for multiple synchronous replication.

On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

>> I was thinking that the syntax for the quorum method could use '[ ... ]',
>> but that would be confused with the '( ... )' used by the priority method.
>> The 000 patch adds an 'Any N ( ... )' style syntax, but we might still
>> need to discuss a better syntax; discussion is very welcome.
>> The draft patches are attached; please give me feedback.

> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
> for adding a keyword in front of the integer that defines how many
> nodes the server needs to wait for.

Thank you for the reply.
"{}" or "[]" are not bad, but because they are not intuitive, I
thought it would be hard for users to use a different method for each
purpose.

> -    foreach(cell, sync_standbys)
> +    foreach (cell, sync_standbys)
>      {
> -        WalSnd       *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
> +        WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
> This patch has some noise.

Will fix.

--
Regards,

--
Masahiko Sawada

Petr Jelinek

petr@2ndquadrant.com

over 9 years ago

In reply to: Masahiko Sawada (#3)

Re: Quorum commit for multiple synchronous replication.

On 04/08/16 06:40, Masahiko Sawada wrote:

> On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:

>> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

>>> I was thinking that the syntax for the quorum method could use '[ ... ]',
>>> but that would be confused with the '( ... )' used by the priority method.
>>> The 000 patch adds an 'Any N ( ... )' style syntax, but we might still
>>> need to discuss a better syntax; discussion is very welcome.
>>> The draft patches are attached; please give me feedback.

>> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
>> for adding a keyword in front of the integer that defines how many
>> nodes the server needs to wait for.

> Thank you for the reply.
> "{}" or "[]" are not bad, but because they are not intuitive, I
> thought it would be hard for users to use a different method for each
> purpose.

I think the "any" keyword is more explicit and understandable, also
closer to SQL. So I would be in favor of doing that.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Fujii Masao

masao.fujii@gmail.com

over 9 years ago

In reply to: Petr Jelinek (#4)

Re: Quorum commit for multiple synchronous replication.

On Sat, Aug 6, 2016 at 6:36 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

> On 04/08/16 06:40, Masahiko Sawada wrote:

>> On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:

>>> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>>> wrote:

>>>> I was thinking that the syntax for the quorum method could use '[ ... ]',
>>>> but that would be confused with the '( ... )' used by the priority method.
>>>> The 000 patch adds an 'Any N ( ... )' style syntax, but we might still
>>>> need to discuss a better syntax; discussion is very welcome.
>>>> The draft patches are attached; please give me feedback.

>>> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
>>> for adding a keyword in front of the integer that defines how many
>>> nodes the server needs to wait for.

>> Thank you for the reply.
>> "{}" or "[]" are not bad, but because they are not intuitive, I
>> thought it would be hard for users to use a different method for each
>> purpose.

> I think the "any" keyword is more explicit and understandable, also closer
> to SQL. So I would be in favor of doing that.

Also, I like the following idea from Simon.

/messages/by-id/CANP8+jLHfBVv_pW6grASNUpW+bdk5DcTu7GWpNAP-+-ZWvKT6w@mail.gmail.com
-----------------------
* first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
* any k (n1, n2, n3) – would release waiters as soon as we have the
responses from k out of N standbys. “any k” would be faster, so is
desirable for performance and resilience
-----------------------
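
The difference between the two methods quoted above can be sketched in a few lines of standalone C. This is an illustration under simplifying assumptions, not PostgreSQL code: every listed standby is assumed to be connected, the array is in priority order (n1 first), LSNs are modeled as non-zero 64-bit integers, and `release_lsn_first` and `release_lsn_any` are hypothetical names.

```c
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr (assumption) */

/*
 * "first k": wait for the k highest-priority standbys, so waiters can be
 * released up to the oldest position among the first k array entries.
 * Returns 0 if k is out of range.
 */
static Lsn
release_lsn_first(const Lsn *flushed, int nstandbys, int k)
{
    Lsn oldest;
    int i;

    if (k < 1 || k > nstandbys)
        return 0;
    oldest = flushed[0];
    for (i = 1; i < k; i++)
        if (flushed[i] < oldest)
            oldest = flushed[i];
    return oldest;
}

/*
 * "any k": waiters can be released up to the newest position that at
 * least k standbys, whichever they are, have reached.
 * Returns 0 if k is out of range.
 */
static Lsn
release_lsn_any(const Lsn *flushed, int nstandbys, int k)
{
    Lsn best = 0;
    int i, j;

    if (k < 1 || k > nstandbys)
        return 0;
    for (i = 0; i < nstandbys; i++)
    {
        int count = 0;

        /* how many standbys are at or beyond candidate flushed[i]? */
        for (j = 0; j < nstandbys; j++)
            if (flushed[j] >= flushed[i])
                count++;
        if (count >= k && flushed[i] > best)
            best = flushed[i];
    }
    return best;
}
```

With flush positions 100 (n1), 300 (n2) and 200 (n3) and k = 2, "first 2" can release waiters only up to 100, while "any 2" can release them up to 200.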

Regards,

--
Fujii Masao

Simon Riggs

simon@2ndquadrant.com

over 9 years ago

In reply to: Fujii Masao (#5)

Re: Quorum commit for multiple synchronous replication.

On 29 August 2016 at 14:52, Fujii Masao <masao.fujii@gmail.com> wrote:

> On Sat, Aug 6, 2016 at 6:36 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

>> On 04/08/16 06:40, Masahiko Sawada wrote:

>>> On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:

>>>> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>>>> wrote:

>>>>> I was thinking that the syntax for the quorum method could use '[ ... ]',
>>>>> but that would be confused with the '( ... )' used by the priority method.
>>>>> The 000 patch adds an 'Any N ( ... )' style syntax, but we might still
>>>>> need to discuss a better syntax; discussion is very welcome.
>>>>> The draft patches are attached; please give me feedback.

>>>> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
>>>> for adding a keyword in front of the integer that defines how many
>>>> nodes the server needs to wait for.

>>> Thank you for the reply.
>>> "{}" or "[]" are not bad, but because they are not intuitive, I
>>> thought it would be hard for users to use a different method for each
>>> purpose.

>> I think the "any" keyword is more explicit and understandable, also closer
>> to SQL. So I would be in favor of doing that.

+1

> Also, I like the following idea from Simon.

> /messages/by-id/CANP8+jLHfBVv_pW6grASNUpW+bdk5DcTu7GWpNAP-+-ZWvKT6w@mail.gmail.com
> -----------------------
> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
> * any k (n1, n2, n3) – would release waiters as soon as we have the
> responses from k out of N standbys. “any k” would be faster, so is
> desirable for performance and resilience
> -----------------------

"synchronous_method" -> "synchronization_method"

I'm concerned about the performance of this code. Can we work out a
way of measuring it, so we can judge how successful we are at
releasing waiters quickly? Thanks

For 9.6 we implemented something that allows the DBA to define how
slow programs are. Previously, since 9.1 this was something specified
on the application side. I would like to put it back that way, so we
end up with a parameter on client e.g. commit_quorum = k. Forget the
exact parameters/user API for now, but I'd like to allow the code to
work with user defined settings. Thanks.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Simon Riggs (#6)

Re: Quorum commit for multiple synchronous replication.

On Tue, Sep 6, 2016 at 11:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> On 29 August 2016 at 14:52, Fujii Masao <masao.fujii@gmail.com> wrote:

>> On Sat, Aug 6, 2016 at 6:36 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

>>> On 04/08/16 06:40, Masahiko Sawada wrote:

>>>> On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:

>>>>> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>>>>> wrote:

>>>>>> I was thinking that the syntax for the quorum method could use '[ ... ]',
>>>>>> but that would be confused with the '( ... )' used by the priority method.
>>>>>> The 000 patch adds an 'Any N ( ... )' style syntax, but we might still
>>>>>> need to discuss a better syntax; discussion is very welcome.
>>>>>> The draft patches are attached; please give me feedback.

>>>>> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
>>>>> for adding a keyword in front of the integer that defines how many
>>>>> nodes the server needs to wait for.

>>>> Thank you for the reply.
>>>> "{}" or "[]" are not bad, but because they are not intuitive, I
>>>> thought it would be hard for users to use a different method for each
>>>> purpose.

>>> I think the "any" keyword is more explicit and understandable, also closer
>>> to SQL. So I would be in favor of doing that.

> +1

>> Also, I like the following idea from Simon.

>> /messages/by-id/CANP8+jLHfBVv_pW6grASNUpW+bdk5DcTu7GWpNAP-+-ZWvKT6w@mail.gmail.com
>> -----------------------
>> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
>> * any k (n1, n2, n3) – would release waiters as soon as we have the
>> responses from k out of N standbys. “any k” would be faster, so is
>> desirable for performance and resilience
>> -----------------------

+1

> "synchronous_method" -> "synchronization_method"

Thanks, will fix.

> I'm concerned about the performance of this code. Can we work out a
> way of measuring it, so we can judge how successful we are at
> releasing waiters quickly? Thanks

I will measure the performance effect of this code.
I expect the performance ordering to be:
'first 1 (n1, n2)' > 'any 1(n1, n2)' > 'first 2(n1, n2)'
'first 1 (n1, n2)' will have the highest throughput.

> For 9.6 we implemented something that allows the DBA to define how
> slow programs are. Previously, since 9.1 this was something specified
> on the application side. I would like to put it back that way, so we
> end up with a parameter on client e.g. commit_quorum = k. Forget the
> exact parameters/user API for now, but I'd like to allow the code to
> work with user defined settings. Thanks.

I see. The client-side parameter should take effect for the priority
method as well. And, similar to synchronous_commit, the client could
specify how many standbys the master waits for at commit, according to
the synchronization method, even if s_s_names is defined.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Josh Berkus

josh@agliodbs.com

over 9 years ago

In reply to: Masahiko Sawada (#1)

Re: Quorum commit for multiple synchronous replication.

On 08/29/2016 06:52 AM, Fujii Masao wrote:

> Also, I like the following idea from Simon.

> /messages/by-id/CANP8+jLHfBVv_pW6grASNUpW+bdk5DcTu7GWpNAP-+-ZWvKT6w@mail.gmail.com
> -----------------------
> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
> * any k (n1, n2, n3) – would release waiters as soon as we have the
> responses from k out of N standbys. “any k” would be faster, so is
> desirable for performance and resilience

What are we going to do for backwards compatibility, here?

So, here's the dilemma:

If we want to keep backwards compatibility with 9.6, then:

"k (n1, n2, n3)" == "first k (n1, n2, n3)"

However, "first k" is not what most users will want, most of the time;
users of version 13, years from now, will be getting constantly confused
by "first k" behavior when they wanted quorum. So the sensible default
would be:

"k (n1, n2, n3)" == "any k (n1, n2, n3)"

... however, that will break backwards compatibility. Thoughts?

My $0.02 is that we break backwards compat somehow and document the heck
out of it.
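
The dilemma comes down to a single default. Here is a minimal sketch, assuming a simplified keyword check rather than the real grammar: `SyncMethod`, `has_keyword` and `method_for` are hypothetical names, and `bare_default` stands for whichever behavior is chosen for a value written without a keyword.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { METHOD_FIRST, METHOD_ANY } SyncMethod;

/*
 * Return 1 if s starts with the lowercase keyword kw (compared
 * case-insensitively) followed by whitespace, otherwise 0.
 */
static int
has_keyword(const char *s, const char *kw)
{
    size_t i;

    for (i = 0; kw[i] != '\0'; i++)
        if (tolower((unsigned char) s[i]) != kw[i])
            return 0;           /* also stops at the end of s */
    return isspace((unsigned char) s[i]) != 0;
}

/*
 * Decide the method for a synchronous_standby_names-like value.
 * An explicit "any" or "first" always wins; a bare "k (...)" falls back
 * to bare_default, which is the only place backwards compatibility with
 * 9.6 is decided.
 */
static SyncMethod
method_for(const char *value, SyncMethod bare_default)
{
    while (isspace((unsigned char) *value))
        value++;
    if (has_keyword(value, "any"))
        return METHOD_ANY;
    if (has_keyword(value, "first"))
        return METHOD_FIRST;
    return bare_default;
}
```

Passing `METHOD_FIRST` as `bare_default` keeps "2 (n1, n2, n3)" behaving as it does in 9.6; passing `METHOD_ANY` is the compatibility break. Values with an explicit keyword are unaffected either way.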

--
--
Josh Berkus
Red Hat OSAS
(any opinions are my own)

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Masahiko Sawada (#7)

Re: Quorum commit for multiple synchronous replication.

On Wed, Sep 7, 2016 at 12:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> On Tue, Sep 6, 2016 at 11:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

>> On 29 August 2016 at 14:52, Fujii Masao <masao.fujii@gmail.com> wrote:

>>> On Sat, Aug 6, 2016 at 6:36 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

>>>> On 04/08/16 06:40, Masahiko Sawada wrote:

>>>>> On Wed, Aug 3, 2016 at 3:05 PM, Michael Paquier
>>>>> <michael.paquier@gmail.com> wrote:

>>>>>> On Wed, Aug 3, 2016 at 2:52 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>>>>>> wrote:

>>>>>>> I was thinking that the syntax for the quorum method could use '[ ... ]',
>>>>>>> but that would be confused with the '( ... )' used by the priority method.
>>>>>>> The 000 patch adds an 'Any N ( ... )' style syntax, but we might still
>>>>>>> need to discuss a better syntax; discussion is very welcome.
>>>>>>> The draft patches are attached; please give me feedback.

>>>>>> I am +1 for using either "{}" or "[]" to define a quorum set, and -1
>>>>>> for adding a keyword in front of the integer that defines how many
>>>>>> nodes the server needs to wait for.

>>>>> Thank you for the reply.
>>>>> "{}" or "[]" are not bad, but because they are not intuitive, I
>>>>> thought it would be hard for users to use a different method for each
>>>>> purpose.

>>>> I think the "any" keyword is more explicit and understandable, also closer
>>>> to SQL. So I would be in favor of doing that.

>> +1

>>> Also, I like the following idea from Simon.

>>> /messages/by-id/CANP8+jLHfBVv_pW6grASNUpW+bdk5DcTu7GWpNAP-+-ZWvKT6w@mail.gmail.com
>>> -----------------------
>>> * first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
>>> * any k (n1, n2, n3) – would release waiters as soon as we have the
>>> responses from k out of N standbys. “any k” would be faster, so is
>>> desirable for performance and resilience
>>> -----------------------

> +1

>> "synchronous_method" -> "synchronization_method"

> Thanks, will fix.

>> I'm concerned about the performance of this code. Can we work out a
>> way of measuring it, so we can judge how successful we are at
>> releasing waiters quickly? Thanks

> I will measure the performance effect of this code.
> I expect the performance ordering to be:
> 'first 1 (n1, n2)' > 'any 1(n1, n2)' > 'first 2(n1, n2)'
> 'first 1 (n1, n2)' will have the highest throughput.

Sorry, that's wrong.
'any 1 (n1, n2)' will have the highest throughput, or the same as 'first 1 (n1, n2)'.
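
To make the expected ordering concrete, here is a minimal sketch (not code from the patch) that computes how far waiters could be released under each method, given the positions reported by the listed standbys in priority order. The `Lsn` typedef and both function names are assumptions made for illustration, and the sketch assumes every listed standby is connected and streaming.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;	/* stand-in for XLogRecPtr; an assumption of this sketch */

/*
 * 'first k': the k highest-priority standbys must all confirm, so the
 * release point is the oldest position among the first k entries.
 */
Lsn
release_first_k(const Lsn *pos, int n, int k)
{
	assert(k >= 1 && k <= n);

	Lsn		oldest = pos[0];

	for (int i = 1; i < k; i++)
		if (pos[i] < oldest)
			oldest = pos[i];
	return oldest;
}

/*
 * 'any k': the release point is the largest position that at least k of
 * the n listed standbys have already reached (the k-th newest position).
 */
Lsn
release_any_k(const Lsn *pos, int n, int k)
{
	assert(k >= 1 && k <= n);

	Lsn		best = 0;

	for (int i = 0; i < n; i++)
	{
		int		reached = 0;

		/* count standbys that are at or beyond pos[i] */
		for (int j = 0; j < n; j++)
			if (pos[j] >= pos[i])
				reached++;
		if (reached >= k && pos[i] > best)
			best = pos[i];
	}
	return best;
}
```

With positions {100, 300, 200} for n1, n2, n3, 'first 2' can release waiters only up to 100 (it is held back by the lagging n1), while 'any 2' can release up to 200. For the same k and the same list, the 'any' result is never behind the 'first' result, which matches the corrected expectation above.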

For 9.6 we implemented something that allows the DBA to define how
slow programs are. Previously, since 9.1 this was something specified
on the application side. I would like to put it back that way, so we
end up with a parameter on client e.g. commit_quorum = k. Forget the
exact parameters/user API for now, but I'd like to allow the code to
work with user defined settings. Thanks.

I see. The client-side parameter should take effect for the priority method as well.
And, similar to synchronous_commit, the client could specify how many
standbys the master waits for at commit, according to the synchronization
method, even if s_s_names is defined.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Josh Berkus (#8)

Re: Quorum commit for multiple synchronous replication.

On Wed, Sep 7, 2016 at 4:03 AM, Josh Berkus <josh@agliodbs.com> wrote:

On 08/29/2016 06:52 AM, Fujii Masao wrote:

I also like the following idea from Simon.

/messages/by-id/CANP8+jLHfBVv_pW6grASNUpW+bdk5DcTu7GWpNAP-+-ZWvKT6w@mail.gmail.com
-----------------------
* first k (n1, n2, n3) – does the same as k (n1, n2, n3) does now
* any k (n1, n2, n3) – would release waiters as soon as we have the
responses from k out of N standbys. “any k” would be faster, so is
desirable for performance and resilience

What are we going to do for backwards compatibility, here?

So, here's the dilemma:

If we want to keep backwards compatibility with 9.6, then:

"k (n1, n2, n3)" == "first k (n1, n2, n3)"

However, "first k" is not what most users will want, most of the time;
users of version 13, years from now, will be getting constantly confused
by "first k" behavior when they wanted quorum. So the sensible default
would be:

"k (n1, n2, n3)" == "any k (n1, n2, n3)"

+1.

"k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
compatibility, but most users would read "k (n1, n2, n3)" as quorum
once quorum commit is introduced.
I wish we could change the s_s_names syntax of 9.6 to the "first k (n1, n2,
n3)" style before 9.6 is released, if we reach consensus.
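
The dilemma can be sketched as a choice of default for the bare form. The helper below is purely hypothetical (it is not the flex/bison parser and does not handle quoted names); it only shows that 'first' and 'any' are unambiguous, while the meaning of a bare "k (n1, n2, n3)" depends entirely on which default a release picks.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>
#include <strings.h>	/* strncasecmp (POSIX) */

typedef enum { METHOD_PRIORITY, METHOD_QUORUM } SyncMethod;

/*
 * Hypothetical helper: decide the method from the leading keyword.
 * bare_default is the method assigned to the keyword-less "k ( ... )" form.
 */
SyncMethod
classify_setting(const char *s, SyncMethod bare_default)
{
	assert(s != NULL);

	while (isspace((unsigned char) *s))
		s++;

	/* explicit keywords are case-insensitive and never ambiguous */
	if (strncasecmp(s, "first", 5) == 0 && isspace((unsigned char) s[5]))
		return METHOD_PRIORITY;
	if (strncasecmp(s, "any", 3) == 0 && isspace((unsigned char) s[3]))
		return METHOD_QUORUM;

	/* bare "k ( ... )": this is the case the thread is arguing about */
	if (strchr(s, '(') != NULL)
		return bare_default;

	/* plain "n1, n2" list: the pre-9.6 syntax, always priority */
	return METHOD_PRIORITY;
}
```

Calling `classify_setting("2 (n1, n2, n3)", METHOD_PRIORITY)` models the 9.6-compatible reading, and passing `METHOD_QUORUM` models the "sensible default" reading; settings that spell out `first` or `any` give the same answer under either default.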

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#11

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Masahiko Sawada (#10)

Re: Quorum commit for multiple synchronous replication.

On Thu, Sep 8, 2016 at 6:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
compatibility, but most users would read "k (n1, n2, n3)" as quorum
once quorum commit is introduced.
I wish we could change the s_s_names syntax of 9.6 to the "first k (n1, n2,
n3)" style before 9.6 is released, if we reach consensus.

Considering breaking backward-compatibility in the next release does
not sound like a good idea to me for a new feature that is going to be
GA soon.
--
Michael


#12

Vik Fearing

vik@2ndquadrant.fr

over 9 years ago

In reply to: Michael Paquier (#11)

Re: Quorum commit for multiple synchronous replication.

On 09/09/2016 03:28 AM, Michael Paquier wrote:

On Thu, Sep 8, 2016 at 6:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
compatibility, but most users would read "k (n1, n2, n3)" as quorum
once quorum commit is introduced.
I wish we could change the s_s_names syntax of 9.6 to the "first k (n1, n2,
n3)" style before 9.6 is released, if we reach consensus.

Considering breaking backward-compatibility in the next release does
not sound like a good idea to me for a new feature that is going to be
GA soon.

Indeed. I'll vote for pulling a fast one on 9.6 for this.
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


#13

Petr Jelinek

petr@2ndquadrant.com

over 9 years ago

In reply to: Vik Fearing (#12)

Re: Quorum commit for multiple synchronous replication.

On 09/09/16 08:23, Vik Fearing wrote:

On 09/09/2016 03:28 AM, Michael Paquier wrote:

On Thu, Sep 8, 2016 at 6:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
compatibility, but most users would read "k (n1, n2, n3)" as quorum
once quorum commit is introduced.
I wish we could change the s_s_names syntax of 9.6 to the "first k (n1, n2,
n3)" style before 9.6 is released, if we reach consensus.

Considering breaking backward-compatibility in the next release does
not sound like a good idea to me for a new feature that is going to be
GA soon.

Indeed. I'll vote for pulling a fast one on 9.6 for this.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14

Simon Riggs

simon@2ndquadrant.com

over 9 years ago

In reply to: Masahiko Sawada (#10)

Re: Quorum commit for multiple synchronous replication.

On 8 September 2016 at 10:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
compatibility, but most users would read "k (n1, n2, n3)" as quorum
once quorum commit is introduced.
I wish we could change the s_s_names syntax of 9.6 to the "first k (n1, n2,
n3)" style before 9.6 is released, if we reach consensus.

Let's see the proposed patch, so we can evaluate the proposal.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#15

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Simon Riggs (#14)

2 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Fri, Sep 9, 2016 at 6:23 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 September 2016 at 10:26, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"k (n1, n2, n3)" == "first k (n1, n2, n3)" doesn't break backward
compatibility, but most users would read "k (n1, n2, n3)" as quorum
once quorum commit is introduced.
I wish we could change the s_s_names syntax of 9.6 to the "first k (n1, n2,
n3)" style before 9.6 is released, if we reach consensus.

Let's see the proposed patch, so we can evaluate the proposal.

I've attached two patches.
The 000 patch changes the s_s_names syntax from 'k (n1, n2, n3)' to
'First k (n1, n2, n3)' for PG9.6.
The 001 patch adds quorum commit using the syntax 'Any k (n1, n2, n3)' for PG10.

Since 9.6RC1 has already been released, I understand that it's quite hard
to change the syntax in 9.6.
But given that we intend to support quorum commit, this could be one way
to avoid breaking backward compatibility while still providing a useful
user interface.
So I have attached these patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_change_syntax_96.patch (text/x-patch; charset=US-ASCII)

commit bd18dda9be5ab0341eca81de3c48ec6f7466dded
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Date:   Fri Sep 16 15:32:24 2016 -0700

    Change syntax

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cd66abc..f0f510c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3037,7 +3037,7 @@ include_dir 'conf.d'
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
@@ -3048,7 +3048,9 @@ include_dir 'conf.d'
         <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
         until their WAL records are received by three higher-priority standbys
         chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        <literal>s3</> and <literal>s4</>. <literal>FIRST</> is
+        case-insensitive but the standby having name <literal>FIRST</>
+        must be double-quoted.
         </para>
         <para>
         The second syntax was used before <productname>PostgreSQL</>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..84ccb6e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..b6d2f6c 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -61,7 +61,7 @@ result:
 
 standby_config:
 		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $4); }
 	;
 
 standby_list:
@@ -70,7 +70,7 @@ standby_list:
 	;
 
 standby_name:
-		NAME						{ $$ = $1; }
+		NAME						{ $$ = $1;}
 		| NUM						{ $$ = $1; }
 	;
 %%
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..9dbdfbc 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -64,6 +64,11 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+first		{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
+
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
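
One detail in the grammar hunk of the 000 patch is worth a second look: the new rule `FIRST NUM '(' standby_list ')'` passes `$1` to create_syncrep_config(), and in that rule `$1` is the text of the FIRST token while NUM is `$2` (the 001 patch below does pass `$2`). Assuming num_sync is converted with atoi(), as the `config->num_sync = atoi(num_sync)` context line in the 001 patch suggests, the following sketch shows what each choice would yield. The function name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the num_sync conversion: atoi() yields 0
 * when the text does not start with a number.
 */
int
parse_num_sync(const char *token_text)
{
	assert(token_text != NULL);
	return atoi(token_text);
}
```

Here `parse_num_sync("first")` gives 0 and `parse_num_sync("2")` gives 2, so with `$1` a setting such as 'FIRST 2 (s1, s2, s3)' would appear to end up with num_sync = 0 rather than 2.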

001_quorum_commit_v2.patch (text/x-patch; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f0f510c..0ad06ad 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3021,44 +3021,68 @@ include_dir 'conf.d'
         There will be one or more active synchronous standbys;
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
-        The synchronous standbys will be those whose names appear
-        earlier in this list, and
-        that are both currently connected and streaming data in real-time
-        (as shown by a state of <literal>streaming</literal> in the
-        <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
         Specifying more than one standby name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ FIRST | ANY ] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
+
+
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>. <literal>FIRST</> is
-        case-insensitive but the standby having name <literal>FIRST</>
-        must be double-quoted.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method by
+        which the master server controls the standby servers.
+        </para>
+
+        <para>
+        <literal>FIRST</> means to control the standby servers with
+        different priorities. The synchronous standbys will be those
+        whose name appear earlier in this list, and that are both
+        currently connected and streaming data in real-time(as shown
+        by a state of <literal>streaming</> in the
+        <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view). Other standby
+        servers appearing later in this list represent potential
+        synchronous standbys. If any of the current synchronous
+        standbys disconnects for whatever reason, it will be replaced
+        immediately with the next-highest-priority standby.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+
+       <para>
+       <literal>ANY</> means to control all of standby servers with
+       same priority. The master server will wait for receipt from
+       at least <replaceable class="parameter">num_sync</replaceable>
+       standbys, which is quorum commit in the literature. The all of
+       listed standbys are considered as candidate of quorum commit.
+       For example, a setting of <literal> ANY 3 (s1, s2, s3, s4)</> makes
+       transaction commits wait until receiving receipts from at least
+       any three standbys of four listed servers <literal>s1</>,
+       <literal>s2</>, <literal>s3</>, <literal>s4</>.
+       </para>
+
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive words,
+        and a standby name matching either of them must be double-quoted.
         </para>
+
         <para>
         The second syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
        </para>
        <para>
         The name of a standby server for this purpose is the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 84ccb6e..8a9e65d 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1134,7 +1134,7 @@ primary_slot_name = 'node_a_slot'
 
    <para>
     Synchronous replication supports one or more synchronous standby servers;
-    transactions will wait until all the standby servers which are considered
+    transactions will wait until the multiple standby servers which are considered
     as synchronous confirm receipt of their data. The number of synchronous
     standbys that transactions must wait for replies from is specified in
     <varname>synchronous_standby_names</>. This parameter also specifies
@@ -1161,6 +1161,18 @@ synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
     <literal>s2</> fails. <literal>s4</> is an asynchronous standby since
     its name is not in the list.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standby is:
+<programlisting>
+ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the all of listed standbys
+    will be considered as candidate of quorum commit. The master server will
+    wait for at least 2 replies from any of three standbys. <literal>s4</> is
+    an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0776428..ad2c8e6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1220,7 +1220,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     <row>
      <entry><structfield>sync_state</></entry>
      <entry><type>text</></entry>
-     <entry>Synchronous state of this standby server</entry>
+     <entry>Synchronous state of this standby server. <literal>quorum</>
+      when standby is considered as a candidate of quorum commit.</entry>
     </row>
    </tbody>
    </tgroup>
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index b442d06..bc67fce 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -76,9 +76,9 @@ char	   *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+SyncRepConfigData *SyncRepConfig = NULL;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
+static bool announce_next_takeover = true;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
@@ -89,7 +89,12 @@ static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 						   XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr,
 						   bool *am_sync);
+static bool SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr,
+						   XLogRecPtr *flushPtr,
+						   XLogRecPtr *applyPtr,
+						   int pos, bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -384,7 +389,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -411,11 +416,16 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		got_recptr = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
+											 &applyPtr, &am_sync);
+	else /* SYNC_REP_QUORUM */
+		got_recptr = SyncRepGetNNewestSyncRecPtr(&writePtr, &flushPtr,
+											  &applyPtr, SyncRepConfig->num_sync,
+											  &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -433,7 +443,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -469,6 +479,88 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
+ * Calculate the 'pos' newest Write, Flush and Apply positions among sync standbys.
+ *
+ * Return false if the number of sync standbys is less than
+ * synchronous_standby_names specifies. Otherwise return true and
+ * store the 'pos' newest positions into *writePtr, *flushPtr, *applyPtr.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static bool
+SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+						XLogRecPtr *applyPtr, int pos, bool *am_sync)
+{
+	XLogRecPtr	*write_array;
+	XLogRecPtr	*flush_array;
+	XLogRecPtr	*apply_array;
+	List	   *sync_standbys;
+	ListCell   *cell;
+	int			len;
+	int			i = 0;
+
+	*writePtr = InvalidXLogRecPtr;
+	*flushPtr = InvalidXLogRecPtr;
+	*applyPtr = InvalidXLogRecPtr;
+	*am_sync = false;
+
+	/* Get standbys that are considered as synchronous at this moment */
+	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+
+	/*
+	 * Quick exit if we are not managing a sync standby or there are not
+	 * enough synchronous standbys.
+	 */
+	if (!(*am_sync) ||
+		SyncRepConfig == NULL ||
+		list_length(sync_standbys) < SyncRepConfig->num_sync)
+	{
+		list_free(sync_standbys);
+		return false;
+	}
+
+	len = list_length(sync_standbys);
+	write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+	flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+	apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+	/*
+	 * Scan through all sync standbys and calculate 'pos' Newest
+	 * Write, Flush and Apply positions.
+	 */
+	foreach (cell, sync_standbys)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+		SpinLockAcquire(&walsnd->mutex);
+		write_array[i] = walsnd->write;
+		flush_array[i]= walsnd->flush;
+		apply_array[i] = walsnd->flush;
+		SpinLockRelease(&walsnd->mutex);
+
+		i++;
+	}
+
+	/* Sort each array in descending order to get 'pos' newest element */
+	qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+	qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+	qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+	/* Get 'pos' newest Write, Flush, Apply positions */
+	*writePtr = write_array[pos - 1];
+	*flushPtr = flush_array[pos - 1];
+	*applyPtr = apply_array[pos - 1];
+
+	pfree(write_array);
+	pfree(flush_array);
+	pfree(apply_array);
+	list_free(sync_standbys);
+
+	return true;
+}
+
+/*
  * Calculate the oldest Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
@@ -506,12 +598,12 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	}
 
 	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
+	 * Scan through all sync standbys and calculate the oldest
+	 * Write, Flush and Apply positions.
 	 */
-	foreach(cell, sync_standbys)
+	foreach (cell, sync_standbys)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
@@ -535,17 +627,88 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using according to synchronous method,
+ * or NIL if no sync standby is connected. The caller must hold SyncRepLock.
  *
- * If there are multiple standbys with the same priority,
- * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else /* SYNC_REP_QUORUM */
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+}
+
+/*
+ * Return the list of sync standbys using quorum method, or
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
+ * the first one found is selected preferentially.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 List *
-SyncRepGetSyncStandbys(bool *am_sync)
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -558,14 +721,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
-
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
 
@@ -747,6 +902,10 @@ SyncRepGetStandbyPriority(void)
 		standby_name += strlen(standby_name) + 1;
 	}
 
+	/* In quorum method, all sync standby priorities are always 1 */
+	if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		priority = 1;
+
 	return (found ? priority : 0);
 }
 
@@ -890,6 +1049,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index b6d2f6c..e1335d1 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK FIRST
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,9 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $4); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +78,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +99,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index 9dbdfbc..319abdc 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -69,6 +69,10 @@ first		{
 				return FIRST;
 		}
 
+any		{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c7743da..00467a4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2862,7 +2862,8 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index f614805..18793f3 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -124,7 +124,6 @@ typedef struct xl_brin_revmap_extend
 #define SizeOfBrinRevmapExtend	(offsetof(xl_brin_revmap_extend, targetBlk) + \
 								 sizeof(BlockNumber))
 
-
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *brin_identify(uint8 info);
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..4ec1e47 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronous method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
@@ -68,6 +75,8 @@ extern void SyncRepReleaseWaiters(void);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
+extern List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+extern List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);

#16

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Masahiko Sawada (#15)

Re: Quorum commit for multiple synchronous replication.

On Sat, Sep 17, 2016 at 2:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Since we already released 9.6RC1, I understand that it's quite hard to
change syntax of 9.6.
But considering that we support the quorum commit, this could be one
of the solutions in order to avoid breaking backward compatibility and
to provide useful user interface.
So I attached these patches.

 standby_config:
-        standby_list                { $$ = create_syncrep_config("1", $1); }
-        | FIRST NUM '(' standby_list ')'    { $$ =
create_syncrep_config($1, $4); }
+        standby_list                        { $$ =
create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+        | ANY NUM '(' standby_list ')'        { $$ =
create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+        | FIRST NUM '(' standby_list ')'    { $$ =
create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }

Reading again the thread, it seems that my previous post [1] was a bit
misunderstood. My position is to not introduce any new behavior
changes in 9.6, so we could just make the FIRST NUM grammar equivalent
to NUM.

[1]: /messages/by-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4nJgmMgiAgvcSDKA@mail.gmail.com
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Vik Fearing

vik@2ndquadrant.fr

over 9 years ago

In reply to: Michael Paquier (#16)

Re: Quorum commit for multiple synchronous replication.

On 09/21/2016 08:30 AM, Michael Paquier wrote:

On Sat, Sep 17, 2016 at 2:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Since we already released 9.6RC1, I understand that it's quite hard to
change syntax of 9.6.
But considering that we support the quorum commit, this could be one
of the solutions in order to avoid breaking backward compatibility and
to provide useful user interface.
So I attached these patches.
standby_config:
-        standby_list                { $$ = create_syncrep_config("1", $1); }
-        | FIRST NUM '(' standby_list ')'    { $$ =
create_syncrep_config($1, $4); }
+        standby_list                        { $$ =
create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+        | ANY NUM '(' standby_list ')'        { $$ =
create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+        | FIRST NUM '(' standby_list ')'    { $$ =
create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
Reading again the thread, it seems that my previous post [1] was a bit
misunderstood. My position is to not introduce any new behavior
changes in 9.6, so we could just make the FIRST NUM grammar equivalent
to NUM.

[1]: /messages/by-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4nJgmMgiAgvcSDKA@mail.gmail.com

I misunderstood your intent, then. But I still stand by what I did
understand, namely that 'k (...)' should mean 'any k (...)'. It's much
more natural than having it mean 'first k (...)' and I also think it
will be more frequent in practice.
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


#18

Petr Jelinek

petr@2ndquadrant.com

over 9 years ago

In reply to: Vik Fearing (#17)

Re: Quorum commit for multiple synchronous replication.

On 21/09/16 09:18, Vik Fearing wrote:

On 09/21/2016 08:30 AM, Michael Paquier wrote:
On Sat, Sep 17, 2016 at 2:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Since we already released 9.6RC1, I understand that it's quite hard to
change syntax of 9.6.
But considering that we support the quorum commit, this could be one
of the solutions in order to avoid breaking backward compatibility and
to provide useful user interface.
So I attached these patches.
standby_config:
-        standby_list                { $$ = create_syncrep_config("1", $1); }
-        | FIRST NUM '(' standby_list ')'    { $$ =
create_syncrep_config($1, $4); }
+        standby_list                        { $$ =
create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+        | ANY NUM '(' standby_list ')'        { $$ =
create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+        | FIRST NUM '(' standby_list ')'    { $$ =
create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
Reading again the thread, it seems that my previous post [1] was a bit
misunderstood. My position is to not introduce any new behavior
changes in 9.6, so we could just make the FIRST NUM grammar equivalent
to NUM.

[1]: /messages/by-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4nJgmMgiAgvcSDKA@mail.gmail.com
I misunderstood your intent, then. But I still stand by what I did
understand, namely that 'k (...)' should mean 'any k (...)'. It's much
more natural than having it mean 'first k (...)' and I also think it
will be more frequent in practice.

I think so as well.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#19

Robert Haas

robertmhaas@gmail.com

over 9 years ago

In reply to: Petr Jelinek (#18)

Re: Quorum commit for multiple synchronous replication.

On Wed, Sep 21, 2016 at 5:54 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Reading again the thread, it seems that my previous post [1] was a bit
misunderstood. My position is to not introduce any new behavior
changes in 9.6, so we could just make the FIRST NUM grammar equivalent
to NUM.

[1]: /messages/by-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4nJgmMgiAgvcSDKA@mail.gmail.com

I misunderstood your intent, then. But I still stand by what I did
understand, namely that 'k (...)' should mean 'any k (...)'. It's much
more natural than having it mean 'first k (...)' and I also think it
will be more frequent in practice.

I think so as well.

Well, I agree, but I think making behavior changes after rc1 is a
non-starter. It's better to live with the incompatibility than to
change the behavior so close to release. At least, that's my
position. Getting the release out on time with a minimal bug count is
more important to me than a minor incompatibility in the meaning of
one GUC.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Robert Haas (#19)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Wed, Sep 21, 2016 at 11:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 21, 2016 at 5:54 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Reading again the thread, it seems that my previous post [1] was a bit
misunderstood. My position is to not introduce any new behavior
changes in 9.6, so we could just make the FIRST NUM grammar equivalent
to NUM.

[1]: /messages/by-id/CAB7nPqRDvJn18e54ccNpOP1A2_iUN6-iU=4nJgmMgiAgvcSDKA@mail.gmail.com

I misunderstood your intent, then. But I still stand by what I did
understand, namely that 'k (...)' should mean 'any k (...)'. It's much
more natural than having it mean 'first k (...)' and I also think it
will be more frequent in practice.

I think so as well.

Well, I agree, but I think making behavior changes after rc1 is a
non-starter. It's better to live with the incompatibility than to
change the behavior so close to release. At least, that's my
position. Getting the release out on time with a minimal bug count is
more important to me than a minor incompatibility in the meaning of
one GUC.

As the release team announced, it's better to postpone changing the
syntax of the existing synchronous_standby_names (s_s_names).
I still vote for changing the behaviour of the existing syntax
'k (n1, n2)' to quorum commit.
That is,
1. 'First k (n1, n2, n3)' means that the master server waits for ACKs
from the k standby servers whose names appear earliest in the list.
2. 'Any k (n1, n2, n3)' means that the master server waits for ACKs
from any k listed standby servers.
3. 'n1, n2, n3' is the same as #1 with k=1.
4. '(n1, n2, n3)' is the same as #2 with k=1.

Attached updated patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

quorum_commit_v3.patchapplication/octet-stream; name=quorum_commit_v3.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a848a7e..31027da 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3026,42 +3026,62 @@ include_dir 'conf.d'
         There will be one or more active synchronous standbys;
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
-        The synchronous standbys will be those whose names appear
-        earlier in this list, and
-        that are both currently connected and streaming data in real-time
-        (as shown by a state of <literal>streaming</literal> in the
-        <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method
+        that the master server uses to control the standby servers.
+        </para>
+        <para>
+        <literal>FIRST</> means to control the standby servers with
+        different priorities. The synchronous standbys will be those
+        whose names appear earlier in this list, and that are both
+        currently connected and streaming data in real-time (as shown
+        by a state of <literal>streaming</> in the
+        <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view). Other standby
+        servers appearing later in this list represent potential
+        synchronous standbys. If any of the current synchronous
+        standbys disconnects for whatever reason, it will be replaced
+        immediately with the next-highest-priority standby.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+        <para>
+        <literal>ANY</> means to control all of the standby servers with
+        the same priority. The master server will wait for receipt from
+        at least <replaceable class="parameter">num_sync</replaceable>
+        standbys, which is known as quorum commit in the literature. All of
+        the listed standbys are considered candidates for quorum commit.
+        For example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receipts arrive from at least
+        any three of the four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.
+        </para>
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive words,
+        and a standby name that matches either word must be double-quoted.
         </para>
         <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        The third syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
        </para>
        <para>
         The name of a standby server for this purpose is the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..bd9f427 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1134,7 +1134,7 @@ primary_slot_name = 'node_a_slot'
 
    <para>
     Synchronous replication supports one or more synchronous standby servers;
-    transactions will wait until all the standby servers which are considered
+    transactions will wait until the multiple standby servers which are considered
     as synchronous confirm receipt of their data. The number of synchronous
     standbys that transactions must wait for replies from is specified in
     <varname>synchronous_standby_names</>. This parameter also specifies
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'First 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1161,6 +1161,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     <literal>s2</> fails. <literal>s4</> is an asynchronous standby since
     its name is not in the list.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standbys is:
+<programlisting>
+synchronous_standby_names = 'Any 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+    <literal>s2</> and <literal>s3</> will be considered as synchronous standby
+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0776428..dd47839 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1220,7 +1220,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     <row>
      <entry><structfield>sync_state</></entry>
      <entry><type>text</></entry>
-     <entry>Synchronous state of this standby server</entry>
+     <entry>Synchronous state of this standby server. <literal>quorum</>
+      when the standby is considered a candidate for quorum commit.</entry>
     </row>
    </tbody>
    </tgroup>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index b442d06..bc67fce 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -76,9 +76,9 @@ char	   *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+SyncRepConfigData *SyncRepConfig = NULL;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
+static bool announce_next_takeover = true;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
@@ -89,7 +89,12 @@ static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 						   XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr,
 						   bool *am_sync);
+static bool SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr,
+						   XLogRecPtr *flushPtr,
+						   XLogRecPtr *applyPtr,
+						   int pos, bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -384,7 +389,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -411,11 +416,16 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys according to the configured method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		got_recptr = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
+											 &applyPtr, &am_sync);
+	else /* SYNC_REP_QUORUM */
+		got_recptr = SyncRepGetNNewestSyncRecPtr(&writePtr, &flushPtr,
+											  &applyPtr, SyncRepConfig->num_sync,
+											  &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -433,7 +443,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -469,6 +479,88 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
+ * Calculate the 'pos' newest Write, Flush and Apply positions among sync standbys.
+ *
+ * Return false if the number of sync standbys is less than
+ * synchronous_standby_names specifies. Otherwise return true and
+ * store the 'pos' newest positions into *writePtr, *flushPtr, *applyPtr.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static bool
+SyncRepGetNNewestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+						XLogRecPtr *applyPtr, int pos, bool *am_sync)
+{
+	XLogRecPtr	*write_array;
+	XLogRecPtr	*flush_array;
+	XLogRecPtr	*apply_array;
+	List	   *sync_standbys;
+	ListCell   *cell;
+	int			len;
+	int			i = 0;
+
+	*writePtr = InvalidXLogRecPtr;
+	*flushPtr = InvalidXLogRecPtr;
+	*applyPtr = InvalidXLogRecPtr;
+	*am_sync = false;
+
+	/* Get standbys that are considered as synchronous at this moment */
+	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+
+	/*
+	 * Quick exit if we are not managing a sync standby or there are not
+	 * enough synchronous standbys.
+	 */
+	if (!(*am_sync) ||
+		SyncRepConfig == NULL ||
+		list_length(sync_standbys) < SyncRepConfig->num_sync)
+	{
+		list_free(sync_standbys);
+		return false;
+	}
+
+	len = list_length(sync_standbys);
+	write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+	flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+	apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+	/*
+	 * Scan through all sync standbys and calculate 'pos' Newest
+	 * Write, Flush and Apply positions.
+	 */
+	foreach (cell, sync_standbys)
+	{
+		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+		SpinLockAcquire(&walsnd->mutex);
+		write_array[i] = walsnd->write;
+		flush_array[i] = walsnd->flush;
+		apply_array[i] = walsnd->apply;
+		SpinLockRelease(&walsnd->mutex);
+
+		i++;
+	}
+
+	/* Sort each array in descending order to get 'pos' newest element */
+	qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+	qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+	qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+	/* Get 'pos' newest Write, Flush, Apply positions */
+	*writePtr = write_array[pos - 1];
+	*flushPtr = flush_array[pos - 1];
+	*applyPtr = apply_array[pos - 1];
+
+	pfree(write_array);
+	pfree(flush_array);
+	pfree(apply_array);
+	list_free(sync_standbys);
+
+	return true;
+}
+
+/*
  * Calculate the oldest Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
@@ -506,12 +598,12 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	}
 
 	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
+	 * Scan through all sync standbys and calculate the oldest
+	 * Write, Flush and Apply positions.
 	 */
-	foreach(cell, sync_standbys)
+	foreach (cell, sync_standbys)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
 		XLogRecPtr	write;
 		XLogRecPtr	flush;
 		XLogRecPtr	apply;
@@ -535,17 +627,88 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys according to the synchronous method,
+ * or NIL if no sync standby is connected. The caller must hold SyncRepLock.
  *
- * If there are multiple standbys with the same priority,
- * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool *am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else /* SYNC_REP_QUORUM */
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+}
+
+/*
+ * Return the list of sync standbys using quorum method, or
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
+ * the first one found is selected preferentially.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 List *
-SyncRepGetSyncStandbys(bool *am_sync)
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -558,14 +721,6 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
-
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
 
@@ -747,6 +902,10 @@ SyncRepGetStandbyPriority(void)
 		standby_name += strlen(standby_name) + 1;
 	}
 
+	/* In quorum method, all sync standby priorities are always 1 */
+	if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		priority = 1;
+
 	return (found ? priority : 0);
 }
 
@@ -890,6 +1049,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,14 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
+{first_ident}	{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c7743da..00467a4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2862,7 +2862,8 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..1b675ee 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
@@ -68,6 +75,8 @@ extern void SyncRepReleaseWaiters(void);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
+extern List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+extern List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..63cd88c 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 10;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'First 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'First 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'First 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'First 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,25 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check that the state of standbys listed as a voter are having
+# same priority when synchronous_standby_names uses quorum method.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'Any 2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that set setting of 'Any 2(*)' chooses all standbys as
+# voter.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as condidates for quorum commit',
+'Any 2(*)');

#21

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Masahiko Sawada (#20)

Re: Quorum commit for multiple synchronous replication.

On Sat, Sep 24, 2016 at 5:37 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I still vote for changing behaviour of existing syntax 'k (n1, n2)' to
quorum commit.
That is,
1. 'First k (n1, n2, n3)' means that the master server waits for ACKs
from k standby servers whose name appear earlier in the list.
2. 'Any k (n1, n2, n3)' means that the master server waits for ACKs
from any k listed standby servers.
3. 'n1, n2, n3' is the same as #1 with k=1.
4. '(n1, n2, n3)' is the same as #2 with k=1.
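
As a rough illustration of the difference between forms #1 and #2 above, here is a minimal sketch. It is not code from the patch. It assumes each connected standby has reported one acknowledged WAL position, listed in the same order as in the setting; `Lsn`, `release_point_first` and `release_point_any` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr in this sketch */

/* qsort comparator: orders LSNs from newest (largest) to oldest */
static int
cmp_lsn_desc(const void *a, const void *b)
{
    Lsn x = *(const Lsn *) a;
    Lsn y = *(const Lsn *) b;

    return (x < y) - (x > y);
}

/*
 * 'First k': acked[] is in list order and holds connected standbys only.
 * A commit can be released up to the oldest position among the first k.
 */
static Lsn
release_point_first(const Lsn *acked, int n, int k)
{
    Lsn min;

    assert(k >= 1 && k <= n);
    min = acked[0];
    for (int i = 1; i < k; i++)
    {
        if (acked[i] < min)
            min = acked[i];
    }
    return min;
}

/*
 * 'Any k': list order does not matter. A commit can be released up to the
 * k-th newest position, because k standbys have reached at least that point.
 */
static Lsn
release_point_any(const Lsn *acked, int n, int k)
{
    Lsn *sorted;
    Lsn result;

    assert(k >= 1 && k <= n);
    sorted = malloc(n * sizeof(Lsn));
    assert(sorted != NULL);
    memcpy(sorted, acked, n * sizeof(Lsn));
    qsort(sorted, n, sizeof(Lsn), cmp_lsn_desc);
    result = sorted[k - 1];
    free(sorted);
    return result;
}
```

With positions 100, 400, 300, 200 and k = 2, the first function is held back by the slow first-listed standby, while the second can already release up to the position that two standbys have reached.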

OK, so I have done a review of this patch keeping that in mind as
that's the consensus. I am still getting familiar with the code...

-    transactions will wait until all the standby servers which are considered
+    transactions will wait until the multiple standby servers which
are considered
There is no real need to update this sentence.

+        <literal>FIRST</> means to control the standby servers with
+        different priorities. The synchronous standbys will be those
+        whose name appear earlier in this list, and that are both
+        currently connected and streaming data in real-time(as shown
+        by a state of <literal>streaming</> in the
+        <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view). Other standby
+        servers appearing later in this list represent potential
+        synchronous standbys. If any of the current synchronous
+        standbys disconnects for whatever reason, it will be replaced
+        immediately with the next-highest-priority standby.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
It does not seem necessary to me to go into this level of detail:
The keyword FIRST, coupled with an integer number N, chooses the first
N higher-priority standbys and makes transaction commit when their WAL
records are received. For example <literal>FIRST 3 (s1, s2, s3, s4)</>
makes transaction commits wait until their WAL records are received by
the three high-priority standbys chosen from standby servers s1, s2,
s3 and s4.

+        <literal>ANY</> means to control all of standby servers with
+        same priority. The master sever will wait for receipt from
+        at least <replaceable class="parameter">num_sync</replaceable>
+        standbys, which is quorum commit in the literature. The all of
+        listed standbys are considered as candidate of quorum commit.
+        For example, a setting of <literal> ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving receipts from at least
+        any three standbys of four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.

Similarly, something like that...
The keyword ANY, coupled with an integer number N, chooses N standbys
in a set of standbys with the same, lowest, priority and makes
transaction commit when WAL records are received by those N standbys. For
example ANY 3(s1, s2, s3, s4) makes transaction commits wait until WAL
records have been received from 3 servers in the set s1, s2, s3 and
s4.

It could be good also to mention that no keyword specified means ANY,
which is incompatible with 9.6. The docs also miss the fact that if a
simple list of servers is given, without parentheses or keywords,
this is equivalent to FIRST 1.

-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'First 2 (s1, s2, s3)'
Nit here. It may be a good idea to just use upper-case characters in
the docs, or just lower-case for consistency, but not mix both.
Usually GUCs use lower-case characters.

+ when standby is considered as a condidate of quorum commit.</entry>
s/condidate/candidate/

-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
Hm... Is that actually a good idea? Now "NODE" and "node" are two
different things for application_name, but with this patch both would
have the same meaning. I am starting to think that we could just use
the lower-case characters for the keywords any/first. Is this -i
switch a problem for elements in standby_list?

+ * Calculate the 'pos' newest Write, Flush and Apply positions among
sync standbys.
I don't understand this comment.

+   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+       got_recptr = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
+                                            &applyPtr, &am_sync);
+   else /* SYNC_REP_QUORUM */
+       got_recptr = SyncRepGetNNewestSyncRecPtr(&writePtr, &flushPtr,
+                                             &applyPtr,
SyncRepConfig->num_sync,
+                                             &am_sync);
Those could be grouped together, there is no need to have pos as an argument.

+   /* In quroum method, all sync standby priorities are always 1 */
+   if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+       priority = 1;
This is dead code, SyncRepGetSyncStandbysPriority is not called for
QUORUM. You may want to add an assert in
SyncRepGetSyncStandbysPriority and SyncRepGetSyncStandbysQuorum to be
sure that they are getting called for the correct method.
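
The assert suggestion could look roughly like the sketch below. This is only an illustration under stated assumptions: `SketchConfig` is a cut-down, hypothetical stand-in for `SyncRepConfigData`, the three functions are placeholders rather than the real `SyncRepGetSyncStandbys*` routines, and standard `assert()` is used in place of PostgreSQL's `Assert()` so the example stands alone.

```c
#include <assert.h>

/* Values mirror the #defines the patch adds to syncrep.h */
#define SYNC_REP_PRIORITY   0
#define SYNC_REP_QUORUM     1

/* Hypothetical, cut-down stand-in for SyncRepConfigData */
typedef struct SketchConfig
{
    int num_sync;
    int sync_method;
} SketchConfig;

/* Placeholder for the priority-method routine: must only see FIRST configs */
static int
get_standbys_priority(const SketchConfig *config)
{
    assert(config->sync_method == SYNC_REP_PRIORITY);
    return SYNC_REP_PRIORITY;   /* placeholder: reports which path ran */
}

/* Placeholder for the quorum-method routine: must only see ANY configs */
static int
get_standbys_quorum(const SketchConfig *config)
{
    assert(config->sync_method == SYNC_REP_QUORUM);
    return SYNC_REP_QUORUM;     /* placeholder: reports which path ran */
}

/* Dispatcher: picks the routine that matches the configured method */
static int
get_standbys(const SketchConfig *config)
{
    if (config->sync_method == SYNC_REP_PRIORITY)
        return get_standbys_priority(config);
    return get_standbys_quorum(config);
}
```

The asserts cost nothing in production builds and would catch a caller that bypasses the dispatcher with the wrong method.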

+   /* Sort each array in descending order to get 'pos' newest element */
+   qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+   qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+   qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
There is no need to reorder things again and to use arrays, you can
choose the newest LSNs when scanning the WalSnd entries.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Michael Paquier (#21)

Re: Quorum commit for multiple synchronous replication.

On Wed, Sep 28, 2016 at 5:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

OK, so I have done a review of this patch keeping that in mind as
that's the consensus. I am still getting familiar with the code...

Returned with feedback for now. This just needs polishing so feel free
to move it to the next CF once you have a new patch.
--
Michael


#23

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Michael Paquier (#21)

Re: Quorum commit for multiple synchronous replication.

On Wed, Sep 28, 2016 at 5:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Sat, Sep 24, 2016 at 5:37 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I still vote for changing behaviour of existing syntax 'k (n1, n2)' to
quorum commit.
That is,
1. 'First k (n1, n2, n3)' means that the master server waits for ACKs
from k standby servers whose name appear earlier in the list.
2. 'Any k (n1, n2, n3)' means that the master server waits for ACKs
from any k listed standby servers.
3. 'n1, n2, n3' is the same as #1 with k=1.
4. '(n1, n2, n3)' is the same as #2 with k=1.

OK, so I have done a review of this patch keeping that in mind as
that's the consensus. I am still getting familiar with the code...

Thank you for reviewing!

-    transactions will wait until all the standby servers which are considered
+    transactions will wait until the multiple standby servers which
are considered
There is no real need to update this sentence.

+        <literal>FIRST</> means to control the standby servers with
+        different priorities. The synchronous standbys will be those
+        whose name appear earlier in this list, and that are both
+        currently connected and streaming data in real-time(as shown
+        by a state of <literal>streaming</> in the
+        <link linkend="monitoring-stats-views-table">
+        <literal>pg_stat_replication</></link> view). Other standby
+        servers appearing later in this list represent potential
+        synchronous standbys. If any of the current synchronous
+        standbys disconnects for whatever reason, it will be replaced
+        immediately with the next-highest-priority standby.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
It does not seem necessary to me to go into this level of detail:
The keyword FIRST, coupled with an integer number N, chooses the first
N higher-priority standbys and makes transaction commit when their WAL
records are received. For example <literal>FIRST 3 (s1, s2, s3, s4)</>
makes transaction commits wait until their WAL records are received by
the three high-priority standbys chosen from standby servers s1, s2,
s3 and s4.

Will fix.

+        <literal>ANY</> means to control all of standby servers with
+        same priority. The master sever will wait for receipt from
+        at least <replaceable class="parameter">num_sync</replaceable>
+        standbys, which is quorum commit in the literature. The all of
+        listed standbys are considered as candidate of quorum commit.
+        For example, a setting of <literal> ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving receipts from at least
+        any three standbys of four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.
Similarly, something like that...
The keyword ANY, coupled with an integer number N, chooses N standbys
in a set of standbys with the same, lowest, priority and makes
transaction commit when WAL records are received by those N standbys. For
example ANY 3(s1, s2, s3, s4) makes transaction commits wait until WAL
records have been received from 3 servers in the set s1, s2, s3 and
s4.

Will fix.

It could be good also to mention that no keyword specified means ANY,
which is incompatible with 9.6. The docs also miss the fact that if a
simple list of servers is given, without parentheses or keywords,
this is equivalent to FIRST 1.

Right. I will add that to the documentation.

-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'First 2 (s1, s2, s3)'
Nit here. It may be a good idea to just use upper-case characters in
the docs, or just lower-case for consistency, but not mix both.
Usually GUCs use lower-case characters.

Agree. Will fix.

+ when standby is considered as a condidate of quorum commit.</entry>
s/condidate/candidate/

Will fix.

-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
Hm... Is that actually a good idea? Now "NODE" and "node" are two
different things for application_name, but with this patch both would
have the same meaning. I am starting to think that we could just use
the lower-case characters for the keywords any/first. Is this -i
switch a problem for elements in standby_list?

The standby name string itself is not actually changed; it is only the
parser that does not distinguish between "NODE" and "node".
The values used for checking application_name will still work fine.
If we want to use "first" or "any" as a standby name, it must be
double-quoted.

+ * Calculate the 'pos' newest Write, Flush and Apply positions among
sync standbys.
I don't understand this comment.

+   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+       got_recptr = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
+                                            &applyPtr, &am_sync);
+   else /* SYNC_REP_QUORUM */
+       got_recptr = SyncRepGetNNewestSyncRecPtr(&writePtr, &flushPtr,
+                                             &applyPtr,
SyncRepConfig->num_sync,
+                                             &am_sync);
Those could be grouped together, there is no need to have pos as an argument.

Will fix.

+   /* In quroum method, all sync standby priorities are always 1 */
+   if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+       priority = 1;
This is dead code, SyncRepGetSyncStandbysPriority is not called for
QUORUM.

Well, this code is in SyncRepGetStandbyPriority, which is called by
SyncRepInitConfig.
SyncRepGetStandbyPriority can be called regardless of the
synchronization method.

You may want to add an assert in
SyncRepGetSyncStandbysPriority and SyncRepGetSyncStandbysQuorum to be
sure that they are getting called for the correct method.
+   /* Sort each array in descending order to get 'pos' newest element */
+   qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+   qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+   qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
There is no need to reorder things again and to use arrays, you can
choose the newest LSNs when scanning the WalSnd entries.

I considered that, but it depends on performance.
The current patch avoids O(N*M).

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#24

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Masahiko Sawada (#23)

Re: Quorum commit for multiple synchronous replication.

On Tue, Oct 11, 2016 at 4:18 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

You may want to add an assert in
SyncRepGetSyncStandbysPriority and SyncRepGetSyncStandbysQuorum to be
sure that they are getting called for the correct method.
+   /* Sort each array in descending order to get 'pos' newest element */
+   qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+   qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+   qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
There is no need to reorder things again and to use arrays, you can
choose the newest LSNs when scanning the WalSnd entries.

I considered that, but it depends on performance.
The current patch avoids O(N*M).

I am surprised by this statement. You would have O(N) by just
discarding the oldest LSN values while holding the spinlock of each
WAL sender. What SyncRepGetNNewestSyncRecPtr looks for are just the
newest apply, write and flush positions.
--
Michael


#25

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Michael Paquier (#24)

Re: Quorum commit for multiple synchronous replication.

On Tue, Oct 11, 2016 at 6:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Oct 11, 2016 at 4:18 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
You may want to add an assert in
SyncRepGetSyncStandbysPriority and SyncRepGetSyncStandbysQuorum to be
sure that they are getting called for the correct method.
+   /* Sort each array in descending order to get 'pos' newest element */
+   qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+   qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+   qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
There is no need to reorder things again and to use arrays, you can
choose the newest LSNs when scanning the WalSnd entries.
I considered that, but it depends on performance.
The current patch avoids O(N*M).
I am surprised by this statement. You would have O(N) by just
discarding the oldest LSN values while holding the spinlock of each
WAL sender. What SyncRepGetNNewestSyncRecPtr looks for are just the
newest apply, write and flush positions.

Bah, stupid. I just missed the point with 'pos'. Now I see the trick.
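
For readers following along, the point about 'pos' is that quorum commit needs the pos-th newest position, not simply the newest one. The sketch below shows one way to find it in a single scan without sorting, by keeping the k newest values seen so far. Its worst case is O(N*k), which may be the O(N*M) cost mentioned earlier; that reading is an assumption. `Lsn`, `MAX_K` and `kth_newest_single_pass` are hypothetical names, and nothing here is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr in this sketch */

#define MAX_K 16                /* arbitrary cap chosen for this sketch */

/*
 * Return the k-th newest (k-th largest) value of acked[0..n-1] in one pass.
 * top[] holds the newest values seen so far, in descending order.
 */
static Lsn
kth_newest_single_pass(const Lsn *acked, int n, int k)
{
    Lsn top[MAX_K];
    int filled = 0;

    assert(k >= 1 && k <= n && k <= MAX_K);

    for (int i = 0; i < n; i++)
    {
        int j;

        /* Skip values that cannot be among the k newest seen so far */
        if (filled == k && acked[i] <= top[k - 1])
            continue;

        /* Start at the free slot, or at the last slot if the buffer is full */
        j = (filled < k) ? filled : k - 1;

        /* Shift smaller entries down to keep top[] in descending order */
        while (j > 0 && top[j - 1] < acked[i])
        {
            top[j] = top[j - 1];
            j--;
        }
        top[j] = acked[i];
        if (filled < k)
            filled++;
    }
    return top[k - 1];
}
```

Sorting the whole array, as the patch does with qsort, costs O(N log N) regardless of k, so which approach is cheaper depends on how large k is relative to the number of standbys.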
--
Michael


#26

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#23)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

Attached latest patch.
Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_quorum_commit_v4.patch (binary/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e826c19..c2a76de 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3028,42 +3028,75 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
-        that are both currently connected and streaming data in real-time
-        (as shown by a state of <literal>streaming</literal> in the
+        in this list, and
+        that are both currently connected and streaming data in
+        real-time(as shown by a state of <literal>streaming</> in the
         <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method of
+        that how master server controls the standby servers.
         </para>
         <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        The keyword <literal>FIRST</>, coupled with an interger
+        number N higher-priority standbys and makes transaction commit
+        when their WAL records are received.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+        <para>
+        The keyword <literal>ANY</>, coupeld with an interger number N,
+        chooses N standbys in a set of standbys with the same, lowest,
+        priority and makes transaction commit when WAL records are received
+        those N standbys. For example, a setting of <literal> ANY 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until receiving receipts from at least
+        any three standbys of four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.
+        </para>
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive word
+        and the standby name having these words are must be double-quoted.
+        </para>
+        <para>
+        The third syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
-       </para>
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
+        </para>
+        <note>
+         <para>
+         The keyword <literal>ANY</> is omissible, but note that there is
+         not compatibility between <productname>PostgreSQL</> version 10 and
+         9.6 or before. For example, <literal>1 (s1, s2)</> is the same as the
+         configuration with <literal>FIRST</> and <replaceable class="parameter">
+         num_sync</replaceable> equal to 1 in <productname>PostgreSQL</> 9.6
+         or before.  On the other hand, It's the same as the configuration with
+         <literal>ANY</> and <replaceable class="parameter">num_sync</> equal to
+         1 in <productname>PostgreSQL</> 10 or later.
+        </para>
+       </note>
        <para>
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..85a969d 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1161,6 +1161,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     <literal>s2</> fails. <literal>s4</> is an asynchronous standby since
     its name is not in the list.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standbys is:
+<programlisting>
+synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+    <literal>s2</> and <literal>s3</> wil be considered as synchronous standby
+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..361dd2d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1389,7 +1389,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     <row>
      <entry><structfield>sync_state</></entry>
      <entry><type>text</></entry>
-     <entry>Synchronous state of this standby server</entry>
+     <entry>Synchronous state of this standby server. <literal>quorum</>
+      when standby is considered as a condidate of quorum commit.</entry>
     </row>
    </tbody>
    </tgroup>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..b2c4ad3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -77,20 +77,21 @@ char	   *SyncRepStandbyNames;
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
-static bool announce_next_takeover = true;
+SyncRepConfigData *SyncRepConfig = NULL;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
+static bool announce_next_takeover = true;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +387,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +414,10 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using the configured method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +435,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +471,21 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In the priority method, we need the oldest of these positions among
+ * the sync standbys. In the quorum method, we need the N-th newest of
+ * these positions, where N is SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -507,29 +511,74 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 		return false;
 	}
 
-	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
-	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+	{
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
+	}
+	else /* SYNC_REP_QUORUM */
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i] = walsnd->flush;
+			apply_array[i] = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th newest Write, Flush, Apply positions
+		 * specified by SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
 	}
 
 	list_free(sync_standbys);
@@ -537,17 +586,90 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys according to the synchronous method,
+ * or NIL if no sync standby is connected. The caller must hold SyncRepLock.
  *
- * If there are multiple standbys with the same priority,
- * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else /* SYNC_REP_QUORUM */
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+}
+
+/*
+ * Return the list of sync standbys using quorum method, or
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 List *
-SyncRepGetSyncStandbys(bool *am_sync)
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
+ * the first one found is selected preferentially.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +682,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -749,6 +865,10 @@ SyncRepGetStandbyPriority(void)
 		standby_name += strlen(standby_name) + 1;
 	}
 
+	/* In quroum method, all sync standby priorities are always 1 */
+	if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		priority = 1;
+
 	return (found ? priority : 0);
 }
 
@@ -892,6 +1012,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,14 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
+{first_ident}	{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0f3ced2..7eafe42 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2865,7 +2865,8 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..1b675ee 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
@@ -68,6 +75,8 @@ extern void SyncRepReleaseWaiters(void);
 
 /* called by wal sender and user backend */
 extern List *SyncRepGetSyncStandbys(bool *am_sync);
+extern List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+extern List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..63cd88c 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 10;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'First 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'First 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'First 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'First 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,25 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check that all standbys listed as voters have the same priority
+# when synchronous_standby_names uses the quorum method.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'Any 2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that the setting of 'Any 2(*)' chooses all standbys as
+# voters.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as condidates for quorum commit',
+'Any 2(*)');

#27

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#26)

Re: Quorum commit for multiple synchronous replication.

On Mon, Oct 17, 2016 at 4:00 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached latest patch.
Please review it.

Okay, so let's move on with this patch...

+         <para>
+         The keyword <literal>ANY</> is omissible, but note that there is
+         not compatibility between <productname>PostgreSQL</> version 10 and
+         9.6 or before. For example, <literal>1 (s1, s2)</> is the same as the
+         configuration with <literal>FIRST</> and <replaceable class="parameter">
+         num_sync</replaceable> equal to 1 in <productname>PostgreSQL</> 9.6
+         or before.  On the other hand, It's the same as the configuration with
+         <literal>ANY</> and <replaceable class="parameter">num_sync</> equal to
+         1 in <productname>PostgreSQL</> 10 or later.
+        </para>
This paragraph could be reworded:
If neither FIRST nor ANY is specified, this parameter behaves as ANY. Note
that this grammar is incompatible with PostgreSQL 9.6, where specifying no
keyword is equivalent to specifying FIRST.
In short, there is no real need to specify num_sync, as that behavior
has not changed, and it is not necessary to mention pre-9.6 versions,
since the multi-sync grammar was added in 9.6.

- Specifying more than one standby name can allow very high availability.
Why removing this sentence?

+ The keyword <literal>ANY</>, coupeld with an interger number N,
s/coupeld/coupled/ and s/interger/integer/, for a double hit in one
line, still...

+        The keyword <literal>ANY</>, coupeld with an interger number N,
+        chooses N standbys in a set of standbys with the same, lowest,
+        priority and makes transaction commit when WAL records are received
+        those N standbys.
This could be reworded more simply, for example: The keyword ANY,
coupled with an integer number N, makes transaction commits wait until
WAL records are received from N connected standbys among those defined
in the list of synchronous_standby_names.
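
The rule in this wording comes down to finding the N-th newest position reported by the connected standbys. A minimal sketch in C, assuming a plain uint64_t stands in for XLogRecPtr; the names Lsn, cmp_lsn_desc and quorum_lsn are hypothetical and are not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's XLogRecPtr; an assumption made for this sketch. */
typedef uint64_t Lsn;

/* qsort comparator that orders LSNs newest (largest) first. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
	Lsn		lsn1 = *(const Lsn *) a;
	Lsn		lsn2 = *(const Lsn *) b;

	if (lsn1 > lsn2)
		return -1;
	if (lsn1 < lsn2)
		return 1;
	return 0;
}

/*
 * Return the newest LSN that at least num_sync of the nstandbys reported
 * positions have reached.  Sorts lsns[] in place.
 */
Lsn
quorum_lsn(Lsn *lsns, size_t nstandbys, size_t num_sync)
{
	assert(lsns != NULL);
	assert(num_sync >= 1 && num_sync <= nstandbys);

	qsort(lsns, nstandbys, sizeof(Lsn), cmp_lsn_desc);
	return lsns[num_sync - 1];
}
```

With reported positions 300, 100, 400 and 200 and N = 3, the sorted order is 400, 300, 200, 100, so waiters up to position 200 could be released. Sorting is one simple way to get this; a selection algorithm would also work for large standby counts.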

+ <literal>s2</> and <literal>s3</> wil be considered as synchronous standby
s/wil/will/

+ when standby is considered as a condidate of quorum commit.</entry>
s/condidate/candidate/

[... stopping here ...] Please be more careful with the documentation
and comment grammar. There are other things in the patch..

A bunch of comments at the top of syncrep.c need to be updated.

+extern List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+extern List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
Those two should be static routines in syncrep.c, let's keep the whole
logic about quorum and higher-priority definition only there and not
bother the callers of them about that.

+   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+       return SyncRepGetSyncStandbysPriority(am_sync);
+   else /* SYNC_REP_QUORUM */
+       return SyncRepGetSyncStandbysQuorum(am_sync);
Both routines share the same logic to detect if a WAL sender can be
selected as a candidate for sync evaluation or not, still per the
selection they do I agree that it is better to keep them as separate.
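
For illustration, the shared eligibility checks could be read as a single predicate. This is only a sketch under assumptions: SenderInfo, SenderState and is_sync_candidate are hypothetical names, and the struct is a simplified stand-in for WalSnd, with 0 standing in for an invalid LSN:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the walsender states; hypothetical for this sketch. */
typedef enum
{
	SENDER_STARTUP,
	SENDER_CATCHUP,
	SENDER_STREAMING
} SenderState;

/* Simplified stand-in for the fields of WalSnd that the checks look at. */
typedef struct
{
	int			pid;			/* 0 means the slot is unused */
	SenderState state;
	int			sync_priority;	/* 0 means asynchronous */
	uint64_t	flush;			/* 0 stands in for an invalid LSN */
} SenderInfo;

/*
 * True if the sender is active, streaming, listed as synchronous and has
 * reported a valid flush position.
 */
bool
is_sync_candidate(const SenderInfo *s)
{
	assert(s != NULL);

	return s->pid != 0 &&
		s->state == SENDER_STREAMING &&
		s->sync_priority != 0 &&
		s->flush != 0;
}
```

Both selection routines could call such a helper and then differ only in how they pick from the candidates.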

+   /* In quroum method, all sync standby priorities are always 1 */
+   if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+       priority = 1;
Honestly I don't understand why you are enforcing that. Priority can
be important for users willing to switch from ANY to FIRST to have a
look immediately at what are the standbys that would become sync or
potential.

            else if (list_member_int(sync_standbys, i))
-               values[7] = CStringGetTextDatum("sync");
+               values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+                   CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
The comment at the top of this code block needs to be refreshed.

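The representation given to the user in pg_stat_replication is not
enough IMO. For example, imagine a cluster with 4 standbys:
=# select application_name, sync_priority, sync_state from pg_stat_replication ;
 application_name | sync_priority | sync_state
------------------+---------------+------------
 node_5433        |             0 | async
 node_5434        |             0 | async
 node_5435        |             0 | async
 node_5436        |             0 | async
(4 rows)

If FIRST N is used, it is easy for the user to understand which nodes
are in sync:
=# alter system set synchronous_standby_names = 'FIRST 2 (node_5433, node_5434, node_5435)';
ALTER SYSTEM
=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
=# select application_name, sync_priority, sync_state from pg_stat_replication ;
 application_name | sync_priority | sync_state
------------------+---------------+------------
 node_5433        |             1 | sync
 node_5434        |             2 | sync
 node_5435        |             3 | potential
 node_5436        |             0 | async
(4 rows)

In this case it is easy to understand that two nodes are required to be in sync.

When using ANY similarly for three nodes, here is what
pg_stat_replication tells:
=# select application_name, sync_priority, sync_state from pg_stat_replication ;
 application_name | sync_priority | sync_state
------------------+---------------+------------
 node_5433        |             1 | quorum
 node_5434        |             1 | quorum
 node_5435        |             1 | quorum
 node_5436        |             0 | async
(4 rows)
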
It is not possible to guess how many standbys this needs to wait
for. One idea would be to mark the sync_state not as "quorum", but
"quorum-N", or just add a new column to indicate how many in the set
need to give a commit confirmation.
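
To make the "quorum-N" idea concrete, here is a rough sketch of how such a label could be built. The function name format_sync_state and its parameters are hypothetical; nothing like it exists in the patch, which reports a plain "quorum":

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write the sync_state label into buf: "quorum-N" when the quorum method is
 * in use, "sync" otherwise.  Returns the value returned by snprintf, that is,
 * the length of the full label.
 */
int
format_sync_state(char *buf, size_t buflen, int is_quorum, int num_sync)
{
	assert(buf != NULL && buflen > 0);

	if (is_quorum)
		return snprintf(buf, buflen, "quorum-%d", num_sync);
	return snprintf(buf, buflen, "sync");
}
```

With num_sync equal to 2 under the quorum method this would yield "quorum-2", so a reader of pg_stat_replication could see the required number of confirmations without looking up the GUC. A separate column would carry the same information without overloading the text value.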

The patch is going in the right direction in my opinion.
--
Michael


#28

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Michael Paquier (#27)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Tue, Oct 25, 2016 at 10:35 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Oct 17, 2016 at 4:00 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached latest patch.
Please review it.

Okay, so let's move on with this patch...

Thank you for reviewing this patch.

+         <para>
+         The keyword <literal>ANY</> is omissible, but note that there is
+         not compatibility between <productname>PostgreSQL</> version 10 and
+         9.6 or before. For example, <literal>1 (s1, s2)</> is the same as the
+         configuration with <literal>FIRST</> and <replaceable class="parameter">
+         num_sync</replaceable> equal to 1 in <productname>PostgreSQL</> 9.6
+         or before.  On the other hand, It's the same as the configuration with
+         <literal>ANY</> and <replaceable class="parameter">num_sync</> equal to
+         1 in <productname>PostgreSQL</> 10 or later.
+        </para>
This paragraph could be reworded:
If neither FIRST nor ANY is specified, this parameter behaves as ANY. Note
that this grammar is incompatible with PostgreSQL 9.6, where specifying no
keyword is equivalent to specifying FIRST.
In short, there is no real need to specify num_sync, as that behavior
has not changed, and it is not necessary to mention pre-9.6 versions,
since the multi-sync grammar was added in 9.6.

Fixed.

- Specifying more than one standby name can allow very high availability.
Why removing this sentence?

+ The keyword <literal>ANY</>, coupeld with an interger number N,
s/coupeld/coupled/ and s/interger/integer/, for a double hit in one
line, still...
+        The keyword <literal>ANY</>, coupeld with an interger number N,
+        chooses N standbys in a set of standbys with the same, lowest,
+        priority and makes transaction commit when WAL records are received
+        those N standbys.
This could be reworded more simply, for example: The keyword ANY,
coupled with an integer number N, makes transaction commits wait until
WAL records are received from N connected standbys among those defined
in the list of synchronous_standby_names.
+ <literal>s2</> and <literal>s3</> wil be considered as synchronous standby
s/wil/will/

+ when standby is considered as a condidate of quorum commit.</entry>
s/condidate/candidate/

[... stopping here ...] Please be more careful with the documentation
and comment grammar. There are other things in the patch..

I fixed as many typos as I could find.

A bunch of comments at the top of syncrep.c need to be updated.

+extern List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+extern List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
Those two should be static routines in syncrep.c, let's keep the whole
logic about quorum and higher-priority definition only there and not
bother the callers of them about that.

Fixed.

+   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+       return SyncRepGetSyncStandbysPriority(am_sync);
+   else /* SYNC_REP_QUORUM */
+       return SyncRepGetSyncStandbysQuorum(am_sync);
Both routines share the same logic to detect if a WAL sender can be
selected as a candidate for sync evaluation or not, still per the
selection they do I agree that it is better to keep them as separate.

+   /* In quroum method, all sync standby priorities are always 1 */
+   if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+       priority = 1;
Honestly I don't understand why you are enforcing that. Priority can
be important for users willing to switch from ANY to FIRST to have a
look immediately at what are the standbys that would become sync or
potential.

I thought that since all standbys appearing in the s_s_names list are
treated equally in the quorum method, these standbys should have the same
priority.
If these standbys had different sync_priority values, it would look as if
the master server replicates to the standby servers based on priority.
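
The two behaviours under discussion can be sketched side by side. This is a simplified illustration, not the patch's code: standby_priority is a hypothetical name, and exact matching with strcmp is an assumption made to keep the example short:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the sync_priority for the standby called name.
 *
 * Priority method: the 1-based position of the first matching entry.
 * Quorum method:   1 for every listed standby, as the patch enforces.
 * Either method:   0 if the standby is not listed (asynchronous).
 *
 * An entry of "*" matches any name.
 */
int
standby_priority(const char *const *members, int nmembers,
				 const char *name, bool quorum)
{
	int			i;

	assert(members != NULL && name != NULL);

	for (i = 0; i < nmembers; i++)
	{
		if (strcmp(members[i], name) == 0 || strcmp(members[i], "*") == 0)
			return quorum ? 1 : i + 1;
	}
	return 0;
}
```

For the list (s1, s2, s3), s2 gets priority 2 under FIRST and 1 under ANY, while an unlisted s4 gets 0 in both cases. Dropping the quorum branch would give the behaviour Michael asks about, where the list position stays visible under ANY.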

else if (list_member_int(sync_standbys, i))
-               values[7] = CStringGetTextDatum("sync");
+               values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+                   CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
The comment at the top of this code block needs to be refreshed.

Fixed.

The representation given to the user in pg_stat_replication is not
enough IMO. For example, imagine a cluster with 4 standbys:
=# select application_name, sync_priority, sync_state from pg_stat_replication ;
 application_name | sync_priority | sync_state
------------------+---------------+------------
 node_5433        |             0 | async
 node_5434        |             0 | async
 node_5435        |             0 | async
 node_5436        |             0 | async
(4 rows)

If FIRST N is used, it is easy for the user to understand which nodes
are in sync:
=# alter system set synchronous_standby_names = 'FIRST 2 (node_5433, node_5434, node_5435)';
ALTER SYSTEM
=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
=# select application_name, sync_priority, sync_state from pg_stat_replication ;
 application_name | sync_priority | sync_state
------------------+---------------+------------
 node_5433        |             1 | sync
 node_5434        |             2 | sync
 node_5435        |             3 | potential
 node_5436        |             0 | async
(4 rows)

In this case it is easy to understand that two nodes are required to be in sync.

When using ANY similarly for three nodes, here is what
pg_stat_replication tells:
=# select application_name, sync_priority, sync_state from pg_stat_replication ;
 application_name | sync_priority | sync_state
------------------+---------------+------------
 node_5433        |             1 | quorum
 node_5434        |             1 | quorum
 node_5435        |             1 | quorum
 node_5436        |             0 | async
(4 rows)

It is not possible to guess how many standbys this needs to wait
for. One idea would be to mark the sync_state not as "quorum", but
"quorum-N", or just add a new column to indicate how many in the set
need to give a commit confirmation.

As Simon suggested before, we could support another feature that
allows the client to control the quorum number.
With that feature in mind, I thought it would be better to hold and
control that information as a GUC parameter.
Thoughts?

Attached latest v5 patch.
Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_quorum_commit_v5.patch (text/x-diff; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..8078cda 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3028,42 +3028,76 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
+        in this list, and
         that are both currently connected and streaming data in real-time
         (as shown by a state of <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method of
+        that how master server controls the standby servers.
         </para>
         <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        The keyword <literal>FIRST</>, coupled with an integer
+        number N higher-priority standbys and makes transaction commit
+        when their WAL records are received.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+        <para>
+        The keyword <literal>ANY</>, coupled with an integer number N,
+        makes transaction commits wait until WAL records are received
+        from N connected standbys among those defined in the list of
+        <varname>synchronous_standby_names</>. For example, a setting
+        of <literal>ANY 3 (s1, s2, s3, s4)</> makes transaction commits
+        wait until receiving receipts from at least any three standbys
+        of four listed servers <literal>s1</>, <literal>s2</>, <literal>s3</>,
+        <literal>s4</>.
+        </para>
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive word
+        and the standby name having these words are must be double-quoted.
+        </para>
+        <para>
+        The third syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
-       </para>
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
+        </para>
+        <note>
+         <para>
+         If <literal>FIRST</> or <literal>ANY</> are not specified, this parameter
+         behaves as <literal>ANY</>. Note that this grammar is incompatible with
+         <productname>PostgresSQL</> 9.6, where no keyword specified is equivalent
+         as if <literal>FIRST</> was specified. In short, there is no real need to
+         specify <replaceable class="parameter">num_sync</replaceable> as this
+         behavior does not have changed, as well as it is not necessary to mention
+         pre-9.6 versions are the multi-sync grammar has been added in 9.6.
+        </para>
+       </note>
        <para>
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 5bedaf2..7a0a22a 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1161,6 +1161,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     <literal>s2</> fails. <literal>s4</> is an asynchronous standby since
     its name is not in the list.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standby is:
+<programlisting>
+ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+    <literal>s2</> and <literal>s3</> will be considered as synchronous standby
+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..b44dadb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1389,7 +1389,10 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     <row>
      <entry><structfield>sync_state</></entry>
      <entry><type>text</></entry>
-     <entry>Synchronous state of this standby server</entry>
+     <entry>Synchronous state of this standby server. <literal>quorum-N</>
+     , where N is the number of synchronous standbys that transactions
+     need to wait for replies from, when standby is considered as a
+     candidate of quorum commit.</entry>
     </row>
    </tbody>
    </tgroup>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..74093cd 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,19 @@
  *
  * In 9.5 or before only a single standby could be considered as
  * synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.
  *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
@@ -73,24 +76,27 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
 static bool announce_next_takeover = true;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +392,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +419,10 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +440,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +476,45 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys using according to synchronous method,
+ * or NIL if no sync standby is connected. The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else /* SYNC_REP_QUORUM */
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, we need the oldest these positions among sync
+ * standbys. In quorum method, we need the newest these positions
+ * specified by SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -507,29 +540,74 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 		return false;
 	}
 
-	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
-	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
+	}
+	else /* SYNC_REP_QUORUM */
+	{
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i]= walsnd->flush;
+			apply_array[i] = walsnd->flush;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th newest Write, Flush, Apply positions
+		 * specified by SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
 	}
 
 	list_free(sync_standbys);
@@ -537,17 +615,66 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
  *
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
  * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +687,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -749,6 +870,10 @@ SyncRepGetStandbyPriority(void)
 		standby_name += strlen(standby_name) + 1;
 	}
 
+	/* In quorum method, all sync standby priorities are always 1 */
+	if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		priority = 1;
+
 	return (found ? priority : 0);
 }
 
@@ -892,6 +1017,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,14 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
+{first_ident}	{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index bc5e508..ecfbd78 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2860,12 +2860,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 
 			/*
 			 * More easily understood version of standby state. This is purely
-			 * informational, not different from priority.
+			 * informational. In quorum method, we add the number to indicate
+			 * how many in the set need to give a commit confirmation.
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..8dd74a3 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..c6af72f 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 10;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'First 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'First 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'First 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'First 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,25 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check that the state of standbys listed as a voter are having
+# same priority when synchronous_standby_names uses quorum method.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'Any 2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that set setting of 'Any 2(*)' chooses all standbys as
+# voter.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as candidates for quorum commit',
+'Any 2(*)');

#29

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#28)

Re: Quorum commit for multiple synchronous replication.

On Tue, Nov 8, 2016 at 12:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Oct 25, 2016 at 10:35 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
+   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+       return SyncRepGetSyncStandbysPriority(am_sync);
+   else /* SYNC_REP_QUORUM */
+       return SyncRepGetSyncStandbysQuorum(am_sync);
Both routines share the same logic to detect whether a WAL sender can be
selected as a candidate for sync evaluation; still, given the different
selection each one performs, I agree that it is better to keep them separate.
+   /* In quroum method, all sync standby priorities are always 1 */
+   if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+       priority = 1;
Honestly I don't understand why you are enforcing that. Priority can
be important for users who want to switch from ANY to FIRST and see
immediately which standbys would become sync or potential.
I thought that since all standbys appearing in the s_s_names list are
treated equally in the quorum method, these standbys should have the same
priority.
If these standbys had different sync_priority values, it would look as if
the master server replicated to the standby servers based on priority.

No actually, because we know that they are a quorum set, and that they
work in the same set. The concept of priorities has no real meaning
for quorum as there is no ordering of the elements. Another, perhaps
cleaner idea may be to mark the field as NULL actually.

It is not possible to tell from this output how many standbys a commit
needs to wait for. One idea would be to mark the sync_state not as "quorum", but
"quorum-N", or just add a new column to indicate how many in the set
need to give a commit confirmation.

As Simon suggested before, we could support another feature that
allows the client to control the quorum number.
With that feature in mind, I think it is better to hold and
control that number as a GUC parameter.
Thoughts?

Similarly, would that be a SIGHUP parameter? Why not. Perhaps my worry
is not that legitimate: users could just look at s_s_names to
see how many in the set a commit needs to wait for.

+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive word
+        and the standby name having these words are must be double-quoted.
+        </para>
s/word/words/.

+        <literal>FIRST</> and <literal>ANY</> specify the method of
+        that how master server controls the standby servers.
A little bit hard to understand, I would suggest:
FIRST and ANY specify the method used by the master to control the
standby servers.

+        The keyword <literal>FIRST</>, coupled with an integer
+        number N higher-priority standbys and makes transaction commit
+        when their WAL records are received.
This is unclear to me. Here is a correction:
The keyword FIRST, coupled with an integer N, makes transaction commit
wait until WAL records are received from the N standbys with higher
priority number.

+        <varname>synchronous_standby_names</>. For example, a setting
+        of <literal>ANY 3 (s1, s2, s3, s4)</> makes transaction commits
+        wait until receiving receipts from at least any three standbys
+        of four listed servers <literal>s1</>, <literal>s2</>, <literal>s3</>,
This could just mention WAL records instead of "receipts".

Instead of saying "an integer number N", we could use <literal>num_sync</>.

+         If <literal>FIRST</> or <literal>ANY</> are not specified, this parameter
+         behaves as <literal>ANY</>. Note that this grammar is incompatible with
+         <productname>PostgresSQL</> 9.6, where no keyword specified is equivalent
+         as if <literal>FIRST</> was specified. In short, there is no real need to
+         specify <replaceable class="parameter">num_sync</replaceable> as this
+         behavior does not have changed, as well as it is not necessary to mention
+         pre-9.6 versions are the multi-sync grammar has been added in 9.6.
This paragraph could be reworked, say:
if FIRST or ANY are not specified this parameter behaves as if ANY is
used. Note that this grammar is incompatible with PostgreSQL 9.6 which
is the first version supporting multiple standbys with synchronous
replication, where no such keyword FIRST or ANY can be used. Note that
the grammar behaves as if FIRST is used, which is incompatible with
the post-9.6 behavior.

+     <entry>Synchronous state of this standby server. <literal>quorum-N</>
+     , where N is the number of synchronous standbys that transactions
+     need to wait for replies from, when standby is considered as a
+     candidate of quorum commit.</entry>
Nitpicking: I think that the comma goes to the previous line if it is
the first character of a line.

+   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+       return SyncRepGetSyncStandbysPriority(am_sync);
+   else /* SYNC_REP_QUORUM */
+       return SyncRepGetSyncStandbysQuorum(am_sync)
Or that?
if (PRIORITY)
    return StandbysPriority();
else if (QUORUM)
    return StandbysQuorum();
else
    elog(ERROR, "Boom");

+ * In priority method, we need the oldest these positions among sync
+ * standbys. In quorum method, we need the newest these positions
+ * specified by SyncRepConfig->num_sync.
The last sentence is grammatically incorrect, and it would be more correct
to specify the Nth LSN position, so that k standbys can be selected from
a set of n.
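
The "Nth LSN position" idea can be sketched as follows. This is a minimal,
standalone illustration, not the patch's implementation: XLogRecPtr is
approximated by uint64_t, and cmp_lsn_desc() and nth_newest_lsn() are
hypothetical names. Sorting the reported positions newest-first and taking the
num_sync-th entry gives the position that at least num_sync standbys have
already reached.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */

/* qsort comparator: order LSNs from newest (largest) to oldest. */
int
cmp_lsn_desc(const void *a, const void *b)
{
	XLogRecPtr	lsn1 = *(const XLogRecPtr *) a;
	XLogRecPtr	lsn2 = *(const XLogRecPtr *) b;

	if (lsn1 > lsn2)
		return -1;
	if (lsn1 < lsn2)
		return 1;
	return 0;
}

/*
 * Return the num_sync-th newest LSN among the n reported positions.
 * At least num_sync standbys have reached the returned position.
 * The array is sorted in place.
 */
XLogRecPtr
nth_newest_lsn(XLogRecPtr *lsns, size_t n, size_t num_sync)
{
	assert(num_sync >= 1 && num_sync <= n);

	qsort(lsns, n, sizeof(XLogRecPtr), cmp_lsn_desc);
	return lsns[num_sync - 1];
}
```

For example, with flush positions {100, 400, 250, 300} and num_sync = 3, the
result is 250: three of the four standbys have flushed at least that far.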

+           SpinLockAcquire(&walsnd->mutex);
+           write_array[i] = walsnd->write;
+           flush_array[i]= walsnd->flush;
+           apply_array[i] = walsnd->flush;
+           SpinLockRelease(&walsnd->mutex);
A nit: add a space to the left of the second = character. And you
need to save the apply position of the WAL sender, not the flush
position, in the array that is going to be ordered.

            /*
             * More easily understood version of standby state. This is purely
-            * informational, not different from priority.
+            * informational. In quorum method, we add the number to indicate
+            * how many in the set need to give a commit confirmation.
             */
            if (priority == 0)
                values[7] = CStringGetTextDatum("async");
            else if (list_member_int(sync_standbys, i))
-               values[7] = CStringGetTextDatum("sync");
+               values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+                   CStringGetTextDatum("sync") : CStringGetTextDatum("quorum")
This code block and its explanation comments tell two different
stories. The comment is saying that something like "quorum-N" is used
but the code always prints "quorum".
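
One way to make the code match its comment is sketched below. This is a
standalone illustration under assumptions: SyncMethod and format_sync_state()
are hypothetical names, and the "quorum-N" format is the one floated in this
thread, not a settled design. The "potential" branch mirrors the existing
sync_state value reported for listed standbys that are not currently
synchronous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the patch's sync_method values. */
typedef enum
{
	METHOD_PRIORITY,
	METHOD_QUORUM
} SyncMethod;

/*
 * Write a human-readable sync state into buf and return buf.
 * In quorum method the number of required confirmations is appended,
 * e.g. "quorum-2".
 */
const char *
format_sync_state(char *buf, size_t buflen, int priority, bool is_sync,
				  SyncMethod method, int num_sync)
{
	assert(buf != NULL && buflen > 0);

	if (priority == 0)
		snprintf(buf, buflen, "async");
	else if (!is_sync)
		snprintf(buf, buflen, "potential");
	else if (method == METHOD_PRIORITY)
		snprintf(buf, buflen, "sync");
	else
		snprintf(buf, buflen, "quorum-%d", num_sync);

	return buf;
}
```

The alternative raised earlier in the thread, a separate column carrying the
number, would keep sync_state as a plain "quorum" and avoid parsing the suffix.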

It may be a good idea in the test to check that when no keyword is
specified the group of standbys is in quorum mode.

The code looks in good shape, I am still willing to run more advanced
tests manually.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Michael Paquier (#29)

Re: Quorum commit for multiple synchronous replication.

On Tue, Nov 8, 2016 at 10:12 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Nov 8, 2016 at 12:25 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Oct 25, 2016 at 10:35 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
+   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+       return SyncRepGetSyncStandbysPriority(am_sync);
+   else /* SYNC_REP_QUORUM */
+       return SyncRepGetSyncStandbysQuorum(am_sync);
Both routines share the same logic to detect if a WAL sender can be
selected as a candidate for sync evaluation or not, still per the
selection they do I agree that it is better to keep them as separate.
+   /* In quroum method, all sync standby priorities are always 1 */
+   if (found && SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+       priority = 1;
Honestly I don't understand why you are enforcing that. Priority can
be important for users willing to switch from ANY to FIRST to have a
look immediately at what are the standbys that would become sync or
potential.
I thought that since all standbys appearing in s_s_names list are
treated equally in quorum method, these standbys should have same
priority.
If these standby have different sync_priority, it looks like that
master server replicates to standby server based on priority.
No actually, because we know that they are a quorum set, and that they
work in the same set. The concept of priorities has no real meaning
for quorum as there is no ordering of the elements. Another, perhaps
cleaner idea may be to mark the field as NULL actually.

We know that, but I'm concerned it might confuse the user.
If these priorities are the same, it clearly implies that all of
the standbys listed in s_s_names are handled equally.

It is not possible to guess from how many standbys this needs to wait
for. One idea would be to mark the sync_state not as "quorum", but
"quorum-N", or just add a new column to indicate how many in the set
need to give a commit confirmation.

As Simon suggested before, we could support another feature that
allows the client to control the quorum number.
Considering adding that feature, I thought it's better to have and
control that information as a GUC parameter.
Thoughts?

Similarly that would be a SIGHUP parameter? Why not. Perhaps my worry
is not that legitimate; users could just look at s_s_names to
guess how many in the set a commit needs to wait for.

It would be PGC_USERSET, similar to synchronous_commit. The user can
specify it at the statement level.

+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive word
+        and the standby name having these words are must be double-quoted.
+        </para>
s/word/words/.

+        <literal>FIRST</> and <literal>ANY</> specify the method of
+        that how master server controls the standby servers.
A little bit hard to understand, I would suggest:
FIRST and ANY specify the method used by the master to control the
standby servers.

+        The keyword <literal>FIRST</>, coupled with an integer
+        number N higher-priority standbys and makes transaction commit
+        when their WAL records are received.
This is unclear to me. Here is a correction:
The keyword FIRST, coupled with an integer N, makes transaction commit
wait until WAL records are received from the N standbys with higher
priority number.

+        <varname>synchronous_standby_names</>. For example, a setting
+        of <literal>ANY 3 (s1, s2, s3, s4)</> makes transaction commits
+        wait until receiving receipts from at least any three standbys
+        of four listed servers <literal>s1</>, <literal>s2</>, <literal>s3</>,
This could just mention WAL records instead of "receipts".

Instead of saying "an integer number N", we could use <literal>num_sync</>.

+         If <literal>FIRST</> or <literal>ANY</> are not specified, this parameter
+         behaves as <literal>ANY</>. Note that this grammar is incompatible with
+         <productname>PostgresSQL</> 9.6, where no keyword specified is equivalent
+         as if <literal>FIRST</> was specified. In short, there is no real need to
+         specify <replaceable class="parameter">num_sync</replaceable> as this
+         behavior does not have changed, as well as it is not necessary to mention
+         pre-9.6 versions are the multi-sync grammar has been added in 9.6.
This paragraph could be reworked, say:
if FIRST or ANY are not specified this parameter behaves as if ANY is
used. Note that this grammar is incompatible with PostgreSQL 9.6 which
is the first version supporting multiple standbys with synchronous
replication, where no such keyword FIRST or ANY can be used. Note that
the grammar behaves as if FIRST is used, which is incompatible with
the post-9.6 behavior.

+     <entry>Synchronous state of this standby server. <literal>quorum-N</>
+     , where N is the number of synchronous standbys that transactions
+     need to wait for replies from, when standby is considered as a
+     candidate of quorum commit.</entry>
Nitpicking: I think that the comma goes to the previous line if it is
the first character of a line.

+   if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+       return SyncRepGetSyncStandbysPriority(am_sync);
+   else /* SYNC_REP_QUORUM */
+       return SyncRepGetSyncStandbysQuorum(am_sync)
Or that?
if (PRIORITY)
    return StandbysPriority();
else if (QUORUM)
    return StandbysQuorum();
else
    elog(ERROR, "Boom");

+ * In priority method, we need the oldest these positions among sync
+ * standbys. In quorum method, we need the newest these positions
+ * specified by SyncRepConfig->num_sync.
The last sentence is grammatically incorrect, and it would be more correct
to specify the Nth LSN position, so that k standbys can be selected from
a set of n.

+           SpinLockAcquire(&walsnd->mutex);
+           write_array[i] = walsnd->write;
+           flush_array[i]= walsnd->flush;
+           apply_array[i] = walsnd->flush;
+           SpinLockRelease(&walsnd->mutex);
A nit: add a space to the left of the second = character. And you
need to save the apply position of the WAL sender, not the flush
position, in the array that is going to be ordered.

            /*
             * More easily understood version of standby state. This is purely
-            * informational, not different from priority.
+            * informational. In quorum method, we add the number to indicate
+            * how many in the set need to give a commit confirmation.
             */
            if (priority == 0)
                values[7] = CStringGetTextDatum("async");
            else if (list_member_int(sync_standbys, i))
-               values[7] = CStringGetTextDatum("sync");
+               values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+                   CStringGetTextDatum("sync") : CStringGetTextDatum("quorum")
This code block and its explanation comments tell two different
stories. The comment is saying that something like "quorum-N" is used
but the code always prints "quorum".

It may be a good idea in the test to check that when no keyword is
specified the group of standbys is in quorum mode.

Yeah, I will add some tests.

I will post a new version of the patch incorporating the other comments.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#31

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#30)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Mon, Nov 14, 2016 at 5:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I will post a new version of the patch incorporating the other comments.

Attached is the latest version of the patch, incorporating the review
comments. After more thought, I agree, and I changed the standby priority
in the quorum method so that it is no longer forced to 1. All standby
priorities are 1 if s_s_names = 'ANY(*)'.
Please review this patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_quorum_commit_v6.patchapplication/x-patch; name=000_quorum_commit_v6.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..e125dff 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3028,42 +3028,75 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
+        in this list, and
         that are both currently connected and streaming data in real-time
         (as shown by a state of <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servers.
         </para>
         <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        The keyword <literal>FIRST</>, coupled with <literal>num_sync</>, makes
+        transaction commit wait until WAL records are received from the
+        <literal>num_sync</> standbys with higher priority number.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+        <para>
+        The keyword <literal>ANY</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from at least <literal>num_sync</> connected standbys among those
+        defined in the list of <varname>synchronous_standby_names</>. For
+        example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving WAL records from at least
+        any three standbys of four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.
+        </para>
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive words
+        and standby names containing these words must be double-quoted.
+        </para>
+        <para>
+        The third syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
-       </para>
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
+        </para>
+        <note>
+         <para>
+         If <literal>FIRST</> or <literal>ANY</> are not specified, this parameter
+         behaves as if <literal>ANY</> is used. Note that this grammar is incompatible
+         with <productname>PostgreSQL</> 9.6 which is the first version supporting multiple
+         standbys with synchronous replication, where no such keyword <literal>FIRST</>
+         or <literal>ANY</> can be used. Note that the grammar behaves as if <literal>FIRST</>
+         is used, which is incompatible with the post-9.6 version behavior.
+        </para>
+       </note>
        <para>
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 5bedaf2..7a0a22a 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1161,6 +1161,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     <literal>s2</> fails. <literal>s4</> is an asynchronous standby since
     its name is not in the list.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standbys is:
+<programlisting>
+synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+    <literal>s2</> and <literal>s3</> will be considered as synchronous standby
+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..2c5f3de 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1389,7 +1389,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     <row>
      <entry><structfield>sync_state</></entry>
      <entry><type>text</></entry>
-     <entry>Synchronous state of this standby server</entry>
+     <entry>Synchronous state of this standby server. It is <literal>quorum</>
+     when the standby is considered a candidate for quorum commit.</entry>
     </row>
    </tbody>
    </tgroup>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..bcc1317 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,19 @@
  *
  * In 9.5 or before only a single standby could be considered as
  * synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, all the standbys appearing in the list are
+ * considered as candidates for quorum commit.
  *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
@@ -73,24 +76,27 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
 static bool announce_next_takeover = true;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +392,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +419,10 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using the configured method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +440,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +476,50 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys according to the synchronization method,
+ * or NIL if no sync standby is connected. The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid synchronization method is specified \"%d\"",
+						SyncRepConfig->sync_method)));
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, we need the oldest of these positions among sync
+ * standbys. In quorum method, we need the N-th newest of these positions,
+ * where N is specified by SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -507,29 +545,74 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 		return false;
 	}
 
-	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
-	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
+	}
+	else /* SYNC_REP_QUORUM */
+	{
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i] = walsnd->flush;
+			apply_array[i] = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th newest Write, Flush, Apply positions
+		 * specified by SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
 	}
 
 	list_free(sync_standbys);
@@ -537,17 +620,66 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
  *
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
  * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +692,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -892,6 +1018,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,14 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
+{first_ident}	{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index bc5e508..04fe994 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2860,12 +2860,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 
 			/*
 			 * More easily understood version of standby state. This is purely
-			 * informational, not different from priority.
+			 * informational. In quorum method, all standbys are considered
+			 * candidates for quorum commit, so the standby state is always 'quorum'.
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..8dd74a3 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..c502d20 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 11;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'FIRST 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'FIRST 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'FIRST 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'FIRST 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,34 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check that all standbys listed as voters are reported in 'quorum'
+# state when synchronous_standby_names uses quorum method.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'ANY 2(standby1, standby2)');
+
+# Check that when no method is specified, the states of standbys are
+# the same as when 'ANY' is specified.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'not specify synchronization method',
+'2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that the setting of 'ANY 2(*)' chooses all standbys as
+# voters.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as candidates for quorum commit',
+'ANY 2(*)');

#32

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#31)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Tue, Nov 15, 2016 at 7:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached is the latest version of the patch, incorporating the review
comments. After more thought, I agree and changed the value of standby
priority in quorum method so that it is not forcibly set to 1. All standby
priorities are 1 if s_s_names = 'ANY(*)'.
Please review this patch.

Sorry for the late reply. Here is my final look at the patch.

 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable
class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> (
<replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> (
<replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...
This can just be replaced with [ ANY | FIRST ]. There is no need for
braces as the keyword is not mandatory.

+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servres.
         </para>
s/servres/servers/.

            if (priority == 0)
                values[7] = CStringGetTextDatum("async");
            else if (list_member_int(sync_standbys, i))
-               values[7] = CStringGetTextDatum("sync");
+               values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+                   CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
            else
                values[7] = CStringGetTextDatum("potential");
This can be simplified a bit as "quorum" is the state value for all
standbys with a non-zero priority when the method is set to
SYNC_REP_QUORUM:
            if (priority == 0)
                values[7] = CStringGetTextDatum("async");
+           else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+               values[7] = CStringGetTextDatum("quorum");
            else if (list_member_int(sync_standbys, i))
                values[7] = CStringGetTextDatum("sync");
            else
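
To make the suggested ordering of checks concrete, here is a rough standalone sketch in C. It is only an illustration of the decision order proposed above, not the actual walsender.c code: the enum names and the functions sync_state_for() and sync_state_name() are hypothetical stand-ins for SYNC_REP_PRIORITY, SYNC_REP_QUORUM and the values[7] assignment.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for SYNC_REP_PRIORITY and SYNC_REP_QUORUM. */
typedef enum { METHOD_PRIORITY = 0, METHOD_QUORUM = 1 } SyncMethod;

/* The four values that sync_state can take with this patch. */
typedef enum { STATE_ASYNC, STATE_QUORUM, STATE_SYNC, STATE_POTENTIAL } SyncState;

/*
 * Decide the reported state of one standby.
 *
 * priority      - sync priority of the standby (0 means not listed)
 * method        - configured synchronization method
 * in_sync_list  - whether the standby is in the current sync list
 *                 (only consulted for the priority method)
 */
static SyncState
sync_state_for(int priority, SyncMethod method, bool in_sync_list)
{
	if (priority == 0)
		return STATE_ASYNC;		/* not listed at all */
	if (method == METHOD_QUORUM)
		return STATE_QUORUM;	/* every listed standby is a candidate */
	if (in_sync_list)
		return STATE_SYNC;		/* chosen by priority */
	return STATE_POTENTIAL;		/* listed, but not currently chosen */
}

/* Map a state to the text shown in the sync_state column. */
static const char *
sync_state_name(SyncState state)
{
	switch (state)
	{
		case STATE_ASYNC:
			return "async";
		case STATE_QUORUM:
			return "quorum";
		case STATE_SYNC:
			return "sync";
		case STATE_POTENTIAL:
			break;
	}
	return "potential";
}
```

With the quorum check placed second, the sync list is consulted only for the priority method, which is the simplification being asked for.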

SyncRepConfig is made external to syncrep.c with this patch because
walsender.c needs to look at the sync method in place. I have no complaint
about that, after considering whether there could be a more elegant way
to do things without this change.

While reviewing the patch, I found a couple of awkwardly worded
sentences, both in the docs and in some comments. Attached is a new
version with this word-smithing. The patch is now marked as ready
for committer.
--
Michael

Attachments:

000_quorum_commit_v7.patch (application/x-patch)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index dcd0663..bff932b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3029,42 +3029,76 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
+        in this list, and
         that are both currently connected and streaming data in real-time
         (as shown by a state of <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ ANY | FIRST ] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servers.
         </para>
         <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        The keyword <literal>FIRST</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from the <literal>num_sync</> standbys with the highest priority.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by the three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+        <para>
+        The keyword <literal>ANY</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from at least <literal>num_sync</> connected standbys among those
+        defined in the list of <varname>synchronous_standby_names</>. For
+        example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving WAL records from at least
+        three standbys among the four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.
+        </para>
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive keywords,
+        and a standby name that matches either word must be double-quoted.
+        </para>
+        <para>
+        The second syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
-       </para>
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable> equal to
+        1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
+        </para>
+       <note>
+        <para>
+         If neither <literal>FIRST</> nor <literal>ANY</> is specified, this
+         parameter behaves as if <literal>ANY</> is used. Note that this
+         grammar is incompatible with <productname>PostgreSQL</> 9.6, which
+         is the first version supporting multiple standbys with synchronous
+         replication and in which the keywords <literal>FIRST</> and
+         <literal>ANY</> cannot be used. In 9.6 the grammar behaves as if
+         <literal>FIRST</> is used, which is incompatible with the behavior
+         of later versions.
+        </para>
+       </note>
        <para>
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 5bedaf2..b57b4ca 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1161,6 +1161,19 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     <literal>s2</> fails. <literal>s4</> is an asynchronous standby since
     its name is not in the list.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standbys is:
+<programlisting>
+synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys
+    <literal>s1</>, <literal>s2</> and <literal>s3</> will be considered as
+    synchronous standby candidates. The master server will wait for at least
+    2 replies from them. <literal>s4</> is an asynchronous standby since its
+    name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..2c5f3de 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1389,7 +1389,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     <row>
      <entry><structfield>sync_state</></entry>
      <entry><type>text</></entry>
-     <entry>Synchronous state of this standby server</entry>
+     <entry>Synchronous state of this standby server. It is <literal>quorum</>
+     when the standby is considered a candidate for quorum commit.</entry>
     </row>
    </tbody>
    </tgroup>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..efe7182 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,19 @@
  *
  * In 9.5 or before only a single standby could be considered as
  * synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, all standbys appearing in the list are considered
+ * as candidates for quorum commit.
  *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
@@ -73,24 +76,27 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
 static bool announce_next_takeover = true;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +392,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +419,12 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the write,
+	 * flush and apply LSN positions among all sync standbys using the
+	 * method specified.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr,
+									  &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +442,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +478,55 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys according to the synchronous method,
+ * or NIL if no sync standby is connected.
+ *
+ * The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("incorrect synchronization method for standbys is specified \"%d\"",
+						SyncRepConfig->sync_method)));
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, the oldest flush, apply and write positions among
+ * all the sync standbys are calculated. In quorum method, in order to
+ * select the subset of standbys in the existing quorum set that will
+ * satisfy the conditions to be selected as synchronous, calculate the
+ * N-th newest flush, apply and write positions. In the latter case,
+ * the N-th element is defined by SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -508,46 +553,145 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	}
 
 	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
+	 * Switch through the calculation methods.
 	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+	{
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
+	}
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i] = walsnd->flush;
+			apply_array[i] = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th newest Write, Flush, Apply positions specified by
+		 * SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
 	}
+	else
+		elog(ERROR, "incorrect synchronization method for standbys");
 
 	list_free(sync_standbys);
 	return true;
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or NIL if no sync
+ * standby is connected. So this function returns the list of standbys except
+ * for the standbys which are not active or connected as asynchronous
+ * standbys.
  *
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as a synchronous candidate and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
  * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +704,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -892,6 +1030,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,14 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
+{first_ident}	{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index aa42d59..8d29dc4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2862,10 +2862,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 
 			/*
 			 * More easily understood version of standby state. This is purely
-			 * informational, not different from priority.
+			 * informational. In quorum method, since all standbys are
+			 * considered as a candidate of quorum commit, state is always
+			 * set to 'quorum' for all the standbys.
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
+			else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+				values[7] = CStringGetTextDatum("quorum");
 			else if (list_member_int(sync_standbys, i))
 				values[7] = CStringGetTextDatum("sync");
 			else
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..8dd74a3 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..13978bb 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 11;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'FIRST 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'FIRST 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'FIRST 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'FIRST 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,33 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check state of a quorum set made of two standbys with an asynchronous
+# standby.
+test_sync_state(
+	$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+	'quorum set of two standbys with one async standby',
+	'ANY 2(standby1, standby2)');
+
+# Check that the state of standbys does not change when 'ANY' is omitted.
+test_sync_state(
+	$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+	'synchronization method definition omitted, switching to default quorum',
+	'2(standby1, standby2)');
+
+# Start standby3, which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that setting of 'ANY 2(*)' chooses all standbys as candidates for
+# quorum commit.
+test_sync_state(
+	$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+	'all standbys selected as candidates for quorum commit',
+	'ANY 2(*)');
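
For readers following the quorum calculation in SyncRepGetSyncRecPtr above, here is a minimal standalone sketch of the two selection rules, not the patch's code. It assumes a plain 64-bit integer (called Lsn here) stands in for XLogRecPtr, and the names cmp_lsn_desc, quorum_lsn and oldest_lsn are made up for illustration. It also assumes the caller has already checked that at least num_sync standbys are present, as the patch does before reaching this point.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for XLogRecPtr (assumption: a 64-bit WAL position). */
typedef uint64_t Lsn;

/* qsort() comparator that orders LSNs newest (largest) first. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
	Lsn			lsn1 = *(const Lsn *) a;
	Lsn			lsn2 = *(const Lsn *) b;

	if (lsn1 > lsn2)
		return -1;
	if (lsn1 < lsn2)
		return 1;
	return 0;
}

/*
 * Quorum method: the num_sync-th newest position is the newest LSN that
 * at least num_sync standbys have reached.  Sorts the array in place.
 */
static Lsn
quorum_lsn(Lsn *positions, int len, int num_sync)
{
	assert(num_sync >= 1 && num_sync <= len);
	qsort(positions, (size_t) len, sizeof(Lsn), cmp_lsn_desc);
	return positions[num_sync - 1];
}

/*
 * Priority method: every chosen sync standby must have reached the LSN,
 * so the result is the oldest (smallest) position among them.
 */
static Lsn
oldest_lsn(const Lsn *positions, int len)
{
	Lsn			result;
	int			i;

	assert(len >= 1);
	result = positions[0];
	for (i = 1; i < len; i++)
	{
		if (positions[i] < result)
			result = positions[i];
	}
	return result;
}
```

With three standbys at positions 100, 300 and 200, oldest_lsn gives 100, while quorum_lsn with num_sync = 2 gives 200, because two of the three standbys have reached at least that position.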

#33

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Michael Paquier (#32)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Sat, Nov 26, 2016 at 10:27 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Nov 15, 2016 at 7:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached is the latest version of the patch, incorporating the review
comments. After more thought, I agree and have changed the standby
priority in the quorum method so that it is no longer forced to 1. All
standby priorities are 1 if s_s_names = 'ANY(*)'.
Please review this patch.

Sorry for my late reply. Here is my final look.

Thank you for reviewing!

<synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable
class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> (
<replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> (
<replaceable class="parameter">standby_name</replaceable> [, ...] )
<replaceable class="parameter">standby_name</replaceable> [, ...
This can just be replaced with [ ANY | FIRST ]. There is no need for
braces as the keyword is not mandatory.
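
To illustrate the optional-keyword rule discussed here, a small hedged sketch follows. It is not the flex/bison code from the patch. The helper names token_is and method_for_keyword are hypothetical, and it assumes, as this version of the patch does, that an omitted keyword selects the quorum method.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Same values as the SYNC_REP_* macros the patch adds to syncrep.h. */
enum
{
	SYNC_REP_PRIORITY = 0,
	SYNC_REP_QUORUM = 1
};

/* Case-insensitive match of a token against a lowercase keyword. */
static int
token_is(const char *token, const char *keyword)
{
	while (*token != '\0' && *keyword != '\0')
	{
		if (tolower((unsigned char) *token) != (unsigned char) *keyword)
			return 0;
		token++;
		keyword++;
	}
	return *token == '\0' && *keyword == '\0';
}

/*
 * Map the optional leading keyword to a synchronization method.
 * A NULL keyword means it was omitted (hypothetical convention).
 */
static int
method_for_keyword(const char *keyword)
{
	if (keyword == NULL || token_is(keyword, "any"))
		return SYNC_REP_QUORUM;
	assert(token_is(keyword, "first"));
	return SYNC_REP_PRIORITY;
}
```

Under these assumptions, '2 (s1, s2)' and 'ANY 2 (s1, s2)' both select the quorum method, and 'first 2 (s1, s2)' selects the priority method regardless of letter case.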

+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servres.
</para>
s/servres/servers/.

if (priority == 0)
values[7] = CStringGetTextDatum("async");
else if (list_member_int(sync_standbys, i))
-               values[7] = CStringGetTextDatum("sync");
+               values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+                   CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[7] = CStringGetTextDatum("potential");
This can be simplified a bit as "quorum" is the state value for all
standbys with a non-zero priority when the method is set to
SYNC_REP_QUORUM:
if (priority == 0)
values[7] = CStringGetTextDatum("async");
+           else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+               values[7] = CStringGetTextDatum("quorum");
else if (list_member_int(sync_standbys, i))
values[7] = CStringGetTextDatum("sync");
else

Agreed.
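
For reference, the simplified branch order agreed on above can be written as a standalone function. This is only a sketch: sync_state_label is a hypothetical name, and the in_sync_list flag stands in for the list_member_int(sync_standbys, i) check in pg_stat_get_wal_senders.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as the SYNC_REP_* macros the patch adds to syncrep.h. */
enum
{
	SYNC_REP_PRIORITY = 0,
	SYNC_REP_QUORUM = 1
};

/*
 * Return the sync_state text for one standby.  In the quorum method every
 * standby with a non-zero priority is reported as "quorum", so that check
 * comes before the sync/potential distinction of the priority method.
 */
static const char *
sync_state_label(int priority, int sync_method, bool in_sync_list)
{
	assert(priority >= 0);

	if (priority == 0)
		return "async";
	if (sync_method == SYNC_REP_QUORUM)
		return "quorum";
	if (in_sync_list)
		return "sync";
	return "potential";
}
```

Checking the method before list membership is what lets the ternary from the earlier version go away: "sync" and "potential" can only be reached in the priority method.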

While reviewing the patch, I have found a couple of incorrectly shaped
sentences, both in the docs and some comments. Attached is a new
version with this word-smithing. The patch is now switched as ready
for committer.

Thanks. I found a typo in the v7 patch, so I have attached the latest v8 patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_quorum_commit_v8.patchapplication/x-patch; name=000_quorum_commit_v8.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d8d207e..4baff32 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3032,42 +3032,76 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
+        in this list, and
         that are both currently connected and streaming data in real-time
         (as shown by a state of <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ ANY | FIRST ] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servers.
         </para>
         <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        The keyword <literal>FIRST</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from the <literal>num_sync</> standbys with highest priority number.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by the three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+        <para>
+        The keyword <literal>ANY</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from at least <literal>num_sync</> connected standbys among those
+        defined in the list of <varname>synchronous_standby_names</>. For
+        example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving WAL records from at least
+        three standbys among the four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.
+        </para>
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive keywords;
+        a standby name that matches either keyword must be double-quoted.
+        </para>
+        <para>
+        The second syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
-       </para>
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable> equal to
+        1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
+        </para>
+       <note>
+        <para>
+         If neither <literal>FIRST</> nor <literal>ANY</> is specified,
+         this parameter behaves as if <literal>ANY</> is used. Note that
+         this is incompatible with <productname>PostgreSQL</> 9.6, the
+         first version supporting multiple standbys with synchronous
+         replication. In 9.6 the keywords <literal>FIRST</> and
+         <literal>ANY</> cannot be used, and the grammar without a
+         keyword behaves as if <literal>FIRST</> were specified, unlike
+         in later versions.
+        </para>
+       </note>
        <para>
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 5bedaf2..b57b4ca 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1161,6 +1161,19 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     <literal>s2</> fails. <literal>s4</> is an asynchronous standby since
     its name is not in the list.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standbys is:
+<programlisting>
+ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys
+    <literal>s1</>, <literal>s2</> and <literal>s3</> will be considered as
+    synchronous standby candidates. The master server will wait for at least
+    2 replies from them. <literal>s4</> is an asynchronous standby since its
+    name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..2c5f3de 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1389,7 +1389,8 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
     <row>
      <entry><structfield>sync_state</></entry>
      <entry><type>text</></entry>
-     <entry>Synchronous state of this standby server</entry>
+     <entry>Synchronous state of this standby server. It is <literal>quorum</>
+     when the standby is considered a candidate for quorum commit.</entry>
     </row>
    </tbody>
    </tgroup>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..26f04e8 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,19 @@
  *
  * In 9.5 or before only a single standby could be considered as
  * synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, all standbys appearing in the list are considered
+ * as candidates for quorum commit.
  *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
@@ -73,24 +76,27 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
 static bool announce_next_takeover = true;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +392,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +419,12 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the flush,
+	 * apply and write LSN positions among all sync standbys using the
+	 * specified method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr,
+									  &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +442,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +478,55 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys according to the synchronous method,
+ * or NIL if no sync standby is connected.
+ *
+ * The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("incorrect synchronization method for standbys is specified \"%d\"",
+						SyncRepConfig->sync_method)));
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, the oldest flush, apply and write positions among
+ * all the sync standbys are calculated. In quorum method, in order to
+ * select the subset of standbys in the existing quorum set that will
+ * satisfy the conditions to be selected as synchronous, calculate the
+ * N-th newest flush, apply and write positions. In the latter case,
+ * the N-th element is defined by SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -508,46 +553,145 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	}
 
 	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
+	 * Switch through the calculation methods.
 	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+	{
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
+	}
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i] = walsnd->flush;
+			apply_array[i] = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th newest Write, Flush, Apply positions specified by
+		 * SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
 	}
+	else
+		elog(ERROR, "incorrect synchronization method for standbys");
 
 	list_free(sync_standbys);
 	return true;
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or NIL if no sync
+ * standby is connected. So this function returns the list of standbys except
+ * for the standbys which are not active or connected as asynchronous
+ * standbys.
  *
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as synchronous candidate and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
  * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +704,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -892,6 +1030,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,14 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
+{first_ident}	{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index aa42d59..8d29dc4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2862,10 +2862,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 
 			/*
 			 * More easily understood version of standby state. This is purely
-			 * informational, not different from priority.
+			 * informational. In quorum method, since all standbys are
+			 * considered as a candidate of quorum commit, state is always
+			 * set to 'quorum' for all the standbys.
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
+			else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+				values[7] = CStringGetTextDatum("quorum");
 			else if (list_member_int(sync_standbys, i))
 				values[7] = CStringGetTextDatum("sync");
 			else
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..8dd74a3 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..13978bb 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 11;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'FIRST 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'FIRST 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'FIRST 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'FIRST 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,33 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check state of a quorum set made of two standbys with an asynchronous
+# standby.
+test_sync_state(
+	$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+	'quorum set of two standbys with one async standby',
+	'ANY 2(standby1, standby2)');
+
+# Check that the state of standbys does not change when 'ANY' is omitted.
+test_sync_state(
+	$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+	'synchronization method definition omitted, switching to default quorum',
+	'2(standby1, standby2)');
+
+# Start standby3, which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that setting of 'ANY 2(*)' chooses all standbys as candidates for
+# quorum commit.
+test_sync_state(
+	$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+	'all standbys selected as candidates for quorum commit',
+	'ANY 2(*)');

#34

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#33)

Re: Quorum commit for multiple synchronous replication.

On Mon, Nov 28, 2016 at 8:03 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Nov 26, 2016 at 10:27 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Nov 15, 2016 at 7:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached is the latest version of the patch, incorporating the review
comments. After more thought, I agree, and I changed the standby
priority in the quorum method so that it is no longer forced to 1. All
standby priorities are 1 if s_s_names = 'ANY(*)'.
Please review this patch.

Sorry for my late reply. Here is my final look.

Thank you for reviewing!
<synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable
class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> (
<replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> (
<replaceable class="parameter">standby_name</replaceable> [, ...] )
<replaceable class="parameter">standby_name</replaceable> [, ...
This can just be replaced with [ ANY | FIRST ]. There is no need for
braces as the keyword is not mandatory.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servres.
</para>
s/servres/servers/.
if (priority == 0)
values[7] = CStringGetTextDatum("async");
else if (list_member_int(sync_standbys, i))
-               values[7] = CStringGetTextDatum("sync");
+               values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+                   CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
else
values[7] = CStringGetTextDatum("potential");
This can be simplified a bit as "quorum" is the state value for all
standbys with a non-zero priority when the method is set to
SYNC_REP_QUORUM:
if (priority == 0)
values[7] = CStringGetTextDatum("async");
+           else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+               values[7] = CStringGetTextDatum("quorum");
else if (list_member_int(sync_standbys, i))
values[7] = CStringGetTextDatum("sync");
else
With this patch, SyncRepConfig is made external to syncrep.c because
walsender.c needs to look at the sync method. I have no complaint about
that, having considered whether there could be a more elegant way to
do things without this change.
Agreed.

While reviewing the patch, I found a couple of awkwardly worded
sentences, both in the docs and in some comments. Attached is a new
version with this word-smithing. The patch is now marked as ready
for committer.

Thanks. I found a typo in the v7 patch, so I have attached the latest v8 patch.

Moved patch to CF 2017-01, with same status "Ready for committer".
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#33)

Re: Quorum commit for multiple synchronous replication.

On Mon, Nov 28, 2016 at 8:03 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

+        qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+        qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+        qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);

In quorum commit, we need to calculate the N-th largest LSN among the
LSNs of M quorum synchronous standbys. N would usually be 1 - 3 and
M would be 1 - 10, I guess. You used qsort for that calculation, but
I'm not sure whether that algorithm is efficient enough.

If M (i.e., the number of quorum sync standbys) is large enough,
your choice would be good. But usually M does not seem to be that large.

Thoughts?
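
For illustration, the calculation discussed here (the N-th largest LSN
out of M values, obtained by a full descending sort) can be sketched
as a standalone C function. This is only a sketch under assumptions:
Lsn is a stand-in for XLogRecPtr, and nth_largest_qsort and
cmp_lsn_desc are hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr (assumption) */

/* Comparator that sorts in descending order, like cmp_lsn in the patch. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
    Lsn         lsn1 = *(const Lsn *) a;
    Lsn         lsn2 = *(const Lsn *) b;

    if (lsn1 > lsn2)
        return -1;
    if (lsn1 < lsn2)
        return 1;
    return 0;
}

/*
 * Return the n-th largest value (1-based) among lsns[0..m-1].
 * The input is copied first, so the caller's array is left untouched.
 * Cost is dominated by the sort, typically O(M log M).
 */
static Lsn
nth_largest_qsort(const Lsn *lsns, int m, int n)
{
    Lsn        *copy;
    Lsn         result;

    assert(m >= 1 && n >= 1 && n <= m);

    copy = malloc(sizeof(Lsn) * m);
    assert(copy != NULL);
    memcpy(copy, lsns, sizeof(Lsn) * m);

    qsort(copy, m, sizeof(Lsn), cmp_lsn_desc);
    result = copy[n - 1];

    free(copy);
    return result;
}
```

With flush positions {500, 300, 900, 700} and N = 2, the function
returns 700: two standbys have flushed at least that far.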

Regards,

--
Fujii Masao


#36

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Fujii Masao (#35)

Re: Quorum commit for multiple synchronous replication.

On Tue, Dec 6, 2016 at 1:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

+        qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+        qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+        qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
In quorum commit, we need to calculate the N-th largest LSN from
M quorum synchronous standbys' LSN. N would be usually 1 - 3 and
M would be 1 - 10, I guess. You used the algorithm using qsort for
that calculation. But I'm not sure if that's enough effective algorithm
or not.

If M (i.e., number of quorum sync standbys) is enough large,
your choice would be good. But usually M seems not so large.

Thank you for the comment.

Another possible idea is to use partial selection sort [1],
which takes O(MN) time. Since it is more efficient when N is small,
it would be better than qsort for this case. But I'm not sure that
the difference would be visible in performance measurements.

[1]: https://en.wikipedia.org/wiki/Selection_algorithm#Partial_selection_sort
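
A minimal sketch of the partial selection sort idea, for illustration
only. It assumes a plain array of LSNs; Lsn is a stand-in for
XLogRecPtr and nth_largest_partial is a hypothetical name. Only the
first N positions are put in order, so the cost is O(MN).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr (assumption) */

/*
 * Return the n-th largest value (1-based) among lsns[0..m-1] using a
 * partial selection sort. The array is reordered in place: after the
 * call, lsns[0..n-1] hold the n largest values in descending order.
 */
static Lsn
nth_largest_partial(Lsn *lsns, int m, int n)
{
    int         i;
    int         j;

    assert(m >= 1 && n >= 1 && n <= m);

    for (i = 0; i < n; i++)
    {
        int         max_idx = i;
        Lsn         tmp;

        /* Find the largest remaining element in lsns[i..m-1]. */
        for (j = i + 1; j < m; j++)
        {
            if (lsns[j] > lsns[max_idx])
                max_idx = j;
        }

        /* Move it into position i. */
        tmp = lsns[i];
        lsns[i] = lsns[max_idx];
        lsns[max_idx] = tmp;
    }

    return lsns[n - 1];
}
```

For N = 1 this degenerates to a single scan for the maximum, which is
the case where it most clearly does less work than a full sort.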

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#37

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#36)

Re: Quorum commit for multiple synchronous replication.

On Tue, Dec 6, 2016 at 6:57 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 6, 2016 at 1:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

If M (i.e., number of quorum sync standbys) is enough large,
your choice would be good. But usually M seems not so large.

Thank you for the comment.

Another possible idea is to use partial selection sort [1],
which takes O(MN) time. Since it is more efficient when N is small,
it would be better than qsort for this case. But I'm not sure that
the difference would be visible in performance measurements.

[1] https://en.wikipedia.org/wiki/Selection_algorithm#Partial_selection_sort

We'll begin to see a minimal performance impact only when selecting a
sync standby across hundreds of them, which is something that, say,
0.1% (or less) of existing deployments are doing. The current approach
seems simple enough to be kept, and performance is not something to
worry about much, IMHO.
--
Michael


#38

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#36)

Re: Quorum commit for multiple synchronous replication.

On Tue, Dec 6, 2016 at 6:57 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 6, 2016 at 1:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
+        qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+        qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+        qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
In quorum commit, we need to calculate the N-th largest LSN from
M quorum synchronous standbys' LSN. N would be usually 1 - 3 and
M would be 1 - 10, I guess. You used the algorithm using qsort for
that calculation. But I'm not sure if that's enough effective algorithm
or not.

If M (i.e., number of quorum sync standbys) is enough large,
your choice would be good. But usually M seems not so large.
Thank you for the comment.

Another possible idea is to use partial selection sort [1],
which takes O(MN) time. Since it is more efficient when N is small,
it would be better than qsort for this case. But I'm not sure that
the difference would be visible in performance measurements.

So, isn't it better to compare the performance of a few algorithms and
confirm which is best for quorum commit? Since this code is hot, i.e.,
it can be executed very frequently, I'd like to avoid wasting cycles as
much as possible.

Regards,

--
Fujii Masao


#39

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Fujii Masao (#38)

Re: Quorum commit for multiple synchronous replication.

On Wed, Dec 7, 2016 at 12:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

So, isn't it better to compare the performance of some algorithms and
confirm which is the best for quorum commit? Since this code is hot, i.e.,
can be very frequently executed, I'd like to avoid waste of cycle as much
as possible.

It seems to me that it would be simple enough to write a script to do
that, to avoid any other noise: allocate an array with N random
elements, and fetch the M-th element from it after applying a sort
method. I highly doubt that you'd see much difference with a low
number of elements; if you scale to a thousand standbys in a
quorum set you would surely see something :*)
Anybody willing to try it out?
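
A rough sketch of what such a measurement could look like in C, for
illustration only. Everything here is an assumption rather than part of
the patch: Lsn stands in for XLogRecPtr, fill_random and
time_qsort_pick are hypothetical helpers, the random data comes from a
simple xorshift generator, and the timing uses clock(), so it reports
CPU time and includes the cost of refilling the array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr (assumption) */

/* Comparator that sorts in descending order. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
    Lsn         lsn1 = *(const Lsn *) a;
    Lsn         lsn2 = *(const Lsn *) b;

    if (lsn1 > lsn2)
        return -1;
    if (lsn1 < lsn2)
        return 1;
    return 0;
}

/* Fill the array with repeatable pseudo-random values (xorshift64). */
static void
fill_random(Lsn *lsns, int len, uint64_t seed)
{
    uint64_t    x = (seed != 0) ? seed : 1;
    int         i;

    for (i = 0; i < len; i++)
    {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        lsns[i] = x;
    }
}

/*
 * Repeat "fill, sort descending, pick the nth element" iters times.
 * Returns the CPU seconds consumed; *picked receives the value chosen
 * in the last iteration so that the work cannot be optimized away.
 */
static double
time_qsort_pick(int len, int nth, int iters, Lsn *picked)
{
    Lsn        *lsns;
    clock_t     start;
    double      elapsed;
    int         it;

    assert(len >= 1 && nth >= 1 && nth <= len && iters >= 1);

    lsns = malloc(sizeof(Lsn) * len);
    assert(lsns != NULL);

    start = clock();
    for (it = 0; it < iters; it++)
    {
        fill_random(lsns, len, (uint64_t) it + 1);
        qsort(lsns, len, sizeof(Lsn), cmp_lsn_desc);
        *picked = lsns[nth - 1];
    }
    elapsed = (double) (clock() - start) / CLOCKS_PER_SEC;

    free(lsns);
    return elapsed;
}
```

Calling time_qsort_pick with len set to 10 and then to 1000, and
swapping the qsort call for another selection method, would give the
kind of comparison suggested above.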
--
Michael


#40

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#39)

Re: Quorum commit for multiple synchronous replication.

At Wed, 7 Dec 2016 13:26:38 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSyfsg=gHeqgXyzP0iGWvdyrXqnG-UENzfueaU=2m5-zg@mail.gmail.com>

On Wed, Dec 7, 2016 at 12:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

So, isn't it better to compare the performance of some algorithms and
confirm which is the best for quorum commit? Since this code is hot, i.e.,
can be very frequently executed, I'd like to avoid waste of cycle as much
as possible.

It seems to me that it would be simple enough to write a script to do
that to avoid any other noise: allocate an array with N random
elements, and fetch the M-th element from it after applying a sort
method. I highly doubt that you'd see much difference with a low
number of elements, now if you scale at a thousand standbys in a
quorum set you may surely see something :*)
Anybody willing to try out?

Aside from measuring the two sorting methods, I'd like to
point out that quorum commit basically doesn't need
sorting. Counting the conforming standbys while scanning the
walsender (receiver) LSN list and comparing each against the
target LSN is O(n). A small refactoring of
SyncRepGetOldestSyncRecPtr would be enough to do that.
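
One possible reading of this idea, sketched in C for illustration. The
assumptions are significant: "target LSN" is taken to mean an LSN that
a waiting backend needs acknowledged, Lsn stands in for XLogRecPtr, and
quorum_reached is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr (assumption) */

/*
 * Return true if at least num_sync of the m standby positions have
 * reached the target LSN. This is a single O(m) pass with no sorting
 * and no extra allocation; it stops as soon as the quorum is met.
 */
static bool
quorum_reached(const Lsn *lsns, int m, int num_sync, Lsn target)
{
    int         count = 0;
    int         i;

    assert(m >= 0 && num_sync >= 1);

    for (i = 0; i < m; i++)
    {
        if (lsns[i] >= target)
        {
            count++;
            if (count >= num_sync)
                return true;
        }
    }

    return false;
}
```

Note the difference from the patch: this answers a yes/no question for
one given LSN, whereas the sort-based code computes the N-th newest
position itself.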

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#41

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#40)

Re: Quorum commit for multiple synchronous replication.

On Wed, Dec 7, 2016 at 2:49 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Aside from measuring the two sorting methods, I'd like to
point out that quorum commit basically doesn't need
sorting. Counting the conforming standbys while scanning the
walsender (receiver) LSN list and comparing each against the
target LSN is O(n). A small refactoring of
SyncRepGetOldestSyncRecPtr would be enough to do that.

Indeed, I hadn't thought about that, and that's a no-brainer. That
would remove the need to allocate and sort each array; all that is
needed is to track the number of times a newest value has been found.
So this processing would update the write/flush/apply values during
the first k loops if the new value is *older* than the current one,
where k is the quorum number, and between k+1 and N a value gets
updated only if the compared value is newer. There is no need to hold
the mutex lock for a long time either. By the way, the patch now
conflicts with HEAD and needs a refresh.
--
Michael

#42

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Michael Paquier (#41)

Re: Quorum commit for multiple synchronous replication.

On Wed, Dec 7, 2016 at 4:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 7, 2016 at 2:49 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Aside from measurement of the two sorting methods, I'd like to
point out that quorum commit basically doesn't need
sorting. Counting conforming standbys while scanning the
walsender (receiver) LSN list and comparing each with the target LSN is
O(n). A small refactoring of SyncRepGetOldestSyncRecPtr would be
enough to do that.

What does the target LSN mean here?

Indeed, I haven't thought about that, and that's a no-brainer. That
would remove the need to allocate and sort each array, what is simply
needed is to track the number of times a newest value has been found.
So what this processing would do is updating the write/flush/apply
values for the first k loops if the new value is *older* than the
current one, where k is the quorum number, and between k+1 and N the
value gets updated only if the value compared is newer. No need to
take the mutex lock for a long time as well.

Sorry, I could not understand this algorithm. Could you elaborate on
this? Does it take only O(n) time?

By the way, the patch now
conflicts on HEAD, it needs a refresh.

Thanks, I'll post the latest patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#43

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#42)

Re: Quorum commit for multiple synchronous replication.

On Wed, Dec 7, 2016 at 5:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 7, 2016 at 4:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Indeed, I haven't thought about that, and that's a no-brainer. That
would remove the need to allocate and sort each array, what is simply
needed is to track the number of times a newest value has been found.
So what this processing would do is updating the write/flush/apply
values for the first k loops if the new value is *older* than the
current one, where k is the quorum number, and between k+1 and N the
value gets updated only if the value compared is newer. No need to
take the mutex lock for a long time as well.

Sorry, I could not understand this algorithm. Could you elaborate on
this? Does it take only O(n) time?

Nah, please forget that, that was a random useless thought. There is
no way to be able to select the k-th element without knowing the
hierarchy induced by the others, which is what the partial sort would
help with here.
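
One possible shape for such a partial sort, given only as a hedged sketch: keep just the k newest values seen so far in a small descending buffer, which costs O(n * k) and never orders the rest of the array. `lsn_t`, `MAX_K` and `kth_newest_partial` are hypothetical names chosen for illustration, and the fixed bound on k is an assumption of this sketch, not something the patch imposes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's XLogRecPtr. */
typedef uint64_t lsn_t;

/* Assumed upper bound on the quorum size, for this sketch only. */
#define MAX_K 64

/*
 * Return the k-th newest (1-based) of n LSNs without sorting the input.
 * top[] holds the newest values seen so far, in descending order.
 */
static lsn_t
kth_newest_partial(const lsn_t *lsns, size_t n, size_t k)
{
	lsn_t		top[MAX_K];
	size_t		filled = 0;

	assert(k >= 1 && k <= n && k <= MAX_K);

	for (size_t i = 0; i < n; i++)
	{
		lsn_t		v = lsns[i];
		size_t		pos;

		/* Buffer is full and v is not newer than its oldest entry: skip. */
		if (filled == k && v <= top[k - 1])
			continue;

		/* Shift older entries right; a full buffer drops its last entry. */
		pos = (filled < k) ? filled : k - 1;
		while (pos > 0 && top[pos - 1] < v)
		{
			top[pos] = top[pos - 1];
			pos--;
		}
		top[pos] = v;
		if (filled < k)
			filled++;
	}
	return top[k - 1];
}
```

For small k this does very little work per element, which fits the case where only a handful of standbys form the quorum.
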
--
Michael

#44

Robert Haas

robertmhaas@gmail.com

about 9 years ago

In reply to: Michael Paquier (#39)

Re: Quorum commit for multiple synchronous replication.

On Tue, Dec 6, 2016 at 11:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 7, 2016 at 12:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

So, isn't it better to compare the performance of some algorithms and
confirm which is the best for quorum commit? Since this code is hot, i.e.,
can be very frequently executed, I'd like to avoid waste of cycle as much
as possible.

It seems to me that it would be simple enough to write a script to do
that to avoid any other noise: allocate an array with N random
elements, and fetch the M-th element from it after applying a sort
method. I highly doubt that you'd see much difference with a low
number of elements, now if you scale at a thousand standbys in a
quorum set you may surely see something :*)
Anybody willing to try out?

You could do that, but first I would code up the simplest, cleanest
algorithm you can think of and see if it even shows up in a 'perf'
profile. Microbenchmarking is probably overkill here unless a problem
is visible on macrobenchmarks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#45

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Michael Paquier (#43)

Re: Quorum commit for multiple synchronous replication.

Hello, context switch was complete that time, sorry.

There are multiple "target LSN"s, so we need the kth-largest LSNs.

At Wed, 7 Dec 2016 19:04:23 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqR10OnEL5XxW1DVYvAXmtpEVNCMi=V-6Jb_9owFuY8aSg@mail.gmail.com>

On Wed, Dec 7, 2016 at 5:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sorry, I could not understand this algorithm. Could you elaborate on
this? Does it take only O(n) time?

Nah, please forget that, that was a random useless thought. There is
no way to be able to select the k-th element without knowing the
hierarchy induced by the others, which is what the partial sort would
help with here.

So, let's consider for some cases,

- needing 3-quorum among 5 standbys.

There's no problem whichever kth-largest algorithm we choose.
Of course qsort is fine.

- needing 10 quorums among 100 standbys.

I'm not sure if there's any difference with 3/5.

- needing 10 quorums among 1000 standbys.
Obviously qsort is doing too much. Some kind of kth-largest
algorithm should be involved. For instance quickselect, which is O(n)
on average (O(n^2) in the worst case), seems better in comparison to
the O(n log n) average (O(n^2) worst case) of qsort.

- needing 700 quorums among 1000 standbys.

I don't think this case is worth considering, but kth-largest is
better even for this case.

If we don't consider 700/1000 to be out of at least the current scope,
I also recommend using kth-largest selection.

If not, the quorum calculation is triggered by every standby
reply message, and the frequency of the calculation seems close to
the frequency of WAL insertion in the worst case. (Right?) Even
kth-largest selection takes too long with 1000 standbys.

Maintaining the kth-largest in shared memory needs less CPU but leads
to worse contention on the shared memory.

Conversely, we already have the waiting LSNs of backends in procarray.
Suppose we had another array in the order of waiting LSNs, with
a condition variable on the number of conforming walsenders.
Every walsender could then individually look it up and
count it up. It might perform better, but I'm not sure.
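
For reference, the quickselect variant mentioned above could look roughly like this. It is a sketch under assumptions, not code from the patch: `lsn_t` is a hypothetical stand-in for XLogRecPtr, `quickselect_kth_newest` is a made-up name, and the array is reordered in place. It runs in O(n) on average and O(n^2) in the worst case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's XLogRecPtr. */
typedef uint64_t lsn_t;

static void
swap_lsn(lsn_t *a, lsn_t *b)
{
	lsn_t		tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * Lomuto partition of a[lo..hi] in descending order around the middle
 * element. Values newer than the pivot end up to its left. Returns the
 * pivot's final index.
 */
static size_t
partition_desc(lsn_t *a, size_t lo, size_t hi)
{
	size_t		mid = lo + (hi - lo) / 2;
	size_t		store = lo;
	lsn_t		pivot;

	swap_lsn(&a[mid], &a[hi]);
	pivot = a[hi];
	for (size_t i = lo; i < hi; i++)
	{
		if (a[i] > pivot)
			swap_lsn(&a[i], &a[store++]);
	}
	swap_lsn(&a[store], &a[hi]);
	return store;
}

/* Return the k-th newest (1-based) of n LSNs; reorders the array. */
static lsn_t
quickselect_kth_newest(lsn_t *a, size_t n, size_t k)
{
	size_t		lo;
	size_t		hi;
	size_t		target;

	assert(k >= 1 && k <= n);
	lo = 0;
	hi = n - 1;
	target = k - 1;

	for (;;)
	{
		size_t		p;

		if (lo == hi)
			return a[lo];
		p = partition_desc(a, lo, hi);
		if (p == target)
			return a[p];
		else if (target < p)
			hi = p - 1;			/* the answer lies among the newer values */
		else
			lo = p + 1;			/* the answer lies among the older values */
	}
}
```

Unlike a full sort, each round discards the side of the partition that cannot contain the answer, which is where the average-case saving comes from.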

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#46

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Robert Haas (#44)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

You could do that, but first I would code up the simplest, cleanest
algorithm you can think of and see if it even shows up in a 'perf'
profile. Microbenchmarking is probably overkill here unless a problem
is visible on macrobenchmarks.

This is what I would go for! The current code is doing a simple thing:
select the Nth element using qsort() after scanning each WAL sender's
values. And I think that Sawada-san got it right. Even running on my
laptop a pgbench run with 10 sync standbys using a data set that fits
into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
using perf top on a non-assert, non-debug build. Hash tables and
allocations get a far larger share. Using the patch,
SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
nodes. Let's kick the ball for now. An extra patch could make things
better later on if that's worth it.
--
Michael

#47

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Michael Paquier (#46)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 8, 2016 at 4:39 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

You could do that, but first I would code up the simplest, cleanest
algorithm you can think of and see if it even shows up in a 'perf'
profile. Microbenchmarking is probably overkill here unless a problem
is visible on macrobenchmarks.

This is what I would go for! The current code is doing a simple thing:
select the Nth element using qsort() after scanning each WAL sender's
values. And I think that Sawada-san got it right. Even running on my
laptop a pgbench run with 10 sync standbys using a data set that fits
into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
using perf top on a non-assert, non-debug build. Hash tables and
allocations get a far larger share. Using the patch,
SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
nodes. Let's kick the ball for now. An extra patch could make things
better later on if that's worth it.

Yeah, since both K and N are unlikely to be large, these algorithms
take almost the same time, and the current patch does the simple thing.
If we need over 100 or 1000 replication nodes, the optimization could be
required.
Attached latest v9 patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_quorum_commit_v9.patchtext/x-diff; charset=US-ASCII; name=000_quorum_commit_v9.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0fc4e57..bc67a99 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3054,42 +3054,75 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
+        in this list, and
         that are both currently connected and streaming data in real-time
         (as shown by a state of <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servers.
         </para>
         <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        The keyword <literal>FIRST</>, coupled with <literal>num_sync</>, makes
+        transaction commit wait until WAL records are received from the
+        <literal>num_sync</> standbys with higher priority number.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+        <para>
+        The keyword <literal>ANY</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from at least <literal>num_sync</> connected standbys among those
+        defined in the list of <varname>synchronous_standby_names</>. For
+        example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving WAL records from at least
+        any three standbys of four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.
+        </para>
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive words
+        and standby names that match these words must be double-quoted.
+        </para>
+        <para>
+        The third syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
-       </para>
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
+        </para>
+        <note>
+         <para>
+         If <literal>FIRST</> or <literal>ANY</> are not specified, this parameter
+         behaves as if <literal>ANY</> is used. Note that this grammar is incompatible
+         with <productname>PostgreSQL</> 9.6, which is the first version supporting multiple
+         standbys with synchronous replication and where no such keyword <literal>FIRST</>
+         or <literal>ANY</> can be used. In 9.6 the grammar behaves as if <literal>FIRST</>
+         is used, which is incompatible with the post-9.6 behavior.
+        </para>
+       </note>
        <para>
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6b89507..26e3c4e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1165,6 +1165,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     The synchronous states of standby servers can be viewed using
     the <structname>pg_stat_replication</structname> view.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standbys is:
+<programlisting>
+ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+    <literal>s2</> and <literal>s3</> will be considered as synchronous standby
+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 128ee13..771787d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1437,6 +1437,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
            <literal>sync</>: This standby server is synchronous.
           </para>
          </listitem>
+         <listitem>
+         <para>
+          <literal>quorum</>: This standby is considered as a candidate of quorum commit.
+         </para>
+         </listitem>
        </itemizedlist>
      </entry>
     </row>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..bcc1317 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,19 @@
  *
  * In 9.5 or before only a single standby could be considered as
  * synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, all the standbys appearing in the list are
+ * considered as candidates for quorum commit.
  *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
@@ -73,24 +76,27 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
 static bool announce_next_takeover = true;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +392,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +419,10 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +440,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +476,50 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys according to the synchronous method,
+ * or NIL if no sync standby is connected. The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid synchronization method is specified \"%d\"",
+						SyncRepConfig->sync_method)));
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, we need the oldest of these positions among sync
+ * standbys. In quorum method, we need the N-th newest of these positions,
+ * where N is specified by SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -507,29 +545,74 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 		return false;
 	}
 
-	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
-	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
+	}
+	else /* SYNC_REP_QUORUM */
+	{
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i] = walsnd->flush;
+			apply_array[i] = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th newest Write, Flush, Apply positions
+		 * specified by SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
 	}
 
 	list_free(sync_standbys);
@@ -537,17 +620,66 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
  *
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
  * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +692,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -892,6 +1018,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,14 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
+{first_ident}	{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index aa42d59..28c3eba 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2862,12 +2862,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 
 			/*
 			 * More easily understood version of standby state. This is purely
-			 * informational, not different from priority.
+			 * informational. In quorum method, since all standbys are considered as
+			 * a candidate of quorum commit standby state is  always 'quorum'.
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 46eecbf..3ad8186 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2637,16 +2637,8 @@ mergeruns(Tuplesortstate *state)
 	}
 
 	/*
-	 * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
-	 * from each input tape.
-	 */
-	state->memtupsize = numInputTapes;
-	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
-	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
-	/*
-	 * Use all the remaining memory we have available for read buffers among
-	 * the input tapes.
+	 * Use all the spare memory we have available for read buffers among the
+	 * input tapes.
 	 *
 	 * We do this only after checking for the case that we produced only one
 	 * initial run, because there is no need to use a large read buffer when
@@ -2669,9 +2661,17 @@ mergeruns(Tuplesortstate *state)
 			 (state->availMem) / 1024, numInputTapes);
 #endif
 
-	state->read_buffer_size = Min(state->availMem / numInputTapes, 0);
+	state->read_buffer_size = state->availMem / numInputTapes;
 	USEMEM(state, state->availMem);
 
+	/*
+	 * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
+	 * from each input tape.
+	 */
+	state->memtupsize = numInputTapes;
+	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewindForRead(state->tapeset, tapenum, state->read_buffer_size);
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..8dd74a3 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int			sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..c502d20 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 11;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'FIRST 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'FIRST 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'FIRST 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'FIRST 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,34 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check that the state of standbys listed as a voter are having
+# same priority when synchronous_standby_names uses quorum method.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'ANY 2(standby1, standby2)');
+
+# Check that state of standbys are not the same as the behaviour of that
+# 'ANY' is specified.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'not specify synchronization method',
+'2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that set setting of 'ANY 2(*)' chooses all standbys as
+# voter.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as candidates for quorum commit',
+'ANY 2(*)');

#48

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#47)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 8, 2016 at 3:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Dec 8, 2016 at 4:39 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

You could do that, but first I would code up the simplest, cleanest
algorithm you can think of and see if it even shows up in a 'perf'
profile. Microbenchmarking is probably overkill here unless a problem
is visible on macrobenchmarks.

This is what I would go for! The current code is doing a simple thing:
select the Nth element using qsort() after scanning each WAL sender's
values. And I think that Sawada-san got it right. Even running on my
laptop a pgbench run with 10 sync standbys using a data set that fits
into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
using perf top on a non-assert, non-debug build. Hash tables and
allocations get a far larger share. Using the patch,
SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
nodes. Let's kick the ball for now. An extra patch could make things
better later on if that's worth it.
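
The "select the Nth element using qsort()" approach described above can be sketched in plain C roughly as follows. This is only an illustration under stated assumptions, not the patch's code: `XLogRecPtr` is a stand-in typedef for the server's WAL position type, and `cmp_lsn_desc` and `quorum_lsn` are hypothetical names chosen for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/* qsort comparator that orders LSNs from newest (largest) to oldest. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
	XLogRecPtr	lsn1 = *(const XLogRecPtr *) a;
	XLogRecPtr	lsn2 = *(const XLogRecPtr *) b;

	if (lsn1 > lsn2)
		return -1;
	if (lsn1 < lsn2)
		return 1;
	return 0;
}

/*
 * Return the num_sync-th newest LSN among n standbys, i.e. the newest
 * position that at least num_sync standbys have already reached.
 * The array is sorted in place (hypothetical helper, not from the patch).
 */
static XLogRecPtr
quorum_lsn(XLogRecPtr *lsns, size_t n, size_t num_sync)
{
	assert(num_sync >= 1 && num_sync <= n);
	qsort(lsns, n, sizeof(XLogRecPtr), cmp_lsn_desc);
	return lsns[num_sync - 1];
}
```

With K and N both small, the O(N log N) sort is unlikely to matter, which is consistent with the perf numbers quoted above; a selection algorithm would only become interesting for much larger standby counts.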

Yeah, since both K and N are unlikely to be large, these algorithms
take almost the same time. And the current patch does the simple thing.
If we ever need more than 100 or 1000 replication nodes, the
optimization could be required.
Attached is the latest v9 patch.

Few comments:

1.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.

In the above description, is priority method represented by FIRST and
quorum method by ANY in the synchronous_standby_names syntax? If so,
it might be better to write about it explicitly.

2.
 --- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2637,16 +2637,8 @@ mergeruns(Tuplesortstate *state)
  }

  /*
- * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
- * from each input tape.
- */
- state->memtupsize = numInputTapes;
- state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
- USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
- /*
- * Use all the remaining memory we have available for read buffers among
- * the input tapes.
+ * Use all the spare memory we have available for read buffers among the
+ * input tapes.

This doesn't belong to this patch.

3.
+ * Return the list of sync standbys using according to synchronous method,

In the above sentence, "using according" seems to be either incomplete or incorrect usage.

4.
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ "invalid synchronization method is specified \"%d\"",
+ SyncRepConfig->sync_method));

Here, the error message doesn't seem to be aligned, and you might want to
use errmsg for it.

5.
+ * In priority method, we need the oldest these positions among sync
+ * standbys. In quorum method, we need the newest these positions
+ * specified by SyncRepConfig->num_sync.

/oldest these/oldest of these
/newest these positions specified/newest of these positions as specified

Instead of "newest", can we consider using "latest"?

6.
+ int sync_method; /* synchronization method */
/* member_names contains nmembers consecutive nul-terminated C strings */
char member_names[FLEXIBLE_ARRAY_MEMBER];
} SyncRepConfigData;

Can't we use 1 or 2 bytes to store sync_method information?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#49

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Amit Kapila (#48)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Sat, Dec 10, 2016 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 8, 2016 at 3:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Dec 8, 2016 at 4:39 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

You could do that, but first I would code up the simplest, cleanest
algorithm you can think of and see if it even shows up in a 'perf'
profile. Microbenchmarking is probably overkill here unless a problem
is visible on macrobenchmarks.

This is what I would go for! The current code is doing a simple thing:
select the Nth element using qsort() after scanning each WAL sender's
values. And I think that Sawada-san got it right. Even running on my
laptop a pgbench run with 10 sync standbys using a data set that fits
into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
using perf top on a non-assert, non-debug build. Hash tables and
allocations get a far larger share. Using the patch,
SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
nodes. Let's kick the ball for now. An extra patch could make things
better later on if that's worth it.

Yeah, since both K and N are unlikely to be large, these algorithms
take almost the same time. And the current patch does the simple thing.
If we ever need more than 100 or 1000 replication nodes, the
optimization could be required.
Attached is the latest v9 patch.

Few comments:

Thank you for reviewing.

1.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.

In the above description, is priority method represented by FIRST and
quorum method by ANY in the synchronous_standby_names syntax? If so,
it might be better to write about it explicitly.

Added description.

2.
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2637,16 +2637,8 @@ mergeruns(Tuplesortstate *state)
}

/*
- * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
- * from each input tape.
- */
- state->memtupsize = numInputTapes;
- state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
- USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
- /*
- * Use all the remaining memory we have available for read buffers among
- * the input tapes.
+ * Use all the spare memory we have available for read buffers among the
+ * input tapes.

This doesn't belong to this patch.

Oops, fixed.

3.
+ * Return the list of sync standbys using according to synchronous method,

In the above sentence, "using according" seems to be either incomplete or incorrect usage.

Fixed.

4.
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ "invalid synchronization method is specified \"%d\"",
+ SyncRepConfig->sync_method));
Here, the error message doesn't seem to be aligned, and you might want to
use errmsg for it.

Fixed.

5.
+ * In priority method, we need the oldest these positions among sync
+ * standbys. In quorum method, we need the newest these positions
+ * specified by SyncRepConfig->num_sync.
/oldest these/oldest of these
/newest these positions specified/newest of these positions as specified

Fixed.

Instead of "newest", can we consider using "latest"?

Yes, I changed it accordingly.
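
Comment 5 distinguishes how the two methods pick a position. For the priority method this is the oldest position among the chosen sync standbys, since every one of them must have confirmed it, and a single scan is enough to find it. A minimal sketch, assuming a stand-in `XLogRecPtr` typedef and a hypothetical helper name `oldest_lsn` (neither is taken from the patch):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the server's WAL position type and its invalid value. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Return the oldest (smallest) LSN among n sync standbys, or
 * InvalidXLogRecPtr when the array is empty.  An invalid running
 * result is always replaced by the first value seen.
 */
static XLogRecPtr
oldest_lsn(const XLogRecPtr *lsns, size_t n)
{
	XLogRecPtr	result = InvalidXLogRecPtr;

	for (size_t i = 0; i < n; i++)
	{
		if (result == InvalidXLogRecPtr || lsns[i] < result)
			result = lsns[i];
	}
	return result;
}
```

The quorum method cannot use this scan, because it needs the latest position that num_sync standbys have reached rather than the minimum over all of them.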

6.
+ int sync_method; /* synchronization method */
/* member_names contains nmembers consecutive nul-terminated C strings */
char member_names[FLEXIBLE_ARRAY_MEMBER];
} SyncRepConfigData;

Can't we use 1 or 2 bytes to store sync_method information?

I changed it to uint8.
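
To make the layout question in comment 6 concrete, here is a standalone sketch of a config struct with a one-byte method field followed by a flexible array of packed names. It assumes standard C99 types (`uint8_t` in place of the server's `uint8`) and `malloc` in place of the server's allocator; `SketchSyncRepConfig`, `make_config` and `member_name` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SYNC_REP_PRIORITY	0
#define SYNC_REP_QUORUM		1

/* Illustrative stand-in for SyncRepConfigData; not the patch's definition. */
typedef struct SketchSyncRepConfig
{
	int			config_size;	/* total size of this struct, in bytes */
	int			num_sync;		/* number of standbys to wait for */
	int			nmembers;		/* number of names in member_names */
	uint8_t		sync_method;	/* SYNC_REP_PRIORITY or SYNC_REP_QUORUM */
	char		member_names[];	/* nmembers consecutive nul-terminated strings */
} SketchSyncRepConfig;

/* Build a config in one allocation; returns NULL if malloc fails. */
static SketchSyncRepConfig *
make_config(int num_sync, uint8_t sync_method,
			const char *const *names, int nmembers)
{
	size_t		size = offsetof(SketchSyncRepConfig, member_names);
	SketchSyncRepConfig *cfg;
	char	   *ptr;

	for (int i = 0; i < nmembers; i++)
		size += strlen(names[i]) + 1;

	cfg = malloc(size);
	if (cfg == NULL)
		return NULL;

	cfg->config_size = (int) size;
	cfg->num_sync = num_sync;
	cfg->nmembers = nmembers;
	cfg->sync_method = sync_method;

	ptr = cfg->member_names;
	for (int i = 0; i < nmembers; i++)
	{
		strcpy(ptr, names[i]);
		ptr += strlen(names[i]) + 1;
	}
	return cfg;
}

/* Return the idx-th packed name, or NULL if idx is out of range. */
static const char *
member_name(const SketchSyncRepConfig *cfg, int idx)
{
	const char *ptr = cfg->member_names;

	if (idx < 0 || idx >= cfg->nmembers)
		return NULL;
	while (idx-- > 0)
		ptr += strlen(ptr) + 1;
	return ptr;
}
```

Whether the narrower field actually saves space depends on where it sits: placed directly before the char array it needs no padding, whereas a one-byte field followed by an int would typically be padded back up to the int's alignment.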

Attached is the latest v10 patch, which incorporates the review comments so far.
Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_quorum_commit_v10.patch (text/x-diff; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0fc4e57..bc67a99 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3054,42 +3054,75 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
+        in this list, and
         that are both currently connected and streaming data in real-time
         (as shown by a state of <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
-        <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servres.
         </para>
         <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        The keyword <literal>FIRST</>, coupled with <literal>num_sync</>, makes
+        transaction commit wait until WAL records are received from the
+        <literal>num_sync</> standbys with higher priority number.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        </para>
+        <para>
+        The keyword <literal>ANY</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from at least <literal>num_sync</> connected standbys among those
+        defined in the list of <varname>synchronous_standby_names</>. For
+        example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving WAL records from at least
+        any three standbys of four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>.
+        </para>
+        <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive words
+        and the standby name having these words are must be double-quoted.
+        </para>
+        <para>
+        The third syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
-       </para>
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
+        </para>
+        <note>
+         <para>
+         If <literal>FIRST</> or <literal>ANY</> are not specified, this parameter
+         behaves as if <literal>ANY</> is used. Note that this grammar is incompatible
+         with <productname>PostgresSQL</> 9.6 which is first version supporting multiple
+         standbys with synchronous replication, where no such keyword <literal>FIRST</>
+         or <literal>ANY</> can be used. Note that the grammer behaves as if <literal>FIRST</>
+         is used, which is incompatible with the post-9.6 version behavior.
+        </para>
+       </note>
        <para>
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6b89507..26e3c4e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1165,6 +1165,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     The synchronous states of standby servers can be viewed using
     the <structname>pg_stat_replication</structname> view.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standby is:
+<programlisting>
+ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+    <literal>s2</> and <literal>s3</> will be considered as synchronous standby
+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 128ee13..771787d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1437,6 +1437,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
            <literal>sync</>: This standby server is synchronous.
           </para>
          </listitem>
+         <listitem>
+         <para>
+          <literal>quorum</>: This standby is considered as a candidate of quorum commit.
+         </para>
+         </listitem>
        </itemizedlist>
      </entry>
     </row>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..75649f1 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,21 @@
  *
  * In 9.5 or before only a single standby could be considered as
  * synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are
+ * specified in synchronous_standby_names. The priority method is
+ * represented by FIRST, and the quorum method is represented by ANY
+ * This parameter also specifies a list of standby names, which
+ * determines the priority of each standby for being chosen as a
+ * synchronous standby. In priority method, the standbys whose names
+ * appear earlier in the list are given higher priority and will be
+ * considered as synchronous. Other standby servers appearing later
+ * in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.
  *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
@@ -73,24 +78,27 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
 static bool announce_next_takeover = true;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +394,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +421,10 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +442,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +478,50 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys according to synchronous method, or
+ * reutrn NIL if no sync standby is connected. The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid synchronization method is specified \"%d\"",
+						SyncRepConfig->sync_method)));
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, we need the oldest of these positions among sync
+ * standbys. In quorum method, we need the latest of these positions
+ * as specified by SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -507,29 +547,74 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 		return false;
 	}
 
-	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
-	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
+	}
+	else /* SYNC_REP_QUORUM */
+	{
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i] = walsnd->flush;
+			apply_array[i] = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th latest Write, Flush, Apply positions
+		 * specified by SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
 	}
 
 	list_free(sync_standbys);
@@ -537,17 +622,66 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or return
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
  *
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
  * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +694,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -892,6 +1020,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..403fd7d 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,14 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{
+				yylval.str = pstrdup(yytext);
+				return ANY;
+		}
+{first_ident}	{
+				yylval.str = pstrdup(yytext);
+				return FIRST;
+		}
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index aa42d59..28c3eba 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2862,12 +2862,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 
 			/*
 			 * More easily understood version of standby state. This is purely
-			 * informational, not different from priority.
+			 * informational. In quorum method, since all standbys are considered as
+			 * candidates for quorum commit, the standby state is always 'quorum'.
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..5ceb4b9 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int8		sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..e893ba0 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 11;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'FIRST 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'FIRST 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'FIRST 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'FIRST 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,34 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check the state of standbys listed as voters when the quorum
+# method is used.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'ANY 2(standby1, standby2)');
+
+# Check that the state of standbys when no method is specified is the same
+# as when 'ANY' is specified.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'not specify synchronization method',
+'2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that the setting of 'ANY 2(*)' chooses all standbys as
+# voters.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as candidates for quorum commit',
+'ANY 2(*)');

#50

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#49)

Re: Quorum commit for multiple synchronous replication.

On Mon, Dec 12, 2016 at 9:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Dec 10, 2016 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 8, 2016 at 3:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Dec 8, 2016 at 4:39 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

You could do that, but first I would code up the simplest, cleanest
algorithm you can think of and see if it even shows up in a 'perf'
profile. Microbenchmarking is probably overkill here unless a problem
is visible on macrobenchmarks.

This is what I would go for! The current code is doing a simple thing:
select the Nth element using qsort() after scanning each WAL sender's
values. And I think that Sawada-san got it right. Even running on my
laptop a pgbench run with 10 sync standbys using a data set that fits
into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead
using perf top on a non-assert, non-debug build. Hash tables and
allocations get a far larger share. Using the patch,
SyncRepGetSyncRecPtr is at the same level with a quorum set of 10
nodes. Let's kick the ball for now. An extra patch could make things
better later on if that's worth it.
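
The selection described above can be sketched in isolation. This is a minimal illustration, not the patch's code: `Lsn` stands in for `XLogRecPtr`, and `cmp_lsn_desc`, `nth_latest_lsn` and `oldest_lsn` are hypothetical helper names. It assumes the positions reported by the candidate standbys have already been copied into a plain array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for XLogRecPtr; an assumption made for this sketch. */
typedef uint64_t Lsn;

/* qsort comparator: descending order, so index 0 holds the latest LSN. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
	Lsn		x = *(const Lsn *) a;
	Lsn		y = *(const Lsn *) b;

	if (x > y)
		return -1;
	if (x < y)
		return 1;
	return 0;
}

/*
 * Quorum method: return the n-th latest position, i.e. the newest position
 * that at least n standbys have reached.  Returns 0 if n is out of range or
 * memory is unavailable (a hypothetical error convention for this sketch).
 */
static Lsn
nth_latest_lsn(const Lsn *lsns, int len, int n)
{
	Lsn	   *sorted;
	Lsn		result;

	if (n < 1 || n > len)
		return 0;

	sorted = malloc(sizeof(Lsn) * (size_t) len);
	if (sorted == NULL)
		return 0;
	memcpy(sorted, lsns, sizeof(Lsn) * (size_t) len);
	qsort(sorted, (size_t) len, sizeof(Lsn), cmp_lsn_desc);
	result = sorted[n - 1];
	free(sorted);
	return result;
}

/* Priority method: the oldest position among the chosen sync standbys. */
static Lsn
oldest_lsn(const Lsn *lsns, int len)
{
	Lsn		result;
	int		i;

	assert(len > 0);
	result = lsns[0];
	for (i = 1; i < len; i++)
		if (lsns[i] < result)
			result = lsns[i];
	return result;
}
```

With reported positions {500, 300, 900, 700} and n = 2, the quorum result is 700, because two standbys (at 900 and 700) have reached it; the priority-style minimum over the same array is 300.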

Yeah, since both K and N are unlikely to be large, these algorithms take
almost the same time, and the current patch does the simple thing. When we
need more than 100 or 1000 replication nodes, the optimization could be
required.
Attached latest v9 patch.

Few comments:

Thank you for reviewing.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.
In the above description, is priority method represented by FIRST and
quorum method by ANY in the synchronous_standby_names syntax? If so,
it might be better to write about it explicitly.
Added description.
2.
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2637,16 +2637,8 @@ mergeruns(Tuplesortstate *state)
}
/*
- * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
- * from each input tape.
- */
- state->memtupsize = numInputTapes;
- state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
- USEMEM(state, GetMemoryChunkSpace(state->memtuples));
-
- /*
- * Use all the remaining memory we have available for read buffers among
- * the input tapes.
+ * Use all the spare memory we have available for read buffers among the
+ * input tapes.
This doesn't belong to this patch.
Oops, fixed.

3.
+ * Return the list of sync standbys using according to synchronous method,

In the above sentence, "using according" seems to be either incomplete or wrong usage.

Fixed.
4.
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ "invalid synchronization method is specified \"%d\"",
+ SyncRepConfig->sync_method));
Here, the error message doesn't seem to be aligned, and you might want to
use errmsg for the same.
Fixed.
5.
+ * In priority method, we need the oldest these positions among sync
+ * standbys. In quorum method, we need the newest these positions
+ * specified by SyncRepConfig->num_sync.
/oldest these/oldest of these
/newest these positions specified/newest of these positions as specified
Fixed.

Instead of newest, can we consider using latest?

Yeah, I changed it so.

6.
+ int sync_method; /* synchronization method */
/* member_names contains nmembers consecutive nul-terminated C strings */
char member_names[FLEXIBLE_ARRAY_MEMBER];
} SyncRepConfigData;

Can't we use 1 or 2 bytes to store sync_method information?

I changed it to uint8.
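
To illustrate what the narrower field buys, the sketch below compares the offset of the flexible array member when `sync_method` is an `int` versus a one-byte integer. The struct names are hypothetical, `uint8_t` stands in for PostgreSQL's `uint8`, and the leading `config_size` field is assumed from its use in `create_syncrep_config`; the remaining field order follows the `SyncRepConfigData` hunk in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the struct with a full int for sync_method. */
typedef struct ConfigWithInt
{
	int		config_size;
	int		num_sync;
	int		nmembers;
	int		sync_method;
	char	member_names[];		/* flexible array member */
} ConfigWithInt;

/* Hypothetical copy of the struct with a one-byte sync_method. */
typedef struct ConfigWithByte
{
	int		config_size;
	int		num_sync;
	int		nmembers;
	uint8_t	sync_method;
	char	member_names[];		/* starts right after the single byte */
} ConfigWithByte;

/* Compile-time check: the names start earlier with the one-byte field. */
static_assert(offsetof(ConfigWithByte, member_names) <
			  offsetof(ConfigWithInt, member_names),
			  "one-byte sync_method should move member_names forward");

static size_t
names_offset_with_int(void)
{
	return offsetof(ConfigWithInt, member_names);
}

static size_t
names_offset_with_byte(void)
{
	return offsetof(ConfigWithByte, member_names);
}
```

Because `member_names` is a `char` array with no alignment requirement, it can begin immediately after the one-byte field, so each serialized configuration is a few bytes shorter.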

Attached is the latest v10 patch, which incorporates the review comments so far.
Please review it.

Thanks for updating the patch!

Do we need to update postgresql.conf.sample?

+{any_ident}    {
+                yylval.str = pstrdup(yytext);
+                return ANY;
+        }
+{first_ident}    {
+                yylval.str = pstrdup(yytext);
+                return FIRST;
+        }

Why is pstrdup(yytext) necessary here?

+        standby_list                        { $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+        | NUM '(' standby_list ')'            { $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+        | ANY NUM '(' standby_list ')'        { $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+        | FIRST NUM '(' standby_list ')'    { $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }

Isn't this "partial" backward-compatibility (i.e., "NUM (list)" works
differently from the current version while "list" works in the same way as
the current one) very confusing?

I prefer either of the following:

1. break the backward-compatibility, i.e., treat the first syntax of
"standby_list" as quorum commit
2. keep the backward-compatibility, i.e., treat the second syntax of
"NUM (standby_list)" as sync rep with the priority

+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.

It seems strange to explain the behavior of FIRST before explaining
the syntax of synchronous_standby_names and FIRST.

Regards,

--
Fujii Masao


#51

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Fujii Masao (#50)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Mon, Dec 12, 2016 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Thanks for updating the patch!

Do we need to update postgresql.conf.sample?

Added description to postgresql.conf.sample.

+{any_ident}    {
+                yylval.str = pstrdup(yytext);
+                return ANY;
+        }
+{first_ident}    {
+                yylval.str = pstrdup(yytext);
+                return FIRST;
+        }

Why is pstrdup(yytext) necessary here?

The first whole line was unnecessary actually. Removed.

+        standby_list                        { $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+        | NUM '(' standby_list ')'            { $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+        | ANY NUM '(' standby_list ')'        { $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+        | FIRST NUM '(' standby_list ')'    { $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
Isn't this "partial" backward-compatibility (i.e., "NUM (list)" works
differently from the current version while "list" works in the same way as
the current one) very confusing?

I prefer either of the following:

1. break the backward-compatibility, i.e., treat the first syntax of
"standby_list" as quorum commit
2. keep the backward-compatibility, i.e., treat the second syntax of
"NUM (standby_list)" as sync rep with the priority

There were some comments on this when I proposed the quorum commit. If we
do #1, it breaks backward compatibility with 9.5 and earlier as well, so I
don't think it's a good idea. On the other hand, if we do #2, then the
behaviour of s_s_names is 'NUM (standby_list)' == 'FIRST NUM
(standby_list)'. But that would not be what most users want, and it
would confuse users of future versions who want to use quorum commit.
Since many hackers thought that the sensible default behaviour is
'NUM (standby_list)' == 'ANY NUM (standby_list)', the current patch
changes the behaviour of s_s_names and documents that change thoroughly.
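
The mapping from syntax to method chosen in this version of the patch can be sketched as a small classifier. This is only an illustration of the rule as described in this thread (v11 patch), not the bison grammar itself and not a statement about any released behaviour; `classify_setting`, `match_keyword`, `SettingHead` and the `METHOD_*` names are hypothetical, and the standby list itself is not parsed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef enum
{
	METHOD_PRIORITY = 0,		/* FIRST, or a bare standby list */
	METHOD_QUORUM = 1			/* ANY, or "NUM (list)" without a keyword */
} SyncMethod;

typedef struct SettingHead
{
	SyncMethod	method;
	int			num_sync;
} SettingHead;

/*
 * If *p starts with keyword kw (given in lower case, compared
 * case-insensitively) followed by whitespace, advance *p past the keyword
 * and return 1.  Otherwise leave *p alone and return 0.
 */
static int
match_keyword(const char **p, const char *kw)
{
	size_t		n = strlen(kw);
	size_t		i;

	for (i = 0; i < n; i++)
		if (tolower((unsigned char) (*p)[i]) != kw[i])
			return 0;
	if (!isspace((unsigned char) (*p)[n]))
		return 0;
	*p += n;
	return 1;
}

/* Classify the leading part of a synchronous_standby_names value. */
static SettingHead
classify_setting(const char *s)
{
	SettingHead head = {METHOD_PRIORITY, 1};	/* bare list: "s1, s2" */
	const char *p = s;
	char	   *end;
	long		n;
	int			is_first;

	assert(s != NULL);

	while (isspace((unsigned char) *p))
		p++;
	is_first = match_keyword(&p, "first");
	if (!is_first)
		(void) match_keyword(&p, "any");
	while (isspace((unsigned char) *p))
		p++;

	/* Without "NUM (" this is the bare list form. */
	if (!isdigit((unsigned char) *p))
		return head;
	n = strtol(p, &end, 10);
	while (isspace((unsigned char) *end))
		end++;
	if (*end != '(')
		return head;

	head.num_sync = (int) n;
	/* Under the v11 patch, a missing keyword means ANY (quorum). */
	head.method = is_first ? METHOD_PRIORITY : METHOD_QUORUM;
	return head;
}
```

Under these assumptions, `2 (s1, s2, s3)` and `ANY 2 (s1, s2, s3)` both classify as quorum with `num_sync` 2, `FIRST 2 (s1, s2, s3)` as priority with `num_sync` 2, and a bare `s1, s2` as priority with `num_sync` 1.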

+        <literal>pg_stat_replication</></link> view). If the keyword
+        <literal>FIRST</> is specified, other standby servers appearing
+        later in this list represent potential synchronous standbys.
+        If any of the current synchronous standbys disconnects for
+        whatever reason, it will be replaced immediately with the
+        next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.

It seems strange to explain the behavior of FIRST before explaining
the syntax of synchronous_standby_names and FIRST.

Updated document.

Attached latest v11 patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_quorum_commit_v11.patch (application/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0fc4e57..8bc8cd9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3054,42 +3054,77 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
+        in this list, and
         that are both currently connected and streaming data in real-time
         (as shown by a state of <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
         <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[ANY] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+FIRST <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
-        </para>
-        <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servers.
+       </para>
+       <para>
+        The keyword <literal>FIRST</>, coupled with <literal>num_sync</>, makes
+        transaction commit wait until WAL records are received from the
+        <literal>num_sync</> standbys with higher priority number.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        The other standby servers appearing later in the list represent potential
+        synchronous standbys. If any of the current synchronous standbys
+        disconnects for whatever reason, it will be replaced immediately
+        with the next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability.
+       </para>
+       <para>
+        The keyword <literal>ANY</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from at least <literal>num_sync</> connected standbys among those
+        defined in the list of <varname>synchronous_standby_names</>. For
+        example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving WAL records from at least
+        any three standbys of four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>. The transaction
+        can continue to proceed as long as <literal>num_sync</> standbys
+        live. Specifying more than one standby name can allow very high
+        availability.
+       </para>
+       <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive keywords,
+        and a standby name that matches either of them must be double-quoted.
+       </para>
+       <para>
+        The third syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
        </para>
+       <note>
+         <para>
+         If neither <literal>FIRST</> nor <literal>ANY</> is specified, this parameter
+         behaves as if <literal>ANY</> were used. Note that this grammar is incompatible
+         with <productname>PostgreSQL</> 9.6, the first version to support multiple
+         synchronous standbys, where neither keyword <literal>FIRST</> nor
+         <literal>ANY</> can be used. In 9.6 the grammar behaves as if <literal>FIRST</>
+         were used, which is incompatible with the behavior of later versions.
+        </para>
+       </note>
        <para>
         The name of a standby server for this purpose is the
         <varname>application_name</> setting of the standby, as set in the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6b89507..26e3c4e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1165,6 +1165,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     The synchronous states of standby servers can be viewed using
     the <structname>pg_stat_replication</structname> view.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standbys is:
+<programlisting>
+ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+    <literal>s2</> and <literal>s3</> will be considered as synchronous standby
+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 128ee13..771787d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1437,6 +1437,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
            <literal>sync</>: This standby server is synchronous.
           </para>
          </listitem>
+         <listitem>
+         <para>
+          <literal>quorum</>: This standby server is a candidate for quorum commit.
+         </para>
+         </listitem>
        </itemizedlist>
      </entry>
     </row>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..75649f1 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,21 @@
  *
  * In 9.5 or before only a single standby could be considered as
  * synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are
+ * specified in synchronous_standby_names. The priority method is
+ * represented by FIRST, and the quorum method is represented by ANY
+ * This parameter also specifies a list of standby names, which
+ * determines the priority of each standby for being chosen as a
+ * synchronous standby. In priority method, the standbys whose names
+ * appear earlier in the list are given higher priority and will be
+ * considered as synchronous. Other standby servers appearing later
+ * in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.
  *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
@@ -73,24 +78,27 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
 static bool announce_next_takeover = true;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +394,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +421,10 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using the configured method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +442,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +478,50 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys according to synchronous method, or
+ * return NIL if no sync standby is connected. The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid synchronization method is specified \"%d\"",
+						SyncRepConfig->sync_method)));
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, we need the oldest of these positions among sync
+ * standbys. In quorum method, we need the N-th latest of these positions,
+ * where N is SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -507,29 +547,74 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 		return false;
 	}
 
-	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
-	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
+	}
+	else /* SYNC_REP_QUORUM */
+	{
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i] = walsnd->flush;
+			apply_array[i] = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th latest Write, Flush, Apply positions
+		 * specified by SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
 	}
 
 	list_free(sync_standbys);
@@ -537,17 +622,66 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or return
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
  *
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
  * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +694,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -892,6 +1020,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..e10be8b 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, int sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -77,7 +79,7 @@ standby_name:
 
 
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, int sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +100,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..c08e95b 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,8 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{ return ANY; }
+{first_ident}	{ return FIRST; }
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b14d821..83f4e7c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2868,12 +2868,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 
 			/*
 			 * More easily understood version of standby state. This is purely
-			 * informational, not different from priority.
+			 * informational. In quorum method, since all standbys are considered
+			 * candidates for quorum commit, the standby state is always 'quorum'.
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7f9acfd..b332247 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -245,7 +245,8 @@
 # These settings are ignored on a standby server.
 
 #synchronous_standby_names = ''	# standby servers that provide sync rep
-				# number of sync standbys and comma-separated list of application_name
+				# synchronization method, number of sync standbys
+				# and comma-separated list of application_name
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..5ceb4b9 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	int8		sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..e893ba0 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 11;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -107,7 +107,7 @@ test_sync_state(
 	$node_master, qq(standby2|2|sync
 standby3|3|sync),
 	'2 synchronous standbys',
-	'2(standby1,standby2,standby3)');
+	'FIRST 2(standby1,standby2,standby3)');
 
 # Start standby1
 $node_standby_1->start;
@@ -138,7 +138,7 @@ standby2|4|sync
 standby3|3|sync
 standby4|1|sync),
 	'num_sync exceeds the num of potential sync standbys',
-	'6(standby4,standby0,standby3,standby2)');
+	'FIRST 6(standby4,standby0,standby3,standby2)');
 
 # The setting that * comes before another standby name is acceptable
 # but does not make sense in most cases. Check that sync_state is
@@ -150,7 +150,7 @@ standby2|2|sync
 standby3|2|potential
 standby4|2|potential),
 	'asterisk comes before another standby name',
-	'2(standby1,*,standby2)');
+	'FIRST 2(standby1,*,standby2)');
 
 # Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
 # earlier in WalSnd array as sync standbys.
@@ -160,7 +160,7 @@ standby2|1|sync
 standby3|1|sync
 standby4|1|potential),
 	'multiple standbys having the same priority are chosen as sync',
-	'2(*)');
+	'FIRST 2(*)');
 
 # Stop Standby3 which is considered in 'sync' state.
 $node_standby_3->stop;
@@ -172,3 +172,34 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check the state of standbys listed as voters when the quorum
+# method is used.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'ANY 2(standby1, standby2)');
+
+# Check that the state of standbys is the same as when 'ANY' is
+# specified if no synchronization method is given.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'not specify synchronization method',
+'2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that the setting of 'ANY 2(*)' chooses all standbys as
+# voters.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as candidates for quorum commit',
+'ANY 2(*)');
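
As a reading aid for the quorum branch of SyncRepGetSyncRecPtr() in the patch above, the position calculation can be sketched outside the server roughly as follows. This is only an illustration under assumptions: lsn_t stands in for XLogRecPtr, and quorum_lsn() and oldest_lsn() are made-up helper names, not functions from the patch. The idea is the same as in the patch: sort the reported positions in descending order and take the num_sync-th entry, which is the newest position that at least num_sync standbys have reached.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t lsn_t;

/* Comparator for qsort(): orders LSNs from newest to oldest. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
	lsn_t		lsn1 = *(const lsn_t *) a;
	lsn_t		lsn2 = *(const lsn_t *) b;

	if (lsn1 > lsn2)
		return -1;
	if (lsn1 < lsn2)
		return 1;
	return 0;
}

/*
 * Quorum method (hypothetical helper): return the newest position that at
 * least num_sync of the len standbys have reached. Sorts lsns in place.
 */
static lsn_t
quorum_lsn(lsn_t *lsns, int len, int num_sync)
{
	assert(num_sync >= 1 && num_sync <= len);
	qsort(lsns, (size_t) len, sizeof(lsn_t), cmp_lsn_desc);
	return lsns[num_sync - 1];
}

/*
 * Priority method (hypothetical helper): return the oldest position among
 * the chosen sync standbys.
 */
static lsn_t
oldest_lsn(const lsn_t *lsns, int len)
{
	lsn_t		oldest;
	int			i;

	assert(len >= 1);
	oldest = lsns[0];
	for (i = 1; i < len; i++)
	{
		if (lsns[i] < oldest)
			oldest = lsns[i];
	}
	return oldest;
}
```

For example, with reported positions 100, 400, 250 and 300 and num_sync = 2, quorum_lsn() yields 300, because two standbys (at 400 and 300) have reached that position, while oldest_lsn() over the same four values yields 100.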

#52

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#51)

Re: Quorum commit for multiple synchronous replication.

On Mon, Dec 12, 2016 at 9:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Dec 12, 2016 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Dec 12, 2016 at 9:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Dec 10, 2016 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments:

Thank you for reviewing.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.
In the above description, is priority method represented by FIRST and
quorum method by ANY in the synchronous_standby_names syntax? If so,
it might be better to write about it explicitly.
Added description.

+ * specified in synchronous_standby_names. The priority method is
+ * represented by FIRST, and the quorum method is represented by ANY

Full stop is missing after "ANY".

6.
+ int sync_method; /* synchronization method */
/* member_names contains nmembers consecutive nul-terminated C strings */
char member_names[FLEXIBLE_ARRAY_MEMBER];
} SyncRepConfigData;

Can't we use 1 or 2 bytes to store sync_method information?

I changed it to uint8.

+ int8 sync_method; /* synchronization method */

I changed it to uint8.

In mail, you have mentioned uint8, but in the code it is int8. I
think you want to update code to use uint8.

+        standby_list                        { $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+        | NUM '(' standby_list ')'          { $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+        | ANY NUM '(' standby_list ')'      { $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+        | FIRST NUM '(' standby_list ')'    { $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
Isn't this "partial" backward-compatibility (i.e., "NUM (list)" works
differently from the current version while "list" works in the same way as
the current one) very confusing?

I prefer either of

1. break the backward-compatibility, i.e., treat the first syntax of
"standby_list" as quorum commit
2. keep the backward-compatibility, i.e., treat the second syntax of
"NUM (standby_list)" as sync rep with the priority

+1.

There were some comments when I proposed the quorum commit. If we do
#1 it breaks the backward-compatibility with 9.5 or before as well. I
don't think it's a good idea. On the other hand, if we do #2 then the
behaviour of s_s_names is 'NUM (standby_list)' == 'FIRST NUM
(standby_list)'. But that would not be what most users want and
would confuse users of future versions who want to use
quorum commit. Since many hackers thought that the sensible default
behaviour is 'NUM (standby_list)' == 'ANY NUM (standby_list)', the
current patch chose to change the behaviour of s_s_names and document
that change thoroughly.

Your arguments are sensible, but I think we should address the point
of confusion raised by Fujii-san. As a developer, I feel breaking
backward compatibility (going with Option-1 mentioned above) here is a
good move as it can avoid confusion in the future. However, I know that
users are often so accustomed to the current usage that they feel
irritated by a change in behavior even if it is for their betterment,
so it is advisable to do so only if it is necessary or we have
feedback from a couple of users. So in this case, if we don't want to
go with Option-1, then I think we should go with Option-2. If we go
with Option-2, we can always come back later and switch to the
behavior that is more sensible for the future, after feedback from users.
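
To make the candidates concrete, the mapping from each synchronous_standby_names form to a synchronization method can be sketched as below. This is a rough illustration only: the enum and function names are invented here and do not appear in the patch. "Patch as posted" follows the grammar quoted above (bare list is priority, NUM (list) is quorum), Option-1 treats both keyword-less forms as quorum, and Option-2 treats both as priority.

```c
#include <assert.h>

/* All names below are hypothetical, for illustration only. */
typedef enum
{
	METHOD_PRIORITY,			/* what FIRST selects */
	METHOD_QUORUM				/* what ANY selects */
} sync_method_t;

typedef enum
{
	FORM_BARE_LIST,				/* 's1, s2' */
	FORM_NUM_LIST,				/* '2 (s1, s2)' */
	FORM_ANY_NUM_LIST,			/* 'ANY 2 (s1, s2)' */
	FORM_FIRST_NUM_LIST			/* 'FIRST 2 (s1, s2)' */
} syntax_form_t;

typedef enum
{
	POLICY_PATCH_AS_POSTED,		/* bare list = priority, NUM (list) = quorum */
	POLICY_OPTION_1,			/* both keyword-less forms = quorum */
	POLICY_OPTION_2				/* both keyword-less forms = priority */
} policy_t;

/* Return the method a given syntax form would select under a policy. */
static sync_method_t
method_for(syntax_form_t form, policy_t policy)
{
	switch (form)
	{
		case FORM_ANY_NUM_LIST:
			return METHOD_QUORUM;
		case FORM_FIRST_NUM_LIST:
			return METHOD_PRIORITY;
		case FORM_BARE_LIST:
			return (policy == POLICY_OPTION_1) ? METHOD_QUORUM : METHOD_PRIORITY;
		case FORM_NUM_LIST:
			return (policy == POLICY_OPTION_2) ? METHOD_PRIORITY : METHOD_QUORUM;
	}

	assert(0 && "unknown syntax form");
	return METHOD_PRIORITY;
}
```

Only the two keyword-less forms differ between the policies. The explicit ANY and FIRST forms mean the same thing in all three, which is why the discussion is only about what the older syntaxes should default to.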

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Amit Kapila (#52)

Re: Quorum commit for multiple synchronous replication.

At Tue, 13 Dec 2016 08:46:06 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1JoheFzO1rrKm391wJDepFvZr1geRh4ZJ_9iC4FOX+3Uw@mail.gmail.com>

On Mon, Dec 12, 2016 at 9:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 12, 2016 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Dec 12, 2016 at 9:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Dec 10, 2016 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments:

Thank you for reviewing.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.
In the above description, is priority method represented by FIRST and
quorum method by ANY in the synchronous_standby_names syntax? If so,
it might be better to write about it explicitly.
Added description.
+ * specified in synchronous_standby_names. The priority method is
+ * represented by FIRST, and the quorum method is represented by ANY
Full stop is missing after "ANY".

6.
+ int sync_method; /* synchronization method */
/* member_names contains nmembers consecutive nul-terminated C strings */
char member_names[FLEXIBLE_ARRAY_MEMBER];
} SyncRepConfigData;

Can't we use 1 or 2 bytes to store sync_method information?

I changed it to uint8.

+ int8 sync_method; /* synchronization method */

I changed it to uint8.

In mail, you have mentioned uint8, but in the code it is int8. I
think you want to update code to use uint8.
+        standby_list                        { $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+        | NUM '(' standby_list ')'          { $$ = create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+        | ANY NUM '(' standby_list ')'      { $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+        | FIRST NUM '(' standby_list ')'    { $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
Isn't this "partial" backward-compatibility (i.e., "NUM (list)" works
differently from the current version while "list" works in the same way as
the current one) very confusing?

I prefer either of

1. break the backward-compatibility, i.e., treat the first syntax of
"standby_list" as quorum commit
2. keep the backward-compatibility, i.e., treat the second syntax of
"NUM (standby_list)" as sync rep with the priority
+1.

There were some comments when I proposed the quorum commit. If we do
#1 it breaks the backward-compatibility with 9.5 or before as well. I
don't think it's a good idea. On the other hand, if we do #2 then the
behaviour of s_s_names is 'NUM (standby_list)' == 'FIRST NUM
(standby_list)'. But that would not be what most users want and
would confuse users of future versions who want to use
quorum commit. Since many hackers thought that the sensible default
behaviour is 'NUM (standby_list)' == 'ANY NUM (standby_list)', the
current patch chose to change the behaviour of s_s_names and document
that change thoroughly.

Your arguments are sensible, but I think we should address the point
of confusion raised by Fujii-san. As a developer, I feel breaking
backward compatibility (go with Option-1 mentioned above) here is a
good move as it can avoid confusion in future. However, I know that
users are often so accustomed to the current usage that they feel
irritated by a change in behavior even if it is for their betterment,
so it is advisable to do so only if it is necessary or we have
feedback from a couple of users. So in this case, if we don't want to
go with Option-1, then I think we should go with Option-2. If we go
with Option-2, then we can always come back and change the behavior
to whatever is more sensible for the future after feedback from users.

This implicitly assumes that the replication configuration
is defined by s_s_names. But in the past discussion, some people
suggested that quorum commit should be configured by another GUC
variable, and I think now is the time to do this.

So, we have the third option that would be like the following.

- s_s_names is restored to work the way it did in 9.5, or maybe
the same as 9.6. Or immediately remove it! My inclination
is to do this.

- a new GUC variable such as "quorum_commit_standbys" (which
is exclusive to s_s_names) is defined for the new version of the
quorum commit feature. The option-1 syntax except the "standby_list"
format is usable in this.

This doesn't puzzle users who don't read release notes carefully
(ME?). Keeping s_s_names can save some such users, but I don't
think it is required in Pg10.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#54

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#53)

Re: Quorum commit for multiple synchronous replication.

On Tue, Dec 13, 2016 at 5:06 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Tue, 13 Dec 2016 08:46:06 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1JoheFzO1rrKm391wJDepFvZr1geRh4ZJ_9iC4FOX+3Uw@mail.gmail.com>
On Mon, Dec 12, 2016 at 9:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 12, 2016 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Dec 12, 2016 at 9:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Dec 10, 2016 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Few comments:

Thank you for reviewing.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are specified
+ * in synchronous_standby_names. This parameter also specifies a list
+ * of standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the standbys
+ * whose names appear earlier in the list are given higher priority
+ * and will be considered as synchronous. Other standby servers appearing
+ * later in this list represent potential synchronous standbys. If any of
+ * the current synchronous standbys disconnects for whatever reason,
+ * it will be replaced immediately with the next-highest-priority standby.
+ * In quorum method, the all standbys appearing in the list are
+ * considered as a candidate for quorum commit.
In the above description, is priority method represented by FIRST and
quorum method by ANY in the synchronous_standby_names syntax? If so,
it might be better to write about it explicitly.
Added description.
+ * specified in synchronous_standby_names. The priority method is
+ * represented by FIRST, and the quorum method is represented by ANY
Full stop is missing after "ANY".

6.
+ int sync_method; /* synchronization method */
/* member_names contains nmembers consecutive nul-terminated C strings */
char member_names[FLEXIBLE_ARRAY_MEMBER];
} SyncRepConfigData;

Can't we use 1 or 2 bytes to store sync_method information?

I changed it to uint8.

+ int8 sync_method; /* synchronization method */

I changed it to uint8.

In mail, you have mentioned uint8, but in the code it is int8. I
think you want to update code to use uint8.
+        standby_list                        { $$ =
create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+        | NUM '(' standby_list ')'            { $$ =
create_syncrep_config($1, $3, SYNC_REP_QUORUM); }
+        | ANY NUM '(' standby_list ')'        { $$ =
create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+        | FIRST NUM '(' standby_list ')'    { $$ =
create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
Isn't this "partial" backward-compatibility (i.e., "NUM (list)" works
differently from the current version while "list" works in the same way as
the current one) very confusing?

I prefer either of

1. break the backward-compatibility, i.e., treat the first syntax of
"standby_list" as quorum commit
2. keep the backward-compatibility, i.e., treat the second syntax of
"NUM (standby_list)" as sync rep with the priority
+1.

There were some comments when I proposed the quorum commit. If we do
#1 it breaks the backward-compatibility with 9.5 or before as well. I
don't think it's a good idea. On the other hand, if we do #2 then the
behaviour of s_s_names is 'NUM (standby_list)' == 'FIRST NUM
(standby_list)'. But that would not be what most users want, and it
would confuse users of future versions who want to use quorum
commit. Since many hackers thought that the sensible default
behaviour is 'NUM (standby_list)' == 'ANY NUM (standby_list)', the
current patch chose to change the behaviour of s_s_names and document
that change thoroughly.

Your arguments are sensible, but I think we should address the point
of confusion raised by Fujii-san. As a developer, I feel breaking
backward compatibility (go with Option-1 mentioned above) here is a
good move as it can avoid confusion in future. However, I know that
users are often so accustomed to the current usage that they feel
irritated by a change in behavior even if it is for their betterment,
so it is advisable to do so only if it is necessary or we have
feedback from a couple of users. So in this case, if we don't want to
go with Option-1, then I think we should go with Option-2. If we go
with Option-2, then we can always come back and change the behavior
to whatever is more sensible for the future after feedback from users.
This implicitly assumes that the replication configuration
is defined by s_s_names. But in the past discussion, some people
suggested that quorum commit should be configured by another GUC
variable, and I think now is the time to do this.

So, we have the third option that would be like the following.

- s_s_names is restored to work the way it did in 9.5, or maybe
the same as 9.6. Or immediately remove it! My inclination
is to do this.

- a new GUC variable such as "quorum_commit_standbys" (which
is exclusive to s_s_names) is defined for the new version of the
quorum commit feature. The option-1 syntax except the "standby_list"
format is usable in this.

If we drop the "standby_list" syntax, I don't think that new parameter is
necessary. We can keep s_s_names and just drop the support for that syntax
from s_s_names. This may be ok if we're really in "break all the things" mode
for PostgreSQL 10.

Regards,

--
Fujii Masao

#55

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Fujii Masao (#54)

Re: Quorum commit for multiple synchronous replication.

On Wed, Dec 14, 2016 at 11:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

If we drop the "standby_list" syntax, I don't think that new parameter is
necessary. We can keep s_s_names and just drop the support for that syntax
from s_s_names. This may be ok if we're really in "break all the things" mode
for PostgreSQL 10.

Please let's not raise that as an argument again... And not break the
s_list argument. Many users depend on that for just single sync
standbys. FWIW, I'd be in favor of backward compatibility and say that
a standby list is a priority list if we can maintain that. Upthread
agreement was to break that, I did not insist further, and won't if
that's still the feeling.
--
Michael

#56

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Michael Paquier (#55)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 15, 2016 at 6:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 14, 2016 at 11:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

If we drop the "standby_list" syntax, I don't think that new parameter is
necessary. We can keep s_s_names and just drop the support for that syntax
from s_s_names. This may be ok if we're really in "break all the things" mode
for PostgreSQL 10.

Please let's not raise that as an argument again... And not break the
s_list argument. Many users depend on that for just single sync
standbys. FWIW, I'd be in favor of backward compatibility and say that
a standby list is a priority list if we can maintain that. Upthread
agreement was to break that, I did not insist further, and won't if
that's still the feeling.

I wonder why you think that the backward-compatibility for standby_list is
so "special". We renamed the pg_xlog directory to pg_wal and are planning to
change the recovery.conf API entirely, though those have bigger impacts on
existing users in terms of backward compatibility. OTOH, so far,
GUC changes between major releases have happened several times.

But I'm not against keeping the backward compatibility for standby_list,
to be honest. My concern is that the latest patch tries to support
backward compatibility only "partially", which would be confusing to users,
as I said upthread.

So I'd like to propose keeping full backward compatibility for s_s_names
(i.e., both "standby_list" and "N (standby_list)" mean the priority method)
in the first commit, then continuing the discussion and changing it if we
reach a consensus before PostgreSQL 10 is actually released. Thoughts?
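
As a minimal sketch of the fully backward-compatible mapping proposed here (this is not code from the patch; `SyncMethod`, `has_keyword` and `classify_sync_method` are hypothetical names used only for illustration), only a leading ANY keyword selects the quorum method, while "standby_list", "N (standby_list)" and "FIRST N (standby_list)" all select the priority method. The sketch assumes keywords are case-insensitive and followed by whitespace, and it ignores double-quoted standby names:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not the patch's identifiers. */
typedef enum { METHOD_PRIORITY, METHOD_QUORUM } SyncMethod;

/* Return nonzero if s starts with the upper-case keyword kw
 * (compared case-insensitively) followed by whitespace. */
static int
has_keyword(const char *s, const char *kw)
{
    size_t i;

    for (i = 0; kw[i] != '\0'; i++)
    {
        /* Stops at the end of s too, since '\0' never matches kw[i]. */
        if (toupper((unsigned char) s[i]) != kw[i])
            return 0;
    }
    return isspace((unsigned char) s[i]) != 0;
}

/* Classify a synchronous_standby_names value under the proposal:
 * "ANY N (...)" means quorum; every other form means priority. */
SyncMethod
classify_sync_method(const char *value)
{
    assert(value != NULL);

    /* Skip leading whitespace. */
    while (isspace((unsigned char) *value))
        value++;

    if (has_keyword(value, "ANY"))
        return METHOD_QUORUM;

    /* "FIRST N (...)", "N (...)" and a bare list all mean priority. */
    return METHOD_PRIORITY;
}
```

Under this mapping an existing 9.5 or 9.6 setting keeps its meaning after an upgrade, and quorum commit is strictly opt-in via the new keyword.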

Regards,

--
Fujii Masao

#57

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Fujii Masao (#56)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 15, 2016 at 11:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 15, 2016 at 6:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 14, 2016 at 11:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

If we drop the "standby_list" syntax, I don't think that new parameter is
necessary. We can keep s_s_names and just drop the support for that syntax
from s_s_names. This may be ok if we're really in "break all the things" mode
for PostgreSQL 10.

Please let's not raise that as an argument again... And not break the
s_list argument. Many users depend on that for just single sync
standbys. FWIW, I'd be in favor of backward compatibility and say that
a standby list is a priority list if we can maintain that. Upthread
agreement was to break that, I did not insist further, and won't if
that's still the feeling.

I wonder why you think that the backward-compatibility for standby_list is
so "special". We renamed the pg_xlog directory to pg_wal and are planning to
change the recovery.conf API entirely, though those have bigger impacts on
existing users in terms of backward compatibility. OTOH, so far,
GUC changes between major releases have happened several times.

Silent failures for existing failover deployments are a pain to solve
after doing upgrades. That's my only concern. Changing pg_wal would
result in a hard failure when upgrading. And changing the meaning of
the standby list (without keyword ANY or FIRST!) does not fall into
that category... So yes just removing support for standby list would
result in a hard failure, which would be fine for the
let-s-break-all-things move.
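
The distinction drawn above can be sketched as follows. This is only an illustration with hypothetical names (`BareListPolicy`, `interpret_bare_list`, `is_silent_behavior_change`), not anything from the patch: a bare list that is still accepted but now means something else is a silent change, while a bare list that is rejected outright is a hard failure that an administrator notices at upgrade time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical, chosen for this sketch only. */
typedef enum { METHOD_PRIORITY, METHOD_QUORUM } SyncMethod;

typedef enum
{
    KEEP_AS_PRIORITY,       /* bare list keeps its pre-10 meaning */
    REINTERPRET_AS_QUORUM,  /* bare list is accepted but now means quorum */
    REJECT_BARE_LIST        /* bare list becomes a configuration error */
} BareListPolicy;

/* Interpret a bare "s1, s2" list under the given policy.
 * Returns true and sets *method if the value is accepted,
 * false if the configuration would be refused. */
bool
interpret_bare_list(BareListPolicy policy, SyncMethod *method)
{
    assert(method != NULL);

    switch (policy)
    {
        case KEEP_AS_PRIORITY:
            *method = METHOD_PRIORITY;
            return true;
        case REINTERPRET_AS_QUORUM:
            *method = METHOD_QUORUM;
            return true;
        case REJECT_BARE_LIST:
            return false;
    }
    return false;
}

/* A silent change: the old setting is accepted, but it no longer
 * selects the priority method that earlier releases used. */
bool
is_silent_behavior_change(BareListPolicy policy)
{
    SyncMethod method;

    return interpret_bare_list(policy, &method) &&
           method != METHOD_PRIORITY;
}
```

Only the reinterpreting policy is flagged as silent; keeping the old meaning changes nothing, and rejecting the value fails loudly.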

But I'm not against keeping the backward compatibility for standby_list,
to be honest. My concern is that the latest patch tries to support
backward compatibility only "partially", which would be confusing to users,
as I said upthread.

If we try to support backward compatibility, I'd personally do it
fully, and have a list of standby names specified meaning a priority
list.

So I'd like to propose keeping full backward compatibility for s_s_names
(i.e., both "standby_list" and "N (standby_list)" mean the priority method)
in the first commit, then continuing the discussion and changing it if we
reach a consensus before PostgreSQL 10 is actually released. Thoughts?

+1 on that.
-- 
Michael

#58

Amit Kapila

amit.kapila16@gmail.com

about 9 years ago

In reply to: Michael Paquier (#57)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 15, 2016 at 7:53 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Dec 15, 2016 at 11:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 15, 2016 at 6:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

So I'd like to propose keeping full backward compatibility for s_s_names
(i.e., both "standby_list" and "N (standby_list)" mean the priority method)
in the first commit, then continuing the discussion and changing it if we
reach a consensus before PostgreSQL 10 is actually released. Thoughts?

+1 on that.

+1. That is the safest option to proceed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#59

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Michael Paquier (#57)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 15, 2016 at 11:23 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Dec 15, 2016 at 11:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 15, 2016 at 6:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 14, 2016 at 11:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

If we drop the "standby_list" syntax, I don't think that new parameter is
necessary. We can keep s_s_names and just drop the support for that syntax
from s_s_names. This may be ok if we're really in "break all the things" mode
for PostgreSQL 10.

Please let's not raise that as an argument again... And not break the
s_list argument. Many users depend on that for just single sync
standbys. FWIW, I'd be in favor of backward compatibility and say that
a standby list is a priority list if we can maintain that. Upthread
agreement was to break that, I did not insist further, and won't if
that's still the feeling.

I wonder why you think that the backward-compatibility for standby_list is
so "special". We renamed the pg_xlog directory to pg_wal and are planning to
change the recovery.conf API entirely, though those have bigger impacts on
existing users in terms of backward compatibility. OTOH, so far,
GUC changes between major releases have happened several times.

Silent failures for existing failover deployments are a pain to solve
after doing upgrades. That's my only concern. Changing pg_wal would
result in a hard failure when upgrading. And changing the meaning of
the standby list (without keyword ANY or FIRST!) does not fall into
that category... So yes just removing support for standby list would
result in a hard failure, which would be fine for the
let-s-break-all-things move.

But I'm not against keeping the backward compatibility for standby_list,
to be honest. My concern is that the latest patch tries to support
backward compatibility only "partially", which would be confusing to users,
as I said upthread.

If we try to support backward compatibility, I'd personally do it
fully, and have a list of standby names specified meaning a priority
list.

So I'd like to propose keeping full backward compatibility for s_s_names
(i.e., both "standby_list" and "N (standby_list)" mean the priority method)
in the first commit, then continuing the discussion and changing it if we
reach a consensus before PostgreSQL 10 is actually released. Thoughts?

+1 on that.

+1.
I'll update the patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#60

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Masahiko Sawada (#59)

Re: Quorum commit for multiple synchronous replication.

At Thu, 15 Dec 2016 14:20:53 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDn73aC+o0mrWCs800LeOsMYP4oV7xVb0T0_4V5VCQzhQ@mail.gmail.com>

On Thu, Dec 15, 2016 at 11:23 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Dec 15, 2016 at 11:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 15, 2016 at 6:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 14, 2016 at 11:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

If we drop the "standby_list" syntax, I don't think that new parameter is
necessary. We can keep s_s_names and just drop the support for that syntax
from s_s_names. This may be ok if we're really in "break all the things" mode
for PostgreSQL 10.

Please let's not raise that as an argument again... And not break the
s_list argument. Many users depend on that for just single sync
standbys. FWIW, I'd be in favor of backward compatibility and say that
a standby list is a priority list if we can maintain that. Upthread
agreement was to break that, I did not insist further, and won't if
that's still the feeling.

I wonder why you think that the backward-compatibility for standby_list is
so "special". We renamed the pg_xlog directory to pg_wal and are planning to
change the recovery.conf API entirely, though those have bigger impacts on
existing users in terms of backward compatibility. OTOH, so far,
GUC changes between major releases have happened several times.

Silent failures for existing failover deployments are a pain to solve
after doing upgrades. That's my only concern. Changing pg_wal would
result in a hard failure when upgrading. And changing the meaning of
the standby list (without keyword ANY or FIRST!) does not fall into
that category... So yes just removing support for standby list would
result in a hard failure, which would be fine for the
let-s-break-all-things move.

But I'm not against keeping the backward compatibility for standby_list,
to be honest. My concern is that the latest patch tries to support
backward compatibility only "partially", which would be confusing to users,
as I said upthread.

If we try to support backward compatibility, I'd personally do it
fully, and have a list of standby names specified meaning a priority
list.

So I'd like to propose keeping full backward compatibility for s_s_names
(i.e., both "standby_list" and "N (standby_list)" mean the priority method)
in the first commit, then continuing the discussion and changing it if we
reach a consensus before PostgreSQL 10 is actually released. Thoughts?

+1 on that.

+1.

FWIW, +1 from me.

I'll update the patch.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#61

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#60)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 15, 2016 at 3:06 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Thu, 15 Dec 2016 14:20:53 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoDn73aC+o0mrWCs800LeOsMYP4oV7xVb0T0_4V5VCQzhQ@mail.gmail.com>

On Thu, Dec 15, 2016 at 11:23 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Dec 15, 2016 at 11:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Dec 15, 2016 at 6:47 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Dec 14, 2016 at 11:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

If we drop the "standby_list" syntax, I don't think that new parameter is
necessary. We can keep s_s_names and just drop the support for that syntax
from s_s_names. This may be ok if we're really in "break all the things" mode
for PostgreSQL 10.

Please let's not raise that as an argument again... And not break the
s_list argument. Many users depend on that for just single sync
standbys. FWIW, I'd be in favor of backward compatibility and say that
a standby list is a priority list if we can maintain that. Upthread
agreement was to break that, I did not insist further, and won't if
that's still the feeling.

I wonder why you think that the backward-compatibility for standby_list is
so "special". We renamed the pg_xlog directory to pg_wal and are planning to
change the recovery.conf API entirely, though those have bigger impacts on
existing users in terms of backward compatibility. OTOH, so far,
GUC changes between major releases have happened several times.

Silent failures for existing failover deployments are a pain to solve
after doing upgrades. That's my only concern. Changing pg_wal would
result in a hard failure when upgrading. And changing the meaning of
the standby list (without keyword ANY or FIRST!) does not fall into
that category... So yes just removing support for standby list would
result in a hard failure, which would be fine for the
let-s-break-all-things move.

But I'm not against keeping the backward compatibility for standby_list,
to be honest. My concern is that the latest patch tries to support
backward compatibility only "partially", which would be confusing to users,
as I said upthread.

If we try to support backward compatibility, I'd personally do it
fully, and have a list of standby names specified meaning a priority
list.

So I'd like to propose keeping full backward compatibility for s_s_names
(i.e., both "standby_list" and "N (standby_list)" mean the priority method)
in the first commit, then continuing the discussion and changing it if we
reach a consensus before PostgreSQL 10 is actually released. Thoughts?

+1 on that.

+1.

FWIW, +1 from me.

I'll update the patch.

Attached is the latest v12 patch.
I changed the behavior of "N (standby_list)" to use the priority method
and incorporated the review comments made so far. Please review it.
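
To illustrate the difference between the two methods, here is a minimal sketch. It is not the patch's implementation: `Lsn` stands in for XLogRecPtr, the function names are hypothetical, and every listed standby is assumed to be connected and streaming. With the priority method the master waits for the first num_sync standbys in list order, so commits are released up to the minimum position those standbys have acknowledged. With the quorum method any num_sync standbys suffice, so commits are released up to the num_sync-th largest acknowledged position.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t Lsn;       /* stand-in for XLogRecPtr in this sketch */

/* qsort comparator: sort Lsn values in descending order. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
    Lsn x = *(const Lsn *) a;
    Lsn y = *(const Lsn *) b;

    return (x < y) - (x > y);
}

/* Priority method (FIRST): acked[] is in list order, index 0 having the
 * highest priority.  Returns the minimum LSN acknowledged by the first
 * num_sync standbys, or 0 if there are not enough standbys. */
Lsn
priority_commit_lsn(const Lsn *acked, size_t n, size_t num_sync)
{
    Lsn     min;
    size_t  i;

    assert(acked != NULL);
    if (num_sync == 0 || n < num_sync)
        return 0;

    min = acked[0];
    for (i = 1; i < num_sync; i++)
    {
        if (acked[i] < min)
            min = acked[i];
    }
    return min;
}

/* Quorum method (ANY): returns the num_sync-th largest acknowledged LSN
 * among all listed standbys, or 0 if there are not enough standbys. */
Lsn
quorum_commit_lsn(const Lsn *acked, size_t n, size_t num_sync)
{
    Lsn    *sorted;
    Lsn     result;

    assert(acked != NULL);
    if (num_sync == 0 || n < num_sync)
        return 0;

    sorted = malloc(n * sizeof(Lsn));
    if (sorted == NULL)
        return 0;           /* treat allocation failure as "not ready" */

    memcpy(sorted, acked, n * sizeof(Lsn));
    qsort(sorted, n, sizeof(Lsn), cmp_lsn_desc);
    result = sorted[num_sync - 1];
    free(sorted);
    return result;
}
```

For example, if s1 to s4 have acknowledged positions 100, 300, 200 and 400, then 'FIRST 2 (s1, s2, s3, s4)' releases commits up to 100 (the slower of s1 and s2), while 'ANY 2 (s1, s2, s3, s4)' releases commits up to 300 because s2 and s4 have both reached that point.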

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_quorum_commit_v12.patchtext/x-diff; charset=US-ASCII; name=000_quorum_commit_v12.patchDownload

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0fc4e57..91eb888 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3054,41 +3054,67 @@ include_dir 'conf.d'
         transactions waiting for commit will be allowed to proceed after
         these standby servers confirm receipt of their data.
         The synchronous standbys will be those whose names appear
-        earlier in this list, and
+        in this list, and
         that are both currently connected and streaming data in real-time
         (as shown by a state of <literal>streaming</literal> in the
         <link linkend="monitoring-stats-views-table">
         <literal>pg_stat_replication</></link> view).
-        Other standby servers appearing later in this list represent potential
-        synchronous standbys. If any of the current synchronous
-        standbys disconnects for whatever reason,
-        it will be replaced immediately with the next-highest-priority standby.
-        Specifying more than one standby name can allow very high availability.
        </para>
        <para>
         This parameter specifies a list of standby servers using
         either of the following syntaxes:
 <synopsis>
-<replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+[FIRST] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
+ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
 <replaceable class="parameter">standby_name</replaceable> [, ...]
 </synopsis>
         where <replaceable class="parameter">num_sync</replaceable> is
         the number of synchronous standbys that transactions need to
         wait for replies from,
         and <replaceable class="parameter">standby_name</replaceable>
-        is the name of a standby server. For example, a setting of
-        <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
-        until their WAL records are received by three higher-priority standbys
-        chosen from standby servers <literal>s1</>, <literal>s2</>,
-        <literal>s3</> and <literal>s4</>.
-        </para>
-        <para>
-        The second syntax was used before <productname>PostgreSQL</>
+        is the name of a standby server.
+        <literal>FIRST</> and <literal>ANY</> specify the method used by
+        the master to control the standby servres.
+       </para>
+       <para>
+        The keyword <literal>FIRST</>, coupled with <literal>num_sync</>, makes
+        transaction commit wait until WAL records are received from the
+        <literal>num_sync</> standbys with higher priority number.
+        For example, a setting of <literal>FIRST 3 (s1, s2, s3, s4)</>
+        makes transaction commits wait until their WAL records are received
+        by three higher-priority standbys chosen from standby servers
+        <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
+        The other standby servers appearing later in list represent potential
+        synchronous standbys. If any of the current synchronous standbys
+        disconnects for whatever reason, it will be replaced immediately
+        with the next-highest-priority standby. Specifying more than one standby
+        name can allow very high availability. The keyword <literal>FIRST</>
+        is optional.
+       </para>
+       <para>
+        The keyword <literal>ANY</>, coupled with <literal>num_sync</>,
+        makes transaction commits wait until WAL records are received
+        from at least <literal>num_sync</> connected standbys among those
+        defined in the list of <varname>synchronous_standby_names</>. For
+        example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> makes
+        transaction commits wait until receiving WAL records from at least
+        any three standbys of four listed servers <literal>s1</>,
+        <literal>s2</>, <literal>s3</>, <literal>s4</>. The transaction
+        can continue to proceed as long as <literal>num_sync</> standbys
+        live. Specifying more than one standby name can allow very high
+        availability.
+       </para>
+       <para>
+        <literal>FIRST</> and <literal>ANY</> are case-insensitive words
+        and the standby name having these words are must be double-quoted.
+       </para>
+       <para>
+        The third syntax was used before <productname>PostgreSQL</>
         version 9.6 and is still supported. It's the same as the first syntax
-        with <replaceable class="parameter">num_sync</replaceable> equal to 1.
-        For example, <literal>1 (s1, s2)</> and
-        <literal>s1, s2</> have the same meaning: either <literal>s1</>
-        or <literal>s2</> is chosen as a synchronous standby.
+        with <literal>FIRST</> and <replaceable class="parameter">num_sync</replaceable>
+        equal to 1. For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</>
+        have the same meaning: either <literal>s1</> or <literal>s2</> is
+        chosen as a synchronous standby.
        </para>
        <para>
         The name of a standby server for this purpose is the
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 6b89507..26e3c4e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1150,7 +1150,7 @@ primary_slot_name = 'node_a_slot'
     An example of <varname>synchronous_standby_names</> for multiple
     synchronous standbys is:
 <programlisting>
-synchronous_standby_names = '2 (s1, s2, s3)'
+synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
 </programlisting>
     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
     <literal>s3</> and <literal>s4</> are running, the two standbys
@@ -1165,6 +1165,18 @@ synchronous_standby_names = '2 (s1, s2, s3)'
     The synchronous states of standby servers can be viewed using
     the <structname>pg_stat_replication</structname> view.
    </para>
+   <para>
+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standby is:
+<programlisting>
+ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+</programlisting>
+    In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+    <literal>s3</> and <literal>s4</> are running, the three standbys <literal>s1</>,
+    <literal>s2</> and <literal>s3</> will be considered as synchronous standby
+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
    </sect3>
 
    <sect3 id="synchronous-replication-performance">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 128ee13..771787d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1437,6 +1437,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
            <literal>sync</>: This standby server is synchronous.
           </para>
          </listitem>
+         <listitem>
+         <para>
+          <literal>quorum</>: This standby is considered as a candidate of quorum commit.
+         </para>
+         </listitem>
        </itemizedlist>
      </entry>
     </row>
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index c99717e..da8bcf0 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -26,7 +26,7 @@ repl_gram.o: repl_scanner.c
 
 # syncrep_scanner is complied as part of syncrep_gram
 syncrep_gram.o: syncrep_scanner.c
-syncrep_scanner.c: FLEXFLAGS = -CF -p
+syncrep_scanner.c: FLEXFLAGS = -CF -p -i
 syncrep_scanner.c: FLEX_NO_BACKUP=yes
 
 # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..1de796c 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -31,16 +31,21 @@
  *
  * In 9.5 or before only a single standby could be considered as
  * synchronous. In 9.6 we support multiple synchronous standbys.
- * The number of synchronous standbys that transactions must wait for
- * replies from is specified in synchronous_standby_names.
- * This parameter also specifies a list of standby names,
- * which determines the priority of each standby for being chosen as
- * a synchronous standby. The standbys whose names appear earlier
- * in the list are given higher priority and will be considered as
- * synchronous. Other standby servers appearing later in this list
- * represent potential synchronous standbys. If any of the current
- * synchronous standbys disconnects for whatever reason, it will be
- * replaced immediately with the next-highest-priority standby.
+ * In 10.0 we support two synchronization methods, priority and
+ * quorum. The number of synchronous standbys that transactions
+ * must wait for replies from and synchronization method are
+ * specified in synchronous_standby_names. The priority method is
+ * represented by FIRST or nothing specified, and the quorum method
+ * is represented by ANY. This parameter also specifies a list of
+ * standby names, which determines the priority of each standby for
+ * being chosen as a synchronous standby. In priority method, the
+ * standbys whose names appear earlier in the list are given higher
+ * priority and will be considered as synchronous. Other standby
+ * servers appearing later in this list represent potential synchronous
+ * standbys. If any of the current synchronous standbys disconnects
+ * for whatever reason, it will be replaced immediately with the
+ * next-highest-priority standby. In quorum method, the all standbys
+ * appearing in the list are considered as a candidate for quorum commit.
  *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
@@ -73,24 +78,27 @@
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+SyncRepConfigData *SyncRepConfig = NULL;
 
 #define SyncStandbysDefined() \
 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
 
 static bool announce_next_takeover = true;
 
-static SyncRepConfigData *SyncRepConfig = NULL;
 static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
 
 static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
+static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
+								 XLogRecPtr *flushPtr,
+								 XLogRecPtr *applyPtr,
+								 bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
+static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
 static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -386,7 +394,7 @@ SyncRepReleaseWaiters(void)
 	XLogRecPtr	writePtr;
 	XLogRecPtr	flushPtr;
 	XLogRecPtr	applyPtr;
-	bool		got_oldest;
+	bool		got_recptr;
 	bool		am_sync;
 	int			numwrite = 0;
 	int			numflush = 0;
@@ -413,11 +421,10 @@ SyncRepReleaseWaiters(void)
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 	/*
-	 * Check whether we are a sync standby or not, and calculate the oldest
-	 * positions among all sync standbys.
+	 * Check whether we are a sync standby or not, and calculate the synced
+	 * positions among all sync standbys using the configured method.
 	 */
-	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -435,7 +442,7 @@ SyncRepReleaseWaiters(void)
 	 * If the number of sync standbys is less than requested or we aren't
 	 * managing a sync standby then just leave.
 	 */
-	if (!got_oldest || !am_sync)
+	if (!got_recptr || !am_sync)
 	{
 		LWLockRelease(SyncRepLock);
 		announce_next_takeover = !am_sync;
@@ -471,17 +478,50 @@ SyncRepReleaseWaiters(void)
 }
 
 /*
- * Calculate the oldest Write, Flush and Apply positions among sync standbys.
+ * Return the list of sync standbys according to synchronous method, or
+ * return NIL if no sync standby is connected. The caller must hold SyncRepLock.
+ *
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+List *
+SyncRepGetSyncStandbys(bool	*am_sync)
+{
+	/* Set default result */
+	if (am_sync != NULL)
+		*am_sync = false;
+
+	/* Quick exit if sync replication is not requested */
+	if (SyncRepConfig == NULL)
+		return NIL;
+
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
+		return SyncRepGetSyncStandbysPriority(am_sync);
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+		return SyncRepGetSyncStandbysQuorum(am_sync);
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid synchronization method is specified \"%d\"",
+						SyncRepConfig->sync_method)));
+}
+
+/*
+ * Calculate the Write, Flush and Apply positions among sync standbys.
  *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * store the positions into *writePtr, *flushPtr and *applyPtr.
+ *
+ * In priority method, we need the oldest of these positions among sync
+ * standbys. In quorum method, we need the N-th latest of these positions,
+ * where N is SyncRepConfig->num_sync.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
 static bool
-SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
+SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 						   XLogRecPtr *applyPtr, bool *am_sync)
 {
 	List	   *sync_standbys;
@@ -507,47 +547,146 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 		return false;
 	}
 
-	/*
-	 * Scan through all sync standbys and calculate the oldest Write, Flush
-	 * and Apply positions.
-	 */
-	foreach(cell, sync_standbys)
+	if (SyncRepConfig->sync_method == SYNC_REP_PRIORITY)
 	{
-		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
-		XLogRecPtr	write;
-		XLogRecPtr	flush;
-		XLogRecPtr	apply;
-
-		SpinLockAcquire(&walsnd->mutex);
-		write = walsnd->write;
-		flush = walsnd->flush;
-		apply = walsnd->apply;
-		SpinLockRelease(&walsnd->mutex);
-
-		if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
-			*writePtr = write;
-		if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
-			*flushPtr = flush;
-		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
-			*applyPtr = apply;
+		/*
+		 * Scan through all sync standbys and calculate the oldest
+		 * Write, Flush and Apply positions.
+		 */
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+			XLogRecPtr	write;
+			XLogRecPtr	flush;
+			XLogRecPtr	apply;
+
+			SpinLockAcquire(&walsnd->mutex);
+			write = walsnd->write;
+			flush = walsnd->flush;
+			apply = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			if (XLogRecPtrIsInvalid(*writePtr) || *writePtr > write)
+				*writePtr = write;
+			if (XLogRecPtrIsInvalid(*flushPtr) || *flushPtr > flush)
+				*flushPtr = flush;
+			if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
+				*applyPtr = apply;
+		}
 	}
+	else if (SyncRepConfig->sync_method == SYNC_REP_QUORUM)
+	{
+		XLogRecPtr	*write_array;
+		XLogRecPtr	*flush_array;
+		XLogRecPtr	*apply_array;
+		int len;
+		int i = 0;
+
+		len = list_length(sync_standbys);
+		write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+		apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
+
+		foreach (cell, sync_standbys)
+		{
+			WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
+
+			SpinLockAcquire(&walsnd->mutex);
+			write_array[i] = walsnd->write;
+			flush_array[i] = walsnd->flush;
+			apply_array[i] = walsnd->apply;
+			SpinLockRelease(&walsnd->mutex);
+
+			i++;
+		}
+
+		qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
+		qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
+
+		/*
+		 * Get N-th latest Write, Flush, Apply positions
+		 * specified by SyncRepConfig->num_sync.
+		 */
+		*writePtr = write_array[SyncRepConfig->num_sync - 1];
+		*flushPtr = flush_array[SyncRepConfig->num_sync - 1];
+		*applyPtr = apply_array[SyncRepConfig->num_sync - 1];
+
+		pfree(write_array);
+		pfree(flush_array);
+		pfree(apply_array);
+	}
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("invalid synchronization method is specified \"%d\"",
+						SyncRepConfig->sync_method)));
 
 	list_free(sync_standbys);
 	return true;
 }
 
 /*
- * Return the list of sync standbys, or NIL if no sync standby is connected.
+ * Return the list of sync standbys using quorum method, or return
+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
+ * standbys except for the standbys which are not active, or connected
+ * as async.
  *
- * If there are multiple standbys with the same priority,
+ * On return, *am_sync is set to true if this walsender is connecting to
+ * sync standby. Otherwise it's set to false.
+ */
+static List *
+SyncRepGetSyncStandbysQuorum(bool *am_sync)
+{
+	List	*result = NIL;
+	int i;
+
+	Assert(SyncRepConfig->sync_method == SYNC_REP_QUORUM);
+
+	for (i = 0; i < max_wal_senders; i++)
+	{
+		volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
+
+		/* Must be active */
+		if (walsnd->pid == 0)
+			continue;
+
+		/* Must be streaming */
+		if (walsnd->state != WALSNDSTATE_STREAMING)
+			continue;
+
+		/* Must be synchronous */
+		if (walsnd->sync_standby_priority == 0)
+			continue;
+
+		/* Must have a valid flush position */
+		if (XLogRecPtrIsInvalid(walsnd->flush))
+			continue;
+
+		/*
+		 * Consider this standby as candidate of sync and append
+		 * it to the result.
+		 */
+		result = lappend_int(result, i);
+		if (am_sync != NULL && walsnd == MyWalSnd)
+			*am_sync = true;
+	}
+
+	return result;
+}
+
+/*
+ * Return the list of sync standbys using priority method, or return
+ * NIL if no sync standby is connected. In priority method,
+ * if there are multiple standbys with the same priority,
  * the first one found is selected preferentially.
- * The caller must hold SyncRepLock.
  *
  * On return, *am_sync is set to true if this walsender is connecting to
  * sync standby. Otherwise it's set to false.
  */
-List *
-SyncRepGetSyncStandbys(bool *am_sync)
+static List *
+SyncRepGetSyncStandbysPriority(bool *am_sync)
 {
 	List	   *result = NIL;
 	List	   *pending = NIL;
@@ -560,13 +699,7 @@ SyncRepGetSyncStandbys(bool *am_sync)
 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
 								 * rearrangement */
 
-	/* Set default result */
-	if (am_sync != NULL)
-		*am_sync = false;
-
-	/* Quick exit if sync replication is not requested */
-	if (SyncRepConfig == NULL)
-		return NIL;
+	Assert(SyncRepConfig->sync_method == SYNC_REP_PRIORITY);
 
 	lowest_priority = SyncRepConfig->nmembers;
 	next_highest_priority = lowest_priority + 1;
@@ -892,6 +1025,23 @@ SyncRepQueueIsOrderedByLSN(int mode)
 #endif
 
 /*
+ * Compare lsn in order to sort array in descending order.
+ */
+static int
+cmp_lsn(const void *a, const void *b)
+{
+	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
+	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
+
+	if (lsn1 > lsn2)
+		return -1;
+	else if (lsn1 == lsn2)
+		return 0;
+	else
+		return 1;
+}
+
+/*
  * ===========================================================
  * Synchronous Replication functions executed by any process
  * ===========================================================
diff --git a/src/backend/replication/syncrep_gram.y b/src/backend/replication/syncrep_gram.y
index 35c2776..0ba7c4e 100644
--- a/src/backend/replication/syncrep_gram.y
+++ b/src/backend/replication/syncrep_gram.y
@@ -21,7 +21,7 @@ SyncRepConfigData *syncrep_parse_result;
 char	   *syncrep_parse_error_msg;
 
 static SyncRepConfigData *create_syncrep_config(const char *num_sync,
-					  List *members);
+					List *members, uint8 sync_method);
 
 /*
  * Bison doesn't allocate anything that needs to live across parser calls,
@@ -46,7 +46,7 @@ static SyncRepConfigData *create_syncrep_config(const char *num_sync,
 	SyncRepConfigData *config;
 }
 
-%token <str> NAME NUM JUNK
+%token <str> NAME NUM JUNK ANY FIRST
 
 %type <config> result standby_config
 %type <list> standby_list
@@ -60,8 +60,10 @@ result:
 	;
 
 standby_config:
-		standby_list				{ $$ = create_syncrep_config("1", $1); }
-		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
+		standby_list						{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
+		| NUM '(' standby_list ')'			{ $$ = create_syncrep_config($1, $3, SYNC_REP_PRIORITY); }
+		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
+		| FIRST NUM '(' standby_list ')'	{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
 	;
 
 standby_list:
@@ -75,9 +77,8 @@ standby_name:
 	;
 %%
 
-
 static SyncRepConfigData *
-create_syncrep_config(const char *num_sync, List *members)
+create_syncrep_config(const char *num_sync, List *members, uint8 sync_method)
 {
 	SyncRepConfigData *config;
 	int			size;
@@ -98,6 +99,7 @@ create_syncrep_config(const char *num_sync, List *members)
 
 	config->config_size = size;
 	config->num_sync = atoi(num_sync);
+	config->sync_method = sync_method;
 	config->nmembers = list_length(members);
 	ptr = config->member_names;
 	foreach(lc, members)
diff --git a/src/backend/replication/syncrep_scanner.l b/src/backend/replication/syncrep_scanner.l
index d20662e..c08e95b 100644
--- a/src/backend/replication/syncrep_scanner.l
+++ b/src/backend/replication/syncrep_scanner.l
@@ -54,6 +54,8 @@ digit			[0-9]
 ident_start		[A-Za-z\200-\377_]
 ident_cont		[A-Za-z\200-\377_0-9\$]
 identifier		{ident_start}{ident_cont}*
+any_ident		any
+first_ident		first
 
 dquote			\"
 xdstart			{dquote}
@@ -64,6 +66,8 @@ xdinside		[^"]+
 %%
 {space}+	{ /* ignore */ }
 
+{any_ident}	{ return ANY; }
+{first_ident}	{ return FIRST; }
 {xdstart}	{
 				initStringInfo(&xdbuf);
 				BEGIN(xd);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b14d821..fe396a8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2868,12 +2868,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 
 			/*
 			 * More easily understood version of standby state. This is purely
-			 * informational, not different from priority.
+			 * informational. In quorum method, since all standbys are considered as
+			 * candidates for quorum commit, the standby state is always 'quorum'.
 			 */
 			if (priority == 0)
 				values[7] = CStringGetTextDatum("async");
 			else if (list_member_int(sync_standbys, i))
-				values[7] = CStringGetTextDatum("sync");
+				values[7] = SyncRepConfig->sync_method == SYNC_REP_PRIORITY ?
+					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
 			else
 				values[7] = CStringGetTextDatum("potential");
 		}
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7f9acfd..b332247 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -245,7 +245,8 @@
 # These settings are ignored on a standby server.
 
 #synchronous_standby_names = ''	# standby servers that provide sync rep
-				# number of sync standbys and comma-separated list of application_name
+				# synchronization method, number of sync standbys
+				# and comma-separated list of application_name
 				# from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
 
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..29c35e4 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -32,6 +32,10 @@
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
 
+/* sync_method of SyncRepConfigData */
+#define SYNC_REP_PRIORITY	0
+#define SYNC_REP_QUORUM		1
+
 /*
  * Struct for the configuration of synchronous replication.
  *
@@ -45,10 +49,13 @@ typedef struct SyncRepConfigData
 	int			num_sync;		/* number of sync standbys that we need to
 								 * wait for */
 	int			nmembers;		/* number of members in the following list */
+	uint8		sync_method;	/* synchronization method */
 	/* member_names contains nmembers consecutive nul-terminated C strings */
 	char		member_names[FLEXIBLE_ARRAY_MEMBER];
 } SyncRepConfigData;
 
+extern SyncRepConfigData *SyncRepConfig;
+
 /* communication variables for parsing synchronous_standby_names GUC */
 extern SyncRepConfigData *syncrep_parse_result;
 extern char *syncrep_parse_error_msg;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index 0c87226..00e1ea0 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 8;
+use Test::More tests => 11;
 
 # Query checking sync_priority and sync_state of each standby
 my $check_sql =
@@ -172,3 +172,34 @@ test_sync_state(
 standby2|1|sync
 standby4|1|potential),
 	'potential standby found earlier in array is promoted to sync');
+
+# Check that priority method is used and standby1 and standby2 are considered
+# as synchronous standby.
+test_sync_state(
+$node_master, qq(standby1|1|sync
+standby2|2|sync
+standby4|0|async),
+'specify priority method by FIRST',
+'FIRST 2(standby1, standby2)');
+
+# Check the state of standbys listed as voters when the quorum
+# method is used.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|2|quorum
+standby4|0|async),
+'2 quorum and 1 async',
+'ANY 2(standby1, standby2)');
+
+# Start Standby3 which will be considered in 'quorum' state.
+$node_standby_3->start;
+
+# Check that a setting of 'ANY 2(*)' chooses all standbys as
+# voters.
+test_sync_state(
+$node_master, qq(standby1|1|quorum
+standby2|1|quorum
+standby3|1|quorum
+standby4|1|quorum),
+'all standbys are considered as candidates for quorum commit',
+'ANY 2(*)');
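
For illustration, the selection done in the quorum branch of SyncRepGetSyncRecPtr() above (sort each position array in descending order, then take element num_sync - 1) can be tried outside PostgreSQL with a small standalone sketch. The names Lsn, oldest_lsn and quorum_lsn are hypothetical stand-ins and are not part of the patch; only the sort-then-index idea and the minimum used by the priority branch are taken from it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t Lsn;

/* qsort comparator: order LSNs descending, so the latest comes first. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
	Lsn			x = *(const Lsn *) a;
	Lsn			y = *(const Lsn *) b;

	if (x > y)
		return -1;
	if (x < y)
		return 1;
	return 0;
}

/* Priority method: waiters are released up to the oldest reported position. */
static Lsn
oldest_lsn(const Lsn *lsns, int len)
{
	Lsn			min;

	assert(len >= 1);
	min = lsns[0];
	for (int i = 1; i < len; i++)
		if (lsns[i] < min)
			min = lsns[i];
	return min;
}

/*
 * Quorum method: return the N-th latest position (N = num_sync).
 * At least num_sync of the reported positions are >= the result.
 */
static Lsn
quorum_lsn(const Lsn *lsns, int len, int num_sync)
{
	Lsn		   *sorted;
	Lsn			result;

	assert(num_sync >= 1 && num_sync <= len);

	/* Sort a copy so the caller's array is left untouched. */
	sorted = malloc(sizeof(Lsn) * (size_t) len);
	assert(sorted != NULL);
	memcpy(sorted, lsns, sizeof(Lsn) * (size_t) len);
	qsort(sorted, (size_t) len, sizeof(Lsn), cmp_lsn_desc);

	result = sorted[num_sync - 1];
	free(sorted);
	return result;
}
```

With flush positions {100, 400, 250, 300} and num_sync = 2, quorum_lsn() returns 300, because two standbys (400 and 300) have flushed at least that far. oldest_lsn() on the same input returns 100, which is what the priority method would use if all four were sync standbys.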

#62

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#61)

Re: Quorum commit for multiple synchronous replication.

On Thu, Dec 15, 2016 at 6:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> Attached latest v12 patch.
> I changed behavior of "N (standby_list)" to use the priority method
> and incorporated some review comments so far. Please review it.

Some comments...

+    Another example of <varname>synchronous_standby_names</> for multiple
+    synchronous standby is:
Here standby takes an 's'.

+    candidates. The master server will wait for at least 2 replies from them.
+    <literal>s4</> is an asynchronous standby since its name is not in the list.
+   </para>
"will wait for replies from at least two of them".

+ * next-highest-priority standby. In quorum method, the all standbys
+ * appearing in the list are considered as a candidate for quorum commit.
"the all" is incorrect. I think you mean "all the" instead.

+ * NIL if no sync standby is connected. In quorum method, all standby
+ * priorities are same, that is 1. So this function returns the list of
This is not true. Standbys have a priority number assigned. Though it does
not matter much for quorum groups, it gives an indication of their position
in the defined list.
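
To make the point about list position concrete, here is a minimal sketch of a priority that simply follows the standby's position in the configured list. The function name standby_priority, the case-insensitive comparison and the treatment of '*' are assumptions made for this illustration, chosen to agree with the expected output in the 007_sync_rep.pl hunk of the patch ('ANY 2(standby1, standby2)' reports priorities 1 and 2, 'ANY 2(*)' reports 1 for every standby). It is not the code in syncrep.c.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>			/* strcasecmp (POSIX) */

/*
 * Hypothetical helper: return the 1-based position of the first list
 * member matching app_name, or 0 if no member matches (async standby).
 * A member of "*" matches any name.
 */
static int
standby_priority(const char *const *members, int nmembers,
				 const char *app_name)
{
	for (int i = 0; i < nmembers; i++)
	{
		if (strcmp(members[i], "*") == 0 ||
			strcasecmp(members[i], app_name) == 0)
			return i + 1;
	}
	return 0;
}
```

Under these assumptions the number is the same whether the list is introduced by FIRST or ANY. Only the priority method uses it to pick the sync standbys, while the quorum method treats every standby with a non-zero value as a candidate.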

 #synchronous_standby_names = ''    # standby servers that provide sync rep
 -               # number of sync standbys and comma-separated list of application_name
 +               # synchronization method, number of sync standbys
 +               # and comma-separated list of application_name
                 # from standby(s); '*' = all
The formulation is funny here: "sync rep synchronization method".

I think that Fujii-san has also some doc changes in his box. For anybody
picking up this patch next, it would be good to incorporate the things
I have noticed here.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Michael Paquier (#62)

Re: Quorum commit for multiple synchronous replication.

On Fri, Dec 16, 2016 at 2:38 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Dec 15, 2016 at 6:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached latest v12 patch.
>> I changed behavior of "N (standby_list)" to use the priority method
>> and incorporated some review comments so far. Please review it.
>
> Some comments...
>
> +    Another example of <varname>synchronous_standby_names</> for multiple
> +    synchronous standby is:
> Here standby takes an 's'.
>
> +    candidates. The master server will wait for at least 2 replies from them.
> +    <literal>s4</> is an asynchronous standby since its name is not in the list.
> +   </para>
> "will wait for replies from at least two of them".
>
> + * next-highest-priority standby. In quorum method, the all standbys
> + * appearing in the list are considered as a candidate for quorum commit.
> "the all" is incorrect. I think you mean "all the" instead.
>
> + * NIL if no sync standby is connected. In quorum method, all standby
> + * priorities are same, that is 1. So this function returns the list of
> This is not true. Standbys have a priority number assigned. Though it does
> not matter much for quorum groups, it gives an indication of their position
> in the defined list.
>
>  #synchronous_standby_names = ''    # standby servers that provide sync rep
>  -               # number of sync standbys and comma-separated list of application_name
>  +               # synchronization method, number of sync standbys
>  +               # and comma-separated list of application_name
>                  # from standby(s); '*' = all
> The formulation is funny here: "sync rep synchronization method".
>
> I think that Fujii-san has also some doc changes in his box. For anybody
> picking up this patch next, it would be good to incorporate the things
> I have noticed here.

Yes, I will. Thanks!

Regards,

--
Fujii Masao


#64

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Fujii Masao (#63)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Fri, Dec 16, 2016 at 5:04 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Dec 16, 2016 at 2:38 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Dec 15, 2016 at 6:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Attached latest v12 patch.
>>> I changed behavior of "N (standby_list)" to use the priority method
>>> and incorporated some review comments so far. Please review it.
>>
>> Some comments...
>>
>> +    Another example of <varname>synchronous_standby_names</> for multiple
>> +    synchronous standby is:
>> Here standby takes an 's'.
>>
>> +    candidates. The master server will wait for at least 2 replies from them.
>> +    <literal>s4</> is an asynchronous standby since its name is not in the list.
>> +   </para>
>> "will wait for replies from at least two of them".
>>
>> + * next-highest-priority standby. In quorum method, the all standbys
>> + * appearing in the list are considered as a candidate for quorum commit.
>> "the all" is incorrect. I think you mean "all the" instead.
>>
>> + * NIL if no sync standby is connected. In quorum method, all standby
>> + * priorities are same, that is 1. So this function returns the list of
>> This is not true. Standbys have a priority number assigned. Though it does
>> not matter much for quorum groups, it gives an indication of their position
>> in the defined list.
>>
>>  #synchronous_standby_names = ''    # standby servers that provide sync rep
>>  -               # number of sync standbys and comma-separated list of application_name
>>  +               # synchronization method, number of sync standbys
>>  +               # and comma-separated list of application_name
>>                  # from standby(s); '*' = all
>> The formulation is funny here: "sync rep synchronization method".
>>
>> I think that Fujii-san has also some doc changes in his box. For anybody
>> picking up this patch next, it would be good to incorporate the things
>> I have noticed here.
>
> Yes, I will. Thanks!

Attached is the modified version of the patch. Barring objections, I will
commit this version.

Even after committing the patch, there will still be many source comments
and documentation that we need to update, for example,
in high-availability.sgml. We need to check and update them thoroughly later.

Regards,

--
Fujii Masao

Attachments:

quorum_commit_v13.patchtext/x-patch; charset=US-ASCII; name=quorum_commit_v13.patchDownload

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 3054,3094 **** include_dir 'conf.d'
          transactions waiting for commit will be allowed to proceed after
          these standby servers confirm receipt of their data.
          The synchronous standbys will be those whose names appear
!         earlier in this list, and
          that are both currently connected and streaming data in real-time
          (as shown by a state of <literal>streaming</literal> in the
          <link linkend="monitoring-stats-views-table">
          <literal>pg_stat_replication</></link> view).
!         Other standby servers appearing later in this list represent potential
!         synchronous standbys. If any of the current synchronous
!         standbys disconnects for whatever reason,
!         it will be replaced immediately with the next-highest-priority standby.
!         Specifying more than one standby name can allow very high availability.
         </para>
         <para>
          This parameter specifies a list of standby servers using
          either of the following syntaxes:
  <synopsis>
! <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
  <replaceable class="parameter">standby_name</replaceable> [, ...]
  </synopsis>
          where <replaceable class="parameter">num_sync</replaceable> is
          the number of synchronous standbys that transactions need to
          wait for replies from,
          and <replaceable class="parameter">standby_name</replaceable>
!         is the name of a standby server. For example, a setting of
!         <literal>3 (s1, s2, s3, s4)</> makes transaction commits wait
!         until their WAL records are received by three higher-priority standbys
!         chosen from standby servers <literal>s1</>, <literal>s2</>,
!         <literal>s3</> and <literal>s4</>.
!         </para>
!         <para>
!         The second syntax was used before <productname>PostgreSQL</>
          version 9.6 and is still supported. It's the same as the first syntax
!         with <replaceable class="parameter">num_sync</replaceable> equal to 1.
!         For example, <literal>1 (s1, s2)</> and
!         <literal>s1, s2</> have the same meaning: either <literal>s1</>
!         or <literal>s2</> is chosen as a synchronous standby.
         </para>
         <para>
          The name of a standby server for this purpose is the
--- 3054,3124 ----
          transactions waiting for commit will be allowed to proceed after
          these standby servers confirm receipt of their data.
          The synchronous standbys will be those whose names appear
!         in this list, and
          that are both currently connected and streaming data in real-time
          (as shown by a state of <literal>streaming</literal> in the
          <link linkend="monitoring-stats-views-table">
          <literal>pg_stat_replication</></link> view).
!         Specifying more than one standby names can allow very high availability.
         </para>
         <para>
          This parameter specifies a list of standby servers using
          either of the following syntaxes:
  <synopsis>
! [FIRST] <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
! ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="parameter">standby_name</replaceable> [, ...] )
  <replaceable class="parameter">standby_name</replaceable> [, ...]
  </synopsis>
          where <replaceable class="parameter">num_sync</replaceable> is
          the number of synchronous standbys that transactions need to
          wait for replies from,
          and <replaceable class="parameter">standby_name</replaceable>
!         is the name of a standby server.
!         <literal>FIRST</> and <literal>ANY</> specify the method to choose
!         synchronous standbys from the listed servers.
!        </para>
!        <para>
!         The keyword <literal>FIRST</>, coupled with
!         <replaceable class="parameter">num_sync</replaceable>, specifies a
!         priority-based synchronous replication and makes transaction commits
!         wait until their WAL records are replicated to
!         <replaceable class="parameter">num_sync</replaceable> synchronous
!         standbys chosen based on their priorities. For example, a setting of
!         <literal>FIRST 3 (s1, s2, s3, s4)</> will cause each commit to wait for
!         replies from three higher-priority standbys chosen from standby servers
!         <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>.
!         The standbys whose names appear earlier in the list are given higher
!         priority and will be considered as synchronous. Other standby servers
!         appearing later in this list represent potential synchronous standbys.
!         If any of the current synchronous standbys disconnects for whatever
!         reason, it will be replaced immediately with the next-highest-priority
!         standby. The keyword <literal>FIRST</> is optional.
!        </para>
!        <para>
!         The keyword <literal>ANY</>, coupled with
!         <replaceable class="parameter">num_sync</replaceable>, specifies a
!         quorum-based synchronous replication and makes transaction commits
!         wait until their WAL records are replicated to <emphasis>at least</>
!         <replaceable class="parameter">num_sync</replaceable> listed standbys.
!         For example, a setting of <literal>ANY 3 (s1, s2, s3, s4)</> will cause
!         each commit to proceed as soon as at least any three standbys of
!         <literal>s1</>, <literal>s2</>, <literal>s3</> and <literal>s4</>
!         reply.
!        </para>
!        <para>
!         <literal>FIRST</> and <literal>ANY</> are case-insensitive. If these
!         keywords are used as the name of a standby server,
!         its <replaceable class="parameter">standby_name</replaceable> must
!         be double-quoted.
!        </para>
!        <para>
!         The third syntax was used before <productname>PostgreSQL</>
          version 9.6 and is still supported. It's the same as the first syntax
!         with <literal>FIRST</> and
!         <replaceable class="parameter">num_sync</replaceable> equal to 1.
!         For example, <literal>FIRST 1 (s1, s2)</> and <literal>s1, s2</> have
!         the same meaning: either <literal>s1</> or <literal>s2</> is chosen
!         as a synchronous standby.
         </para>
         <para>
          The name of a standby server for this purpose is the
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1138,1156 **** primary_slot_name = 'node_a_slot'
      as synchronous confirm receipt of their data. The number of synchronous
      standbys that transactions must wait for replies from is specified in
      <varname>synchronous_standby_names</>. This parameter also specifies
!     a list of standby names, which determines the priority of each standby
!     for being chosen as a synchronous standby. The standbys whose names
!     appear earlier in the list are given higher priority and will be considered
!     as synchronous. Other standby servers appearing later in this list
!     represent potential synchronous standbys. If any of the current
!     synchronous standbys disconnects for whatever reason, it will be replaced
!     immediately with the next-highest-priority standby.
     </para>
     <para>
!     An example of <varname>synchronous_standby_names</> for multiple
!     synchronous standbys is:
  <programlisting>
! synchronous_standby_names = '2 (s1, s2, s3)'
  </programlisting>
      In this example, if four standby servers <literal>s1</>, <literal>s2</>,
      <literal>s3</> and <literal>s4</> are running, the two standbys
--- 1138,1162 ----
      as synchronous confirm receipt of their data. The number of synchronous
      standbys that transactions must wait for replies from is specified in
      <varname>synchronous_standby_names</>. This parameter also specifies
!     a list of standby names and the method (<literal>FIRST</> and
!     <literal>ANY</>) to choose synchronous standbys from the listed ones.
     </para>
     <para>
!     The method <literal>FIRST</> specifies a priority-based synchronous
!     replication and makes transaction commits wait until their WAL records are
!     replicated to the requested number of synchronous standbys chosen based on
!     their priorities. The standbys whose names appear earlier in the list are
!     given higher priority and will be considered as synchronous. Other standby
!     servers appearing later in this list represent potential synchronous
!     standbys. If any of the current synchronous standbys disconnects for
!     whatever reason, it will be replaced immediately with the
!     next-highest-priority standby.
!    </para>
!    <para>
!     An example of <varname>synchronous_standby_names</> for
!     a priority-based multiple synchronous standbys is:
  <programlisting>
! synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
  </programlisting>
      In this example, if four standby servers <literal>s1</>, <literal>s2</>,
      <literal>s3</> and <literal>s4</> are running, the two standbys
***************
*** 1162,1167 **** synchronous_standby_names = '2 (s1, s2, s3)'
--- 1168,1191 ----
      its name is not in the list.
     </para>
     <para>
+     The method <literal>ANY</> specifies a quorum-based synchronous
+     replication and makes transaction commits wait until their WAL records
+     are replicated to <emphasis>at least</> the requested number of
+     synchronous standbys in the list.
+    </para>
+    <para>
+     An example of <varname>synchronous_standby_names</> for
+     a quorum-based multiple synchronous standbys is:
+ <programlisting>
+  synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
+ </programlisting>
+     In this example, if four standby servers <literal>s1</>, <literal>s2</>,
+     <literal>s3</> and <literal>s4</> are running, transaction commits will
+     wait for replies from at least any two standbys of <literal>s1</>,
+     <literal>s2</> and <literal>s3</>. <literal>s4</> is an asynchronous
+     standby since its name is not in the list.
+    </para>
+    <para>
      The synchronous states of standby servers can be viewed using
      the <structname>pg_stat_replication</structname> view.
     </para>
*** a/doc/src/sgml/monitoring.sgml
--- b/doc/src/sgml/monitoring.sgml
***************
*** 1412,1418 **** SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       <entry><structfield>sync_priority</></entry>
       <entry><type>integer</></entry>
       <entry>Priority of this standby server for being chosen as the
!       synchronous standby</entry>
      </row>
      <row>
       <entry><structfield>sync_state</></entry>
--- 1412,1419 ----
       <entry><structfield>sync_priority</></entry>
       <entry><type>integer</></entry>
       <entry>Priority of this standby server for being chosen as the
!       synchronous standby in a priority-based synchronous replication.
!       This has no effect in a quorum-based synchronous replication.</entry>
      </row>
      <row>
       <entry><structfield>sync_state</></entry>
***************
*** 1437,1442 **** SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
--- 1438,1449 ----
             <literal>sync</>: This standby server is synchronous.
            </para>
           </listitem>
+          <listitem>
+           <para>
+            <literal>quorum</>: This standby server is considered as a candidate
+            for quorum standbys.
+           </para>
+          </listitem>
         </itemizedlist>
       </entry>
      </row>
*** a/src/backend/replication/Makefile
--- b/src/backend/replication/Makefile
***************
*** 26,32 **** repl_gram.o: repl_scanner.c
  
  # syncrep_scanner is complied as part of syncrep_gram
  syncrep_gram.o: syncrep_scanner.c
! syncrep_scanner.c: FLEXFLAGS = -CF -p
  syncrep_scanner.c: FLEX_NO_BACKUP=yes
  
  # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
--- 26,32 ----
  
  # syncrep_scanner is complied as part of syncrep_gram
  syncrep_gram.o: syncrep_scanner.c
! syncrep_scanner.c: FLEXFLAGS = -CF -p -i
  syncrep_scanner.c: FLEX_NO_BACKUP=yes
  
  # repl_gram.c, repl_scanner.c, syncrep_gram.c and syncrep_scanner.c
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 30,52 ****
   * searching the through all waiters each time we receive a reply.
   *
   * In 9.5 or before only a single standby could be considered as
!  * synchronous. In 9.6 we support multiple synchronous standbys.
!  * The number of synchronous standbys that transactions must wait for
!  * replies from is specified in synchronous_standby_names.
!  * This parameter also specifies a list of standby names,
!  * which determines the priority of each standby for being chosen as
!  * a synchronous standby. The standbys whose names appear earlier
!  * in the list are given higher priority and will be considered as
!  * synchronous. Other standby servers appearing later in this list
!  * represent potential synchronous standbys. If any of the current
!  * synchronous standbys disconnects for whatever reason, it will be
!  * replaced immediately with the next-highest-priority standby.
   *
   * Before the standbys chosen from synchronous_standby_names can
   * become the synchronous standbys they must have caught up with
   * the primary; that may take some time. Once caught up,
!  * the current higher priority standbys which are considered as
!  * synchronous at that moment will release waiters from the queue.
   *
   * Portions Copyright (c) 2010-2016, PostgreSQL Global Development Group
   *
--- 30,63 ----
   * searching the through all waiters each time we receive a reply.
   *
   * In 9.5 or before only a single standby could be considered as
!  * synchronous. In 9.6 we support a priority-based multiple synchronous
!  * standbys. In 10.0 a quorum-based multiple synchronous standbys is also
!  * supported. The number of synchronous standbys that transactions
!  * must wait for replies from is specified in synchronous_standby_names.
!  * This parameter also specifies a list of standby names and the method
!  * (FIRST and ANY) to choose synchronous standbys from the listed ones.
!  * 
!  * The method FIRST specifies a priority-based synchronous replication
!  * and makes transaction commits wait until their WAL records are
!  * replicated to the requested number of synchronous standbys chosen based
!  * on their priorities. The standbys whose names appear earlier in the list
!  * are given higher priority and will be considered as synchronous.
!  * Other standby servers appearing later in this list represent potential
!  * synchronous standbys. If any of the current synchronous standbys
!  * disconnects for whatever reason, it will be replaced immediately with
!  * the next-highest-priority standby.
!  *
!  * The method ANY specifies a quorum-based synchronous replication
!  * and makes transaction commits wait until their WAL records are
!  * replicated to at least the requested number of synchronous standbys
!  * in the list. All the standbys appearing in the list are considered as
!  * candidates for quorum synchronous standbys.
   *
   * Before the standbys chosen from synchronous_standby_names can
   * become the synchronous standbys they must have caught up with
   * the primary; that may take some time. Once caught up,
!  * the standbys which are considered as synchronous at that moment
!  * will release waiters from the queue.
   *
   * Portions Copyright (c) 2010-2016, PostgreSQL Global Development Group
   *
***************
*** 79,96 **** char	   *SyncRepStandbyNames;
  
  static bool announce_next_takeover = true;
  
! static SyncRepConfigData *SyncRepConfig = NULL;
  static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
  
  static void SyncRepQueueInsert(int mode);
  static void SyncRepCancelWait(void);
  static int	SyncRepWakeQueue(bool all, int mode);
  
! static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
! 						   XLogRecPtr *flushPtr,
! 						   XLogRecPtr *applyPtr,
! 						   bool *am_sync);
  static int	SyncRepGetStandbyPriority(void);
  
  #ifdef USE_ASSERT_CHECKING
  static bool SyncRepQueueIsOrderedByLSN(int mode);
--- 90,118 ----
  
  static bool announce_next_takeover = true;
  
! SyncRepConfigData *SyncRepConfig = NULL;
  static int	SyncRepWaitMode = SYNC_REP_NO_WAIT;
  
  static void SyncRepQueueInsert(int mode);
  static void SyncRepCancelWait(void);
  static int	SyncRepWakeQueue(bool all, int mode);
  
! static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
! 								 XLogRecPtr *flushPtr,
! 								 XLogRecPtr *applyPtr,
! 								 bool *am_sync);
! static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
! 									   XLogRecPtr *flushPtr,
! 									   XLogRecPtr *applyPtr,
! 									   List *sync_standbys);
! static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
! 										  XLogRecPtr *flushPtr,
! 										  XLogRecPtr *applyPtr,
! 										  List *sync_standbys, uint8 nth);
  static int	SyncRepGetStandbyPriority(void);
+ static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
+ static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+ static int	cmp_lsn(const void *a, const void *b);
  
  #ifdef USE_ASSERT_CHECKING
  static bool SyncRepQueueIsOrderedByLSN(int mode);
***************
*** 386,392 **** SyncRepReleaseWaiters(void)
  	XLogRecPtr	writePtr;
  	XLogRecPtr	flushPtr;
  	XLogRecPtr	applyPtr;
! 	bool		got_oldest;
  	bool		am_sync;
  	int			numwrite = 0;
  	int			numflush = 0;
--- 408,414 ----
  	XLogRecPtr	writePtr;
  	XLogRecPtr	flushPtr;
  	XLogRecPtr	applyPtr;
! 	bool		got_recptr;
  	bool		am_sync;
  	int			numwrite = 0;
  	int			numflush = 0;
***************
*** 413,423 **** SyncRepReleaseWaiters(void)
  	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
  
  	/*
! 	 * Check whether we are a sync standby or not, and calculate the oldest
  	 * positions among all sync standbys.
  	 */
! 	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
! 											&applyPtr, &am_sync);
  
  	/*
  	 * If we are managing a sync standby, though we weren't prior to this,
--- 435,444 ----
  	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
  
  	/*
! 	 * Check whether we are a sync standby or not, and calculate the synced
  	 * positions among all sync standbys.
  	 */
! 	got_recptr = SyncRepGetSyncRecPtr(&writePtr, &flushPtr, &applyPtr, &am_sync);
  
  	/*
  	 * If we are managing a sync standby, though we weren't prior to this,
***************
*** 426,441 **** SyncRepReleaseWaiters(void)
  	if (announce_next_takeover && am_sync)
  	{
  		announce_next_takeover = false;
! 		ereport(LOG,
! 				(errmsg("standby \"%s\" is now a synchronous standby with priority %u",
! 						application_name, MyWalSnd->sync_standby_priority)));
  	}
  
  	/*
  	 * If the number of sync standbys is less than requested or we aren't
  	 * managing a sync standby then just leave.
  	 */
! 	if (!got_oldest || !am_sync)
  	{
  		LWLockRelease(SyncRepLock);
  		announce_next_takeover = !am_sync;
--- 447,468 ----
  	if (announce_next_takeover && am_sync)
  	{
  		announce_next_takeover = false;
! 
! 		if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY)
! 			ereport(LOG,
! 					(errmsg("standby \"%s\" is now a synchronous standby with priority %u",
! 							application_name, MyWalSnd->sync_standby_priority)));
! 		else
! 			ereport(LOG,
! 					(errmsg("standby \"%s\" is now a candidate for quorum synchronous standby",
! 							application_name)));
  	}
  
  	/*
  	 * If the number of sync standbys is less than requested or we aren't
  	 * managing a sync standby then just leave.
  	 */
! 	if (!got_recptr || !am_sync)
  	{
  		LWLockRelease(SyncRepLock);
  		announce_next_takeover = !am_sync;
***************
*** 471,491 **** SyncRepReleaseWaiters(void)
  }
  
  /*
!  * Calculate the oldest Write, Flush and Apply positions among sync standbys.
   *
   * Return false if the number of sync standbys is less than
   * synchronous_standby_names specifies. Otherwise return true and
!  * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
   *
   * On return, *am_sync is set to true if this walsender is connecting to
   * sync standby. Otherwise it's set to false.
   */
  static bool
! SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
  						   XLogRecPtr *applyPtr, bool *am_sync)
  {
  	List	   *sync_standbys;
- 	ListCell   *cell;
  
  	*writePtr = InvalidXLogRecPtr;
  	*flushPtr = InvalidXLogRecPtr;
--- 498,517 ----
  }
  
  /*
!  * Calculate the synced Write, Flush and Apply positions among sync standbys.
   *
   * Return false if the number of sync standbys is less than
   * synchronous_standby_names specifies. Otherwise return true and
!  * store the positions into *writePtr, *flushPtr and *applyPtr.
   *
   * On return, *am_sync is set to true if this walsender is connecting to
   * sync standby. Otherwise it's set to false.
   */
  static bool
! SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
  						   XLogRecPtr *applyPtr, bool *am_sync)
  {
  	List	   *sync_standbys;
  
  	*writePtr = InvalidXLogRecPtr;
  	*flushPtr = InvalidXLogRecPtr;
***************
*** 508,519 **** SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
  	}
  
  	/*
! 	 * Scan through all sync standbys and calculate the oldest Write, Flush
! 	 * and Apply positions.
  	 */
! 	foreach(cell, sync_standbys)
  	{
! 		WalSnd	   *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
  		XLogRecPtr	write;
  		XLogRecPtr	flush;
  		XLogRecPtr	apply;
--- 534,582 ----
  	}
  
  	/*
! 	 * In a priority-based sync replication, the synced positions are the
! 	 * oldest ones among sync standbys. In a quorum-based, they are the Nth
! 	 * latest ones.
! 	 *
! 	 * SyncRepGetNthLatestSyncRecPtr() also can calculate the oldest positions.
! 	 * But we use SyncRepGetOldestSyncRecPtr() for that calculation because
! 	 * it's a bit more efficient.
! 	 *
! 	 * XXX If the numbers of current and requested sync standbys are the same,
! 	 * we can use SyncRepGetOldestSyncRecPtr() to calculate the synced
! 	 * positions even in a quorum-based sync replication.
! 	 */
! 	if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY)
! 	{
! 		SyncRepGetOldestSyncRecPtr(writePtr, flushPtr, applyPtr,
! 								   sync_standbys);
! 	}
! 	else
! 	{
! 		SyncRepGetNthLatestSyncRecPtr(writePtr, flushPtr, applyPtr,
! 									  sync_standbys, SyncRepConfig->num_sync);
! 	}
! 
! 	list_free(sync_standbys);
! 	return true;
! }
! 
! /*
!  * Calculate the oldest Write, Flush and Apply positions among sync standbys.
!  */
! static void
! SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
! 						   XLogRecPtr *applyPtr, List *sync_standbys)
! {
! 	ListCell	*cell;
! 
! 	/*
! 	 * Scan through all sync standbys and calculate the oldest
! 	 * Write, Flush and Apply positions.
  	 */
! 	foreach (cell, sync_standbys)
  	{
! 		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
  		XLogRecPtr	write;
  		XLogRecPtr	flush;
  		XLogRecPtr	apply;
***************
*** 531,553 **** SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
  		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
  			*applyPtr = apply;
  	}
  
! 	list_free(sync_standbys);
! 	return true;
  }
  
  /*
   * Return the list of sync standbys, or NIL if no sync standby is connected.
   *
-  * If there are multiple standbys with the same priority,
-  * the first one found is selected preferentially.
   * The caller must hold SyncRepLock.
   *
   * On return, *am_sync is set to true if this walsender is connecting to
   * sync standby. Otherwise it's set to false.
   */
  List *
! SyncRepGetSyncStandbys(bool *am_sync)
  {
  	List	   *result = NIL;
  	List	   *pending = NIL;
--- 594,756 ----
  		if (XLogRecPtrIsInvalid(*applyPtr) || *applyPtr > apply)
  			*applyPtr = apply;
  	}
+ }
  
! /*
!  * Calculate the Nth latest Write, Flush and Apply positions among sync
!  * standbys.
!  */
! static void
! SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
! 						  XLogRecPtr *applyPtr, List *sync_standbys, uint8 nth)
! {
! 	ListCell	*cell;
! 	XLogRecPtr	*write_array;
! 	XLogRecPtr	*flush_array;
! 	XLogRecPtr	*apply_array;
! 	int	len;
! 	int	i = 0;
! 
! 	len = list_length(sync_standbys);
! 	write_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
! 	flush_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
! 	apply_array = (XLogRecPtr *) palloc(sizeof(XLogRecPtr) * len);
! 
! 	foreach (cell, sync_standbys)
! 	{
! 		WalSnd *walsnd = &WalSndCtl->walsnds[lfirst_int(cell)];
! 
! 		SpinLockAcquire(&walsnd->mutex);
! 		write_array[i] = walsnd->write;
! 		flush_array[i] = walsnd->flush;
! 		apply_array[i] = walsnd->apply;
! 		SpinLockRelease(&walsnd->mutex);
! 
! 		i++;
! 	}
! 
! 	qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
! 	qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
! 	qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
! 
! 	/* Get Nth latest Write, Flush, Apply positions */
! 	*writePtr = write_array[nth - 1];
! 	*flushPtr = flush_array[nth - 1];
! 	*applyPtr = apply_array[nth - 1];
! 
! 	pfree(write_array);
! 	pfree(flush_array);
! 	pfree(apply_array);
! }
! 
! /*
!  * Compare lsn in order to sort array in descending order.
!  */
! static int
! cmp_lsn(const void *a, const void *b)
! {
! 	XLogRecPtr lsn1 = *((const XLogRecPtr *) a);
! 	XLogRecPtr lsn2 = *((const XLogRecPtr *) b);
! 
! 	if (lsn1 > lsn2)
! 		return -1;
! 	else if (lsn1 == lsn2)
! 		return 0;
! 	else
! 		return 1;
  }
  
  /*
   * Return the list of sync standbys, or NIL if no sync standby is connected.
   *
   * The caller must hold SyncRepLock.
   *
   * On return, *am_sync is set to true if this walsender is connecting to
   * sync standby. Otherwise it's set to false.
   */
  List *
! SyncRepGetSyncStandbys(bool	*am_sync)
! {
! 	/* Set default result */
! 	if (am_sync != NULL)
! 		*am_sync = false;
! 
! 	/* Quick exit if sync replication is not requested */
! 	if (SyncRepConfig == NULL)
! 		return NIL;
! 
! 	return (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY) ?
! 		SyncRepGetSyncStandbysPriority(am_sync) :
! 		SyncRepGetSyncStandbysQuorum(am_sync);
! }
! 
! /*
!  * Return the list of all the candidates for quorum sync standbys,
!  * or NIL if no such standby is connected.
!  *
!  * The caller must hold SyncRepLock. This function must be called only in
!  * a quorum-based sync replication.
!  *
!  * On return, *am_sync is set to true if this walsender is connecting to
!  * sync standby. Otherwise it's set to false.
!  */
! static List *
! SyncRepGetSyncStandbysQuorum(bool *am_sync)
! {
! 	List	*result = NIL;
! 	int i;
! 	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
! 								 * rearrangement */
! 
! 	Assert(SyncRepConfig->syncrep_method == SYNC_REP_QUORUM);
! 
! 	for (i = 0; i < max_wal_senders; i++)
! 	{
! 		walsnd = &WalSndCtl->walsnds[i];
! 
! 		/* Must be active */
! 		if (walsnd->pid == 0)
! 			continue;
! 
! 		/* Must be streaming */
! 		if (walsnd->state != WALSNDSTATE_STREAMING)
! 			continue;
! 
! 		/* Must be synchronous */
! 		if (walsnd->sync_standby_priority == 0)
! 			continue;
! 
! 		/* Must have a valid flush position */
! 		if (XLogRecPtrIsInvalid(walsnd->flush))
! 			continue;
! 
! 		/*
! 		 * Consider this standby as a candidate for quorum sync standbys
! 		 * and append it to the result.
! 		 */
! 		result = lappend_int(result, i);
! 		if (am_sync != NULL && walsnd == MyWalSnd)
! 			*am_sync = true;
! 	}
! 
! 	return result;
! }
! 
! /*
!  * Return the list of sync standbys chosen based on their priorities,
!  * or NIL if no sync standby is connected.
!  *
!  * If there are multiple standbys with the same priority,
!  * the first one found is selected preferentially.
!  *
!  * The caller must hold SyncRepLock. This function must be called only in
!  * a priority-based sync replication.
!  *
!  * On return, *am_sync is set to true if this walsender is connecting to
!  * sync standby. Otherwise it's set to false.
!  */
! static List *
! SyncRepGetSyncStandbysPriority(bool *am_sync)
  {
  	List	   *result = NIL;
  	List	   *pending = NIL;
***************
*** 560,572 **** SyncRepGetSyncStandbys(bool *am_sync)
  	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
  								 * rearrangement */
  
! 	/* Set default result */
! 	if (am_sync != NULL)
! 		*am_sync = false;
! 
! 	/* Quick exit if sync replication is not requested */
! 	if (SyncRepConfig == NULL)
! 		return NIL;
  
  	lowest_priority = SyncRepConfig->nmembers;
  	next_highest_priority = lowest_priority + 1;
--- 763,769 ----
  	volatile WalSnd *walsnd;	/* Use volatile pointer to prevent code
  								 * rearrangement */
  
! 	Assert(SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY);
  
  	lowest_priority = SyncRepConfig->nmembers;
  	next_highest_priority = lowest_priority + 1;
*** a/src/backend/replication/syncrep_gram.y
--- b/src/backend/replication/syncrep_gram.y
***************
*** 21,27 **** SyncRepConfigData *syncrep_parse_result;
  char	   *syncrep_parse_error_msg;
  
  static SyncRepConfigData *create_syncrep_config(const char *num_sync,
! 					  List *members);
  
  /*
   * Bison doesn't allocate anything that needs to live across parser calls,
--- 21,27 ----
  char	   *syncrep_parse_error_msg;
  
  static SyncRepConfigData *create_syncrep_config(const char *num_sync,
! 					List *members, uint8 syncrep_method);
  
  /*
   * Bison doesn't allocate anything that needs to live across parser calls,
***************
*** 46,52 **** static SyncRepConfigData *create_syncrep_config(const char *num_sync,
  	SyncRepConfigData *config;
  }
  
! %token <str> NAME NUM JUNK
  
  %type <config> result standby_config
  %type <list> standby_list
--- 46,52 ----
  	SyncRepConfigData *config;
  }
  
! %token <str> NAME NUM JUNK ANY FIRST
  
  %type <config> result standby_config
  %type <list> standby_list
***************
*** 60,67 **** result:
  	;
  
  standby_config:
! 		standby_list				{ $$ = create_syncrep_config("1", $1); }
! 		| NUM '(' standby_list ')'	{ $$ = create_syncrep_config($1, $3); }
  	;
  
  standby_list:
--- 60,69 ----
  	;
  
  standby_config:
! 		standby_list				{ $$ = create_syncrep_config("1", $1, SYNC_REP_PRIORITY); }
! 		| NUM '(' standby_list ')'		{ $$ = create_syncrep_config($1, $3, SYNC_REP_PRIORITY); }
! 		| ANY NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_QUORUM); }
! 		| FIRST NUM '(' standby_list ')'		{ $$ = create_syncrep_config($2, $4, SYNC_REP_PRIORITY); }
  	;
  
  standby_list:
***************
*** 75,83 **** standby_name:
  	;
  %%
  
- 
  static SyncRepConfigData *
! create_syncrep_config(const char *num_sync, List *members)
  {
  	SyncRepConfigData *config;
  	int			size;
--- 77,84 ----
  	;
  %%
  
  static SyncRepConfigData *
! create_syncrep_config(const char *num_sync, List *members, uint8 syncrep_method)
  {
  	SyncRepConfigData *config;
  	int			size;
***************
*** 98,103 **** create_syncrep_config(const char *num_sync, List *members)
--- 99,105 ----
  
  	config->config_size = size;
  	config->num_sync = atoi(num_sync);
+ 	config->syncrep_method = syncrep_method;
  	config->nmembers = list_length(members);
  	ptr = config->member_names;
  	foreach(lc, members)
*** a/src/backend/replication/syncrep_scanner.l
--- b/src/backend/replication/syncrep_scanner.l
***************
*** 64,69 **** xdinside		[^"]+
--- 64,72 ----
  %%
  {space}+	{ /* ignore */ }
  
+ ANY		{ return ANY; }
+ FIRST		{ return FIRST; }
+ 
  {xdstart}	{
  				initStringInfo(&xdbuf);
  				BEGIN(xd);
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 2868,2879 **** pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
  
  			/*
  			 * More easily understood version of standby state. This is purely
! 			 * informational, not different from priority.
  			 */
  			if (priority == 0)
  				values[7] = CStringGetTextDatum("async");
  			else if (list_member_int(sync_standbys, i))
! 				values[7] = CStringGetTextDatum("sync");
  			else
  				values[7] = CStringGetTextDatum("potential");
  		}
--- 2868,2887 ----
  
  			/*
  			 * More easily understood version of standby state. This is purely
! 			 * informational.
! 			 *
! 			 * In quorum-based sync replication, the role of each standby
! 			 * listed in synchronous_standby_names can be changing very
! 			 * frequently. Any standbys considered as "sync" at one moment can
! 			 * be switched to "potential" ones at the next moment. So, it's
! 			 * basically useless to report "sync" or "potential" as their sync
! 			 * states. We report just "quorum" for them.
  			 */
  			if (priority == 0)
  				values[7] = CStringGetTextDatum("async");
  			else if (list_member_int(sync_standbys, i))
! 				values[7] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
! 					CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
  			else
  				values[7] = CStringGetTextDatum("potential");
  		}
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 245,251 ****
  # These settings are ignored on a standby server.
  
  #synchronous_standby_names = ''	# standby servers that provide sync rep
! 				# number of sync standbys and comma-separated list of application_name
  				# from standby(s); '*' = all
  #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
  
--- 245,252 ----
  # These settings are ignored on a standby server.
  
  #synchronous_standby_names = ''	# standby servers that provide sync rep
! 				# method to choose sync standbys, number of sync standbys
! 				# and comma-separated list of application_name
  				# from standby(s); '*' = all
  #vacuum_defer_cleanup_age = 0	# number of xacts by which cleanup is delayed
  
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 32,37 ****
--- 32,41 ----
  #define SYNC_REP_WAITING			1
  #define SYNC_REP_WAIT_COMPLETE		2
  
+ /* syncrep_method of SyncRepConfigData */
+ #define SYNC_REP_PRIORITY		0
+ #define SYNC_REP_QUORUM		1
+ 
  /*
   * Struct for the configuration of synchronous replication.
   *
***************
*** 44,54 **** typedef struct SyncRepConfigData
--- 48,61 ----
  	int			config_size;	/* total size of this struct, in bytes */
  	int			num_sync;		/* number of sync standbys that we need to
  								 * wait for */
+ 	uint8		syncrep_method;	/* method to choose sync standbys */
  	int			nmembers;		/* number of members in the following list */
  	/* member_names contains nmembers consecutive nul-terminated C strings */
  	char		member_names[FLEXIBLE_ARRAY_MEMBER];
  } SyncRepConfigData;
  
+ extern SyncRepConfigData *SyncRepConfig;
+ 
  /* communication variables for parsing synchronous_standby_names GUC */
  extern SyncRepConfigData *syncrep_parse_result;
  extern char *syncrep_parse_error_msg;
*** a/src/test/recovery/t/007_sync_rep.pl
--- b/src/test/recovery/t/007_sync_rep.pl
***************
*** 3,9 **** use strict;
  use warnings;
  use PostgresNode;
  use TestLib;
! use Test::More tests => 8;
  
  # Query checking sync_priority and sync_state of each standby
  my $check_sql =
--- 3,9 ----
  use warnings;
  use PostgresNode;
  use TestLib;
! use Test::More tests => 11;
  
  # Query checking sync_priority and sync_state of each standby
  my $check_sql =
***************
*** 172,174 **** test_sync_state(
--- 172,205 ----
  standby2|1|sync
  standby4|1|potential),
  	'potential standby found earlier in array is promoted to sync');
+ 
+ # Check that standby1 and standby2 are chosen as sync standbys
+ # based on their priorities.
+ test_sync_state(
+ $node_master, qq(standby1|1|sync
+ standby2|2|sync
+ standby4|0|async),
+ 'priority-based sync replication specified by FIRST keyword',
+ 'FIRST 2(standby1, standby2)');
+ 
+ # Check that all the listed standbys are considered as candidates
+ # for sync standbys in a quorum-based sync replication.
+ test_sync_state(
+ $node_master, qq(standby1|1|quorum
+ standby2|2|quorum
+ standby4|0|async),
+ '2 quorum and 1 async',
+ 'ANY 2(standby1, standby2)');
+ 
+ # Start Standby3 which will be considered in 'quorum' state.
+ $node_standby_3->start;
+ 
+ # Check that the setting of 'ANY 2(*)' chooses all standbys as
+ # candidates for quorum sync standbys.
+ test_sync_state(
+ $node_master, qq(standby1|1|quorum
+ standby2|1|quorum
+ standby3|1|quorum
+ standby4|1|quorum),
+ 'all standbys are considered as candidates for quorum sync standbys',
+ 'ANY 2(*)');
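
For readers following the patch above, here is a minimal standalone sketch of the calculation that SyncRepGetOldestSyncRecPtr() and SyncRepGetNthLatestSyncRecPtr() perform. It assumes plain 64-bit integers in place of XLogRecPtr and a plain array in place of the walsender list. The names lsn_t, cmp_lsn_desc, nth_latest_lsn and oldest_lsn are illustrative only and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's XLogRecPtr (assumption: a 64-bit WAL position). */
typedef uint64_t lsn_t;

/* qsort comparator that orders LSNs from latest to oldest, like cmp_lsn(). */
static int
cmp_lsn_desc(const void *a, const void *b)
{
    lsn_t lsn1 = *(const lsn_t *) a;
    lsn_t lsn2 = *(const lsn_t *) b;

    if (lsn1 > lsn2)
        return -1;
    if (lsn1 < lsn2)
        return 1;
    return 0;
}

/*
 * Quorum method: the position already confirmed by at least nth standbys
 * is the nth largest reported LSN. Sorts the caller's array in place.
 */
static lsn_t
nth_latest_lsn(lsn_t *lsns, size_t len, size_t nth)
{
    /* Assumption: the caller has checked that enough standbys are connected. */
    assert(nth >= 1 && nth <= len);
    qsort(lsns, len, sizeof(lsn_t), cmp_lsn_desc);
    return lsns[nth - 1];
}

/*
 * Priority method: every chosen sync standby must confirm, so the synced
 * position is the oldest (smallest) reported LSN.
 */
static lsn_t
oldest_lsn(const lsn_t *lsns, size_t len)
{
    lsn_t  oldest;
    size_t i;

    assert(len >= 1);
    oldest = lsns[0];
    for (i = 1; i < len; i++)
    {
        if (lsns[i] < oldest)
            oldest = lsns[i];
    }
    return oldest;
}
```

For example, with flush positions {300, 100, 400, 200}, oldest_lsn() gives 100, while nth_latest_lsn() with nth = 2 (as in 'ANY 2 (...)') gives 300, because two standbys have already flushed up to at least 300.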

#65

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Fujii Masao (#64)

Re: Quorum commit for multiple synchronous replication.

On Fri, Dec 16, 2016 at 10:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Attached is the modified version of the patch. Barring objections, I will
commit this version.

There is a whitespace:
$ git diff master --check
src/backend/replication/syncrep.c:39: trailing whitespace.
+ *

Even after committing the patch, there will still be many source comments
and documentation that we need to update, for example,
in high-availability.sgml. We need to check and update them thoroughly later.

The current patch is complicated enough, so that's fine with me. I
checked the patch one last time and it looks good.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#66

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Michael Paquier (#65)

Re: Quorum commit for multiple synchronous replication.

On Sun, Dec 18, 2016 at 9:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Dec 16, 2016 at 10:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Attached is the modified version of the patch. Barring objections, I will
commit this version.

There is a whitespace:
$ git diff master --check
src/backend/replication/syncrep.c:39: trailing whitespace.
+ *

Okay, pushed the patch with this fix. Thanks!

Regarding this feature, there are some loose ends. We should work on
and complete them before the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
priority-based sync replication is chosen to keep backward
compatibility. However, some hackers argued for changing this decision
so that quorum commit is chosen, because they think that most users
prefer quorum.

(2)
There will still be many source comments and documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

Any other?
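
For readers following item (1), the three spellings under discussion can be sketched as postgresql.conf settings. This is only an illustration: the standby names s1, s2 and s3 are placeholders, and only one of these lines would be active at a time.

```
# postgresql.conf (s1, s2, s3 are placeholder standby names)
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'  # explicit priority-based
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'    # explicit quorum-based
synchronous_standby_names = '2 (s1, s2, s3)'        # no keyword: currently treated as FIRST
```

The open question is whether the third form should keep meaning FIRST or switch to ANY.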

Regards,

--
Fujii Masao


#67

Alvaro Herrera

alvherre@2ndquadrant.com

about 9 years ago

In reply to: Fujii Masao (#66)

Re: Quorum commit for multiple synchronous replication.

Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

Please list these in https://wiki.postgresql.org/wiki/Open_Items so that we
don't forget.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#68

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Fujii Masao (#66)

Re: Quorum commit for multiple synchronous replication.

On Mon, Dec 19, 2016 at 9:49 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Sun, Dec 18, 2016 at 9:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Dec 16, 2016 at 10:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Attached is the modified version of the patch. Barring objections, I will
commit this version.

There is a whitespace:
$ git diff master --check
src/backend/replication/syncrep.c:39: trailing whitespace.
+ *

Okay, pushed the patch with this fix. Thanks!

Thank you for reviewing and committing!

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer to a quorum.

(2)
There will be still many source comments and documentations that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

Will try to update them.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

Any other?

Do we need to compare the sorting method with the method of
selecting the k-th latest LSN?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#69

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#68)

Re: Quorum commit for multiple synchronous replication.

On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Do we need to consider the sorting method and the selecting k-th
latest LSN method?

Honestly, nah. Tests are showing that there are many more bottlenecks
before that with just memory allocation and parsing.
--
Michael


#70

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Alvaro Herrera (#67)

Re: Quorum commit for multiple synchronous replication.

On Tue, Dec 20, 2016 at 1:44 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

Please list these in https://wiki.postgresql.org/wiki/Open_Items so that we
don't forget.

Yep, added!

Regards,

--
Fujii Masao


#71

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Michael Paquier (#69)

Re: Quorum commit for multiple synchronous replication.

On Tue, Dec 20, 2016 at 2:46 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Do we need to consider the sorting method and the selecting k-th
latest LSN method?

Honestly, nah. Tests are showing that there are many more bottlenecks
before that with just memory allocation and parsing.

I think that it's worth prototyping an alternative algorithm and
measuring the performance of the alternative and current
algorithms. This measurement should check not only the bottleneck
but also how much each algorithm increases the time that backends
need to wait before they receive the ack from the walsender.

If the current algorithm turns out to be "efficient" enough,
we can just leave the code as it is. OTOH, if not, let's adopt
the alternative one.
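
As a point of reference for such a comparison, the sorting approach could look roughly like the following. This is a simplified standalone sketch under stated assumptions, not the code in syncrep.c: `Lsn` stands in for XLogRecPtr, `nth_latest_lsn` is a hypothetical helper name, and 0 plays the role of an invalid LSN.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t Lsn;	/* stand-in for PostgreSQL's XLogRecPtr */

/* qsort comparator: descending order, so the newest LSN comes first */
int
cmp_lsn_desc(const void *a, const void *b)
{
	Lsn		la = *(const Lsn *) a;
	Lsn		lb = *(const Lsn *) b;

	if (la > lb)
		return -1;
	if (la < lb)
		return 1;
	return 0;
}

/*
 * Return the n-th latest LSN (1-based) among the nstandbys reported
 * positions, or 0 if n is out of range. Sorts the array in place.
 */
Lsn
nth_latest_lsn(Lsn *lsns, int nstandbys, int n)
{
	assert(lsns != NULL);

	if (n < 1 || nstandbys < n)
		return 0;

	qsort(lsns, (size_t) nstandbys, sizeof(Lsn), cmp_lsn_desc);
	return lsns[n - 1];
}
```

With flush positions {500, 300, 900, 700} and 'ANY 2', the function returns 700, since two standbys have flushed at least up to that position.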

Regards,

--
Fujii Masao


#72

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

about 9 years ago

In reply to: Fujii Masao (#71)

Re: Quorum commit for multiple synchronous replication.

At Tue, 20 Dec 2016 23:47:22 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFcEhv8BPP0HV2VQ8kXaHQmfN7PFAgkKsPyVip0frizpg@mail.gmail.com>

On Tue, Dec 20, 2016 at 2:46 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Do we need to consider the sorting method and the selecting k-th
latest LSN method?

Honestly, nah. Tests are showing that there are many more bottlenecks
before that with just memory allocation and parsing.

I think that it's worth prototyping alternative algorithm, and
measuring the performances of those alternative and current
algorithms. This measurement should check not only the bottleneck
but also how much each algorithm increases the time that backends
need to wait for before they receive ack from walsender.

If it's reported that current algorithm is "efficient" enough,
we can just leave the code as it is. OTOH, if not, let's adopt
the alternative one.

I'm personally interested in the difference between them, but it
doesn't seem urgently required. If we have nothing else in particular
to do on this, considering other ordering methods would be
valuable.

This is not a well-grounded thought, but maintaining a top-k list
by insertion sort seems more promising than running a top-k
sort on the whole list. Sorting all walsenders is still needed
the first time and in some other situations, though.
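
To make the idea concrete, here is a minimal sketch of keeping only the k latest LSNs by insertion, under stated assumptions. The names `TopK`, `topk_offer` and `MAX_K` are hypothetical, and `Lsn` stands in for XLogRecPtr. It treats the input as a fresh snapshot of positions; keeping the list current as individual standbys advance would need per-standby tracking, which is not shown.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;	/* stand-in for PostgreSQL's XLogRecPtr */

#define MAX_K 8			/* hypothetical upper bound on the quorum size */

typedef struct TopK
{
	int		k;			/* how many of the latest LSNs to track */
	int		count;		/* number of valid entries, at most k */
	Lsn		lsn[MAX_K];	/* kept in descending order */
} TopK;

void
topk_init(TopK *t, int k)
{
	assert(k >= 1 && k <= MAX_K);
	t->k = k;
	t->count = 0;
}

/* Offer one LSN; keep only the k largest seen so far, newest first. */
void
topk_offer(TopK *t, Lsn v)
{
	int		i;

	/* List is full and v is not newer than the current k-th entry. */
	if (t->count == t->k && v <= t->lsn[t->k - 1])
		return;

	/* Use the next free slot, or overwrite the smallest entry if full. */
	i = (t->count < t->k) ? t->count++ : t->k - 1;

	/* Shift smaller entries down by one to make room for v. */
	while (i > 0 && t->lsn[i - 1] < v)
	{
		t->lsn[i] = t->lsn[i - 1];
		i--;
	}
	t->lsn[i] = v;
}

/* The k-th latest LSN, or 0 if fewer than k values have been offered. */
Lsn
topk_kth(const TopK *t)
{
	return (t->count == t->k) ? t->lsn[t->k - 1] : 0;
}
```

Each offer costs at most k comparisons, so one pass over all walsenders is O(n * k) and needs no sort of the whole list. Whether that is a measurable gain for realistic standby counts is exactly what the proposed measurement would have to show.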

By the way, do we continue dispu^h^hcussing on the format of
s_s_names and/or a successor right now?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#73

Fujii Masao

masao.fujii@gmail.com

about 9 years ago

In reply to: Kyotaro HORIGUCHI (#72)

Re: Quorum commit for multiple synchronous replication.

On Wed, Dec 21, 2016 at 10:39 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Tue, 20 Dec 2016 23:47:22 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFcEhv8BPP0HV2VQ8kXaHQmfN7PFAgkKsPyVip0frizpg@mail.gmail.com>

On Tue, Dec 20, 2016 at 2:46 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Tue, Dec 20, 2016 at 2:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Do we need to consider the sorting method and the selecting k-th
latest LSN method?

Honestly, nah. Tests are showing that there are many more bottlenecks
before that with just memory allocation and parsing.

I think that it's worth prototyping alternative algorithm, and
measuring the performances of those alternative and current
algorithms. This measurement should check not only the bottleneck
but also how much each algorithm increases the time that backends
need to wait for before they receive ack from walsender.

If it's reported that current algorithm is "efficient" enough,
we can just leave the code as it is. OTOH, if not, let's adopt
the alternative one.

I'm personally interested in the difference of them but it
doesn't seem urgently required.

Yes, it's not an urgent task.

If we have nothing particular to
do with this, considering other ordering method would be
valuable.

By a not-well-grounded thought though, maintaining top-kth list
by insertion sort would be promising rather than running top-kth
sorting on the whole list. Sorting on all walsenders is needed
for the first time and some other situation though.

By the way, do we continue dispu^h^hcussing on the format of
s_s_names and/or a successor right now?

Yes. If there is a better approach, we should discuss that.

Regards,

--
Fujii Masao


#74

Noah Misch

noah@leadboat.com

almost 9 years ago

In reply to: Fujii Masao (#66)

Re: Quorum commit for multiple synchronous replication.

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer to a quorum.

(2)
There will be still many source comments and documentations that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


#75

Fujii Masao

masao.fujii@gmail.com

almost 9 years ago

In reply to: Noah Misch (#74)

Re: Quorum commit for multiple synchronous replication.

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer to a quorum.

(2)
There will be still many source comments and documentations that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding item (2), Sawada-san told me that he will work on it after
this CommitFest finishes, so we should receive the patch for the item from
him next week. If there is no patch by the end of next week
(i.e., April 14th), I will work on it myself. Let's wait for Sawada-san's
action first.

Items (1) and (3) are not bugs, so I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features, including quorum-based syncrep.
If many of them then complain about (1) and (3), we can change the code
at that point. So we need to give users more time to try the feature.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Regards,

--
Fujii Masao


#76

Noah Misch

noah@leadboat.com

almost 9 years ago

In reply to: Fujii Masao (#75)

Re: Quorum commit for multiple synchronous replication.

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer to a quorum.

(2)
There will be still many source comments and documentations that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.


#77

Petr Jelinek

petr.jelinek@2ndquadrant.com

almost 9 years ago

In reply to: Noah Misch (#76)

Re: Quorum commit for multiple synchronous replication.

On 06/04/17 03:51, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer to a quorum.

(2)
There will be still many source comments and documentations that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

I was one of the people who said in the original thread that the
default behavior should change to quorum, and I am still of that opinion.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#78

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Noah Misch (#76)

Re: Quorum commit for multiple synchronous replication.

On Thu, Apr 6, 2017 at 10:51 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer to a quorum.

(2)
There will be still many source comments and documentations that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

I agree with reporting NULL as the priority. I'll send a patch for this as well.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#79

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#78)

2 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Thu, Apr 6, 2017 at 4:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 10:51 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer to a quorum.

(2)
There will be still many source comments and documentations that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

I agree to report NULL as the priority. I'll send a patch for this as well.

Regards,

Attached are two draft patches. The first one makes
pg_stat_replication.sync_priority report NULL in quorum-based sync
replication. To avoid extra changes, I have not changed the code that
sets the standby priority so far. The other one improves the comments
and documentation. If there is anything more that we need to mention in
the documentation, please give me feedback.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

report_null_as_sync_priority.patch (application/octet-stream)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d42a461..fe511b5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1749,7 +1749,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
      <entry><type>integer</></entry>
      <entry>Priority of this standby server for being chosen as the
       synchronous standby in a priority-based synchronous replication.
-      This has no effect in a quorum-based synchronous replication.</entry>
+      This value is NULL if in a quorum-based synchronous replication.</entry>
     </row>
     <row>
      <entry><structfield>sync_state</></entry>
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dbb10c7..b57244b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3127,7 +3127,15 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			else
 				values[8] = IntervalPGetDatum(offset_to_interval(applyLag));
 
-			values[9] = Int32GetDatum(priority);
+			/*
+			 * The priority is reported as NULL because it is not used in
+			 * quorum-based sync replication.
+			 */
+			if (SyncRepConfig &&
+				SyncRepConfig->syncrep_method == SYNC_REP_QUORUM)
+				nulls[9] = true;
+			else
+				values[9] = Int32GetDatum(priority);
 
 			/*
 			 * More easily understood version of standby state. This is purely
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
index e11b428..f29b4db 100644
--- a/src/test/recovery/t/007_sync_rep.pl
+++ b/src/test/recovery/t/007_sync_rep.pl
@@ -185,9 +185,9 @@ standby4|0|async),
 # Check that all the listed standbys are considered as candidates
 # for sync standbys in a quorum-based sync replication.
 test_sync_state(
-$node_master, qq(standby1|1|quorum
-standby2|2|quorum
-standby4|0|async),
+$node_master, qq(standby1||quorum
+standby2||quorum
+standby4||async),
 '2 quorum and 1 async',
 'ANY 2(standby1, standby2)');
 
@@ -197,9 +197,9 @@ $node_standby_3->start;
 # Check that the setting of 'ANY 2(*)' chooses all standbys as
 # candidates for quorum sync standbys.
 test_sync_state(
-$node_master, qq(standby1|1|quorum
-standby2|1|quorum
-standby3|1|quorum
-standby4|1|quorum),
+$node_master, qq(standby1||quorum
+standby2||quorum
+standby3||quorum
+standby4||quorum),
 'all standbys are considered as candidates for quorum sync standbys',
 'ANY 2(*)');

quorum_repl_doc_improve.patch (application/octet-stream)

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 51359d6..44fc1ee 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1202,6 +1202,21 @@ synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
    </para>
 
    <para>
+    In term of performance there is difference between two synchronous
+    replication method. Generally quorum-based synchronous replication
+    tends to be higher performance than priority-based synchronous
+    replication. Because in quorum-based synchronous replication, the
+    transaction can resume as soon as received the specified number of
+    acknowledgement from synchronous standby servers without distinction
+    of standby servers. On the other hand in priority-based synchronous
+    replication, the standby server that the primary server must wait for
+    is fixed until a synchronous standby fails. Therefore, if a server on
+    low-performance machine a has high priority and is chosen as a
+    synchronous standby server it can reduce performance for database
+    applications.
+   </para>
+   
+   <para>
     <productname>PostgreSQL</> allows the application developer
     to specify the durability level required via replication. This can be
     specified for the system overall, though it can also be specified for
@@ -1246,12 +1261,22 @@ synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
     The best solution for high availability is to ensure you keep as many
     synchronous standbys as requested. This can be achieved by naming multiple
     potential synchronous standbys using <varname>synchronous_standby_names</>.
-    The standbys whose names appear earlier in the list will be used as
-    synchronous standbys. Standbys listed after these will take over
-    the role of synchronous standby if one of current ones should fail.
+    For example in priority-based synchronous replication, the standbys whose
+    names appear earlier in the list will be used as synchronous standbys,
+    as described in <xref linkend="synchronous-replication-multiple-standbys">.
+    Standbys listed after these will take over the role of synchronous standby
+    if one of current ones should fail.
    </para>
 
    <para>
+    Whichever the synchronous replication method you choose, there is no
+    difference between two synchronous replication method, priority-based and
+    quorum-based, in term of high availability. Because in both replication
+    method the transaction can be proceeded as long as at least the specified
+    number of synchronous standby is running.
+  </para>
+
+   <para>
     When a standby first attaches to the primary, it will not yet be properly
     synchronized. This is described as <literal>catchup</> mode. Once
     the lag between standby and primary reaches zero for the first time
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 20a1441..8fba28f 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -53,6 +53,9 @@
  * in the list. All the standbys appearing in the list are considered as
  * candidates for quorum synchronous standbys.
  *
+ * The method is optional. When neither FIRST nor ANY is specified in
+ * synchronous_standby_names it's equivalent to specifying FIRST.
+ *
  * Before the standbys chosen from synchronous_standby_names can
  * become the synchronous standbys they must have caught up with
  * the primary; that may take some time. Once caught up,
@@ -385,6 +388,11 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
+		/*
+		 * Update priority of this WalSender, but note that in
+		 * quroum-based sync replication, the value of
+		 * sync_standby_priority has no effect.
+		 */
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 		MyWalSnd->sync_standby_priority = priority;
 		LWLockRelease(SyncRepLock);
@@ -599,6 +607,10 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 /*
  * Calculate the Nth latest Write, Flush and Apply positions among sync
  * standbys.
+ *
+ * XXX it costs O(n log n) but since we suppose the n is not large,
+ * maybe less than 10 in most cases, we can optimize it by another
+ * sorting algorithm.
  */
 static void
 SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
@@ -629,6 +641,7 @@ SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 		i++;
 	}
 
+	/* Sort each array in descending order */
 	qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
 	qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
 	qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
@@ -688,6 +701,10 @@ SyncRepGetSyncStandbys(bool	*am_sync)
  * Return the list of all the candidates for quorum sync standbys,
  * or NIL if no such standby is connected.
  *
+ * In quorum-based sync replication we select the quorum sync
+ * standby without theirs priority. The all running active standbys
+ * are considered as a candidate for quorum sync standbys
+ *
  * The caller must hold SyncRepLock. This function must be called only in
  * a quorum-based sync replication.
  *
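
The comments added to SyncRepGetNthLatestSyncRecPtr above describe sorting each position array in descending order and then picking the Nth entry. A minimal standalone sketch of that idea follows. It assumes a plain uint64_t in place of XLogRecPtr, and the names Lsn, cmp_lsn_desc and nth_latest_lsn are hypothetical, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: a 64-bit integer stands in for PostgreSQL's XLogRecPtr. */
typedef uint64_t Lsn;

/* qsort comparator ordering LSNs from newest (largest) to oldest. */
static int
cmp_lsn_desc(const void *a, const void *b)
{
	Lsn			la = *(const Lsn *) a;
	Lsn			lb = *(const Lsn *) b;

	if (la > lb)
		return -1;
	if (la < lb)
		return 1;
	return 0;
}

/*
 * Hypothetical helper: return the Nth latest position among len standbys.
 * At least n standbys have reached the returned LSN, which is the point
 * up to which an 'ANY n (...)' quorum could release waiters.
 * Sorts the caller's array in place.
 */
Lsn
nth_latest_lsn(Lsn *positions, size_t len, size_t n)
{
	assert(n >= 1 && n <= len);
	qsort(positions, len, sizeof(Lsn), cmp_lsn_desc);
	return positions[n - 1];
}
```

With only a handful of standbys, the O(n log n) sort noted in the XXX comment is unlikely to matter much in practice.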

#80

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 9 years ago

In reply to: Masahiko Sawada (#78)

Re: Quorum commit for multiple synchronous replication.

Hello,

At Thu, 6 Apr 2017 16:17:31 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCcEsjt8t4TWW5oE3g=nu2oMFAiM47YeynpKJMoMdeEPA@mail.gmail.com>

On Thu, Apr 6, 2017 at 10:51 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
them and resolve them before the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
priority-based sync replication is chosen to keep backward
compatibility. However, some hackers argued for changing this decision
so that quorum commit is chosen, because they think that most users
prefer quorum.

(2)
There are still many source comments and documentation passages that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we should receive the patch for that item from
him next week. If there is no patch by the end of next week
(i.e., April 14th), I will work on it myself. Let's wait for Sawada-san's action first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

Items (1) and (3) are not bugs, so I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try out the new features, including quorum-based syncrep.
If many of them then complain about (1) and (3), we can change the code
at that point. So we need to give users more time to try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

I agree with reporting NULL as the priority. I'll send a patch for this as well.

In the comment,

+      /*
+       * The priority appers NULL as it is not used in quorum-based
+       * sync replication.
+       */

"appers" should be "appears". But the comment would read better as
something like the following.

"The priority value is useless for quorum-based sync replication" or

"The priority field is NULL for quorum-based sync replication
since the value is useless."

Or something else along those lines.

This part,

+    if (SyncRepConfig &&
+        SyncRepConfig->syncrep_method == SYNC_REP_QUORUM)
+        nulls[9] = true;
+    else
+        values[9] = Int32GetDatum(priority);

I looked at how syncrep_method is used in the code and found that
it is always tested as "== SYNC_REP_PRIORITY", with everything else
handled in the else branch. It doesn't matter at present since there
are only two alternatives for the variable, but the inconsistency can
become problematic when a third alternative comes in.

In addition, SyncRepConfig is already assumed to be != NULL in the
following part.

pg_stat_get_wal_senders()@master

    if (priority == 0)
        values[10] = CStringGetTextDatum("async");
    else if (list_member_int(sync_standbys, i))
        values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
            CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
    else
        values[10] = CStringGetTextDatum("potential");

So it could be written as follows.

    if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY)
        values[9] = Int32GetDatum(priority);
    else
        nulls[9] = true;
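
To illustrate the suggestion outside the backend, here is a small sketch that tests against the priority method and treats every other method as "no priority to show". SyncMethod, PriorityCell and priority_cell are hypothetical stand-ins for the syncrep_method constants and for one values[]/nulls[] slot. They are assumptions for illustration, not PostgreSQL definitions.

```c
#include <stdbool.h>

/* Assumption: stand-ins for SYNC_REP_PRIORITY and SYNC_REP_QUORUM. */
typedef enum
{
	METHOD_PRIORITY,
	METHOD_QUORUM
} SyncMethod;

/* Hypothetical stand-in for one values[]/nulls[] slot of the view. */
typedef struct
{
	bool		is_null;
	int			value;
} PriorityCell;

/*
 * Only the priority method yields a number; any other method, including
 * one added later, falls through to NULL.
 */
PriorityCell
priority_cell(SyncMethod method, int priority)
{
	PriorityCell cell = {true, 0};

	if (method == METHOD_PRIORITY)
	{
		cell.is_null = false;
		cell.value = priority;
	}
	return cell;
}
```

Written this way, a hypothetical third method would report NULL by default rather than silently inheriting the priority branch.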

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#81

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Kyotaro HORIGUCHI (#80)

Re: Quorum commit for multiple synchronous replication.

On Thu, Apr 13, 2017 at 5:17 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

In the comment,

Thank you for reviewing!

+      /*
+       * The priority appers NULL as it is not used in quorum-based
+       * sync replication.
+       */
"appers" should be "appears". But the comment would read better as
something like the following.

Will fix.

"The priority value is useless for quorum-based sync replication" or

"The priority field is NULL for quorum-based sync replication
since the value is useless."

Or something else along those lines.

I will fix this together with the later part.

This part,
+    if (SyncRepConfig &&
+        SyncRepConfig->syncrep_method == SYNC_REP_QUORUM)
+        nulls[9] = true;
+    else
+        values[9] = Int32GetDatum(priority);
I looked at how syncrep_method is used in the code and found that
it is always tested as "== SYNC_REP_PRIORITY", with everything else
handled in the else branch. It doesn't matter at present since there
are only two alternatives for the variable, but the inconsistency can
become problematic when a third alternative comes in.

Agreed.

In addition, SyncRepConfig is already assumed to be != NULL in the
following part.

pg_stat_get_wal_senders()@master

    if (priority == 0)
        values[10] = CStringGetTextDatum("async");
    else if (list_member_int(sync_standbys, i))
        values[10] = SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY ?
            CStringGetTextDatum("sync") : CStringGetTextDatum("quorum");
    else
        values[10] = CStringGetTextDatum("potential");

So it could be written as follows.

    if (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY)
        values[9] = Int32GetDatum(priority);
    else
        nulls[9] = true;

I don't think we can do that. In the part above, SyncRepConfig is
referenced only when synchronous replication is in use, so we can assume
that SyncRepConfig is not NULL there. Perhaps we should add an assertion there.
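
A rough sketch of the two cases being contrasted: the priority column has to cope with a missing configuration, while the sync_state branch can assert that one exists. SyncConfig is a hypothetical stand-in for SyncRepConfigData, and the rule that a non-zero priority implies a parsed configuration is an assumption taken from the discussion, not verified against the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	METHOD_PRIORITY,
	METHOD_QUORUM
} SyncMethod;

/* Hypothetical stand-in for SyncRepConfigData. */
typedef struct
{
	SyncMethod	method;
} SyncConfig;

/* config may be NULL when no synchronous standbys are configured. */
bool
priority_is_reported(const SyncConfig *config)
{
	if (config == NULL)
		return true;		/* async-only setup: keep showing the number */
	return config->method == METHOD_PRIORITY;
}

const char *
sync_state_label(const SyncConfig *config, int priority, bool is_sync)
{
	if (priority == 0)
		return "async";
	/* Assumption: a non-zero priority implies a parsed configuration. */
	assert(config != NULL);
	if (!is_sync)
		return "potential";
	return config->method == METHOD_PRIORITY ? "sync" : "quorum";
}
```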

I'll send an updated patch tomorrow.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#82

Fujii Masao

masao.fujii@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#81)

Re: Quorum commit for multiple synchronous replication.

On Thu, Apr 13, 2017 at 9:23 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I don't think we can do that. In the part above, SyncRepConfig is
referenced only when synchronous replication is in use, so we can assume
that SyncRepConfig is not NULL there. Perhaps we should add an assertion there.

I'll send an updated patch tomorrow.

Thanks!

But on second thought, I don't think that reporting NULL as the priority when
quorum-based sync replication is used is less confusing. When there is an async
standby, we report 0 as its priority when synchronous_standby_names is empty
or priority-based sync replication is configured. But with the patch, when
quorum-based sync replication is specified, NULL is reported for it.
Isn't this confusing?

I think it's less confusing to always report 0 as the priority of an
async standby, whatever the setting of synchronous_standby_names is.
Thoughts?

If we adopt this idea, I think that in quorum-based sync replication
the priorities of all the standbys listed in synchronous_standby_names
should be 1 instead of NULL. That is, those standbys have the same
(highest) priority, which means that any of them can be chosen as a
sync standby. Thoughts?
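
The reporting rule proposed here can be sketched as a small function. This is only an illustration: reported_priority is a hypothetical name, and it assumes that in priority-based replication a listed standby's priority equals its 1-based position in synchronous_standby_names.

```c
/* Assumption: stand-ins for SYNC_REP_PRIORITY and SYNC_REP_QUORUM. */
typedef enum
{
	METHOD_PRIORITY,
	METHOD_QUORUM
} SyncMethod;

/*
 * list_position is the standby's 1-based position in
 * synchronous_standby_names, or 0 if it is not listed (async).
 */
int
reported_priority(SyncMethod method, int list_position)
{
	if (list_position == 0)
		return 0;				/* async standby: always 0 */
	if (method == METHOD_QUORUM)
		return 1;				/* all listed standbys share priority 1 */
	return list_position;		/* priority-based: earlier entries rank higher */
}
```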

Regards,

--
Fujii Masao


#83

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Fujii Masao (#82)

Re: Quorum commit for multiple synchronous replication.

On Fri, Apr 14, 2017 at 2:47 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I think it's less confusing to always report 0 as the priority of an
async standby, whatever the setting of synchronous_standby_names is.
Thoughts?

Or we could report the priority as NULL for async standbys as
well; the priority number has no meaning for them anyway...

If we adopt this idea, I think that in quorum-based sync replication
the priorities of all the standbys listed in synchronous_standby_names
should be 1 instead of NULL. That is, those standbys have the same
(highest) priority, which means that any of them can be chosen as a
sync standby. Thoughts?

It is mainly my fault for suggesting that standbys in a quorum set should
have a priority set to NULL. My 2c on the matter is that I would be
fine with either giving the async standbys a priority of NULL
or using a priority of 1 for standbys in a quorum set. Though,
honestly, I find that showing a priority number where it has no
real meaning is even more confusing..
--
Michael


#84

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Michael Paquier (#83)

Re: Quorum commit for multiple synchronous replication.

On Fri, Apr 14, 2017 at 9:38 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Apr 14, 2017 at 2:47 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I think it's less confusing to always report 0 as the priority of an
async standby, whatever the setting of synchronous_standby_names is.
Thoughts?

Or we could report the priority as NULL for async standbys as
well; the priority number has no meaning for them anyway...

I agree with setting the same value (a priority or NULL) for all sync
standbys in a quorum set. As Fujii-san mentioned, I also think that it
conveys that all standbys in a quorum set can be chosen equally. But to
cause less confusion for current users, I'd rather not change the current
behavior of the priority of async standbys.

If we adopt this idea, I think that in quorum-based sync replication
the priorities of all the standbys listed in synchronous_standby_names
should be 1 instead of NULL. That is, those standbys have the same
(highest) priority, which means that any of them can be chosen as a
sync standby. Thoughts?

It is mainly my fault for suggesting that standbys in a quorum set should
have a priority set to NULL. My 2c on the matter is that I would be
fine with either giving the async standbys a priority of NULL
or using a priority of 1 for standbys in a quorum set. Though,
honestly, I find that showing a priority number where it has no
real meaning is even more confusing..

This is just a thought, but we could merge sync_priority and sync_state
into one column. The sync priority has meaning only when the
standby is considered a sync standby or a potential standby in
priority-based sync replication. For example, we could show something
like 'sync:N' as the state of a sync standby and 'potential:N' as
the state of a potential standby in priority-based sync replication,
where N is the priority. In quorum-based sync replication it would be
just 'quorum'. It breaks backward compatibility, though.
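
For concreteness, the merged column could be produced by something like the following sketch. format_sync_state and SyncMethod are hypothetical names, and the 'async' label for a zero priority is carried over from the existing sync_state values as an assumption.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumption: stand-ins for SYNC_REP_PRIORITY and SYNC_REP_QUORUM. */
typedef enum
{
	METHOD_PRIORITY,
	METHOD_QUORUM
} SyncMethod;

/*
 * Write the merged state into buf: "async", "quorum", "sync:N" or
 * "potential:N", where N is the standby's priority.
 */
void
format_sync_state(char *buf, size_t buflen, SyncMethod method,
				  int priority, bool is_sync)
{
	if (priority == 0)
		snprintf(buf, buflen, "async");
	else if (method == METHOD_QUORUM)
		snprintf(buf, buflen, "quorum");
	else
		snprintf(buf, buflen, "%s:%d",
				 is_sync ? "sync" : "potential", priority);
}
```

A monitoring query that currently matches sync_state = 'sync' exactly would stop matching under this format, which is the compatibility break mentioned above.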

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#85

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 8 years ago

In reply to: Masahiko Sawada (#84)

Re: Quorum commit for multiple synchronous replication.

At Fri, 14 Apr 2017 10:47:46 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoD7Scnjrn5m+_eaDEsZnyXpbwGYw7x1sXeipAK=iqBKUQ@mail.gmail.com>

This is just a thought but we can merge sync_priority and sync_state
into one column. The sync priority can have meaning only when the
standby is considered as a sync standby or a potential standby in
priority-based sync replication. For example, we can show something
like 'sync:N' as states of the sync standby and 'potential:N' as
states of the potential standby in priority-based sync replication,
where N means the priority. In quorum-based sync replication it is
just 'quorum'. It breaks backward compatibility, though.

I'm not sure how sync_priority is used, but I know sync_state is
used to detect the state or soundness of a replication set.
Introducing a variable part wouldn't be welcomed by the people who rely on it.

The current shape of pg_stat_replication is as follows.

For this case, the following query will work.

SELECT count(*) > 0 FROM pg_stat_replication WHERE sync_state ='sync'

Maybe a bit confusing, but we could use the field to show how many
hosts are required to form the quorum. For example, consider the case
with s_s_names = 'ANY 3 (sby1,sby2,sby3,sby4)'.

In this case, we can detect satisfaction of the quorum setup by
something like this.

SELECT count(*) >= sync_priority FROM pg_stat_replication WHERE
sync_state='quorum' GROUP BY sync_priority;

But, maybe we should provide a means to detect the standbys
really in sync with the master. This doesn't give such
information.

We could show top N standbys as priority-1 and others as
priority-2. (Of course this requires some additional
computation.)

 application_name | flush_location | sync_priority | sync_state
------------------+----------------+---------------+------------
 sby1             | 0/700140       |             1 | quorum
 sby4             | 0/700100       |             1 | quorum
 sby2             | 0/700080       |             1 | quorum
 sby3             | 0/6FFF3e       |             2 | quorum
 sby3             | 0/50e345       |             2 | quorum
 sby5             | 0/700140       |             0 | async
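
The "additional computation" for this layout might look like the sketch below, which marks the N most advanced listed standbys as priority 1 and the remaining listed ones as priority 2. The Standby struct and assign_quorum_priorities are hypothetical, flush positions are plain integers standing in for XLogRecPtr, and standbys tied at the Nth position are all given priority 1, which is one possible choice rather than anything decided in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-standby record used only for this illustration. */
typedef struct
{
	const char *name;
	uint64_t	flush;		/* flush position, stand-in for XLogRecPtr */
	bool		listed;		/* named in synchronous_standby_names? */
	int			priority;	/* output: 0 = async, 1 = top N, 2 = the rest */
} Standby;

/*
 * A listed standby gets priority 1 if fewer than n listed standbys are
 * strictly ahead of it. Quadratic, which is fine for a handful of standbys.
 */
void
assign_quorum_priorities(Standby *s, size_t len, size_t n)
{
	for (size_t i = 0; i < len; i++)
	{
		size_t		ahead = 0;

		if (!s[i].listed)
		{
			s[i].priority = 0;
			continue;
		}
		for (size_t j = 0; j < len; j++)
		{
			if (s[j].listed && s[j].flush > s[i].flush)
				ahead++;
		}
		s[i].priority = (ahead < n) ? 1 : 2;
	}
}
```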

In this case, the soundness of the quorum set is checked by the
following query.

SELECT count(*) > 0 FROM pg_stat_replication WHERE sync_priority > 0;

We will find the standbys 'in sync' by the following query.

SELECT application_name FROM pg_stat_replication WHERE sync_priority = 1;

If the master doesn't have enough standbys, we could perhaps show the
state as follows.

 application_name | flush_location | sync_priority | sync_state
------------------+----------------+---------------+------------
 sby1             | 0/700140       |             0 | quorum
 sby4             | 0/700100       |             0 | quorum
 sby5             | 0/700140       |             0 | async

Or we can use 'quorum-potential' instead of the 'quorum' above.

Or, we might be able to keep backward compatibility in a sense.

 application_name | flush_location | sync_priority | sync_state
------------------+----------------+---------------+------------
 sby1             | 0/700140       |             1 | sync
 sby4             | 0/700100       |             1 | sync
 sby2             | 0/700080       |             1 | sync
 sby3             | 0/6FFF3e       |             2 | potential
 sby3             | 0/50e345       |             2 | potential
 sby5             | 0/700140       |             0 | async

In the above discussion, I didn't consider possible future
extensions of this feature.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#86

Simon Riggs

simon@2ndquadrant.com

over 8 years ago

In reply to: Fujii Masao (#82)

Re: Quorum commit for multiple synchronous replication.

On 13 April 2017 at 18:47, Fujii Masao <masao.fujii@gmail.com> wrote:

But on second thought, I don't think that reporting NULL as the priority when
quorum-based sync replication is used is less confusing. When there is an async
standby, we report 0 as its priority when synchronous_standby_names is empty
or priority-based sync replication is configured. But with the patch, when
quorum-based sync replication is specified, NULL is reported for it.
Isn't this confusing?

To me, yes, it is confusing.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#87

Noah Misch

noah@leadboat.com

over 8 years ago

In reply to: Noah Misch (#76)

Re: Quorum commit for multiple synchronous replication.

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(2)
There are still many source comments and pieces of documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.


#88

Noah Misch

noah@leadboat.com

over 8 years ago

In reply to: Noah Misch (#87)

Re: Quorum commit for multiple synchronous replication.

On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(2)
There are still many source comments and pieces of documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.

IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due
for your status update. Please reacquaint yourself with the policy on open
item ownership[1] and then reply immediately. If I do not hear from you by
2017-04-17 05:00 UTC, I will transfer this item to release management team
ownership without further notice.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


#89

Fujii Masao

masao.fujii@gmail.com

over 8 years ago

In reply to: Noah Misch (#88)

Re: Quorum commit for multiple synchronous replication.

On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(2)
There are still many source comments and pieces of documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Sorry for the delay.

I will review Sawada-san's patch and commit something in the next three days.
So the next target date is April 19th.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.

As I said at first, this is not a bug. There are some proposals for a better
design of the priority column in pg_stat_replication, but we have not reached
consensus yet. So I think that it's better to move this open item to the
"Design Decisions to Recheck Mid-Beta" section so that we can hear more opinions.

Regards,

--
Fujii Masao


#90

Fujii Masao

masao.fujii@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#79)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Wed, Apr 12, 2017 at 2:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 4:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 10:51 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them before the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
priority-based sync replication is chosen to keep backward
compatibility. However, some hackers argued for changing this decision
so that quorum commit is chosen, because they think that most users
prefer a quorum.

(2)
There are still many source comments and pieces of documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

I agree to report NULL as the priority. I'll send a patch for this as well.

Regards,

Attached are two draft patches. The first makes
pg_stat_replication.sync_priority report NULL under quorum-based sync
replication. To avoid extra changes, I have not so far changed the code
that sets the standby priority. The other improves the comments and
documentation. If there is anything more that we need to mention in the
documentation, please give me feedback.

Attached is the modified version of the doc improvement patch.
Barring any objection, I will commit this version.

+    In term of performance there is difference between two synchronous
+    replication method. Generally quorum-based synchronous replication
+    tends to be higher performance than priority-based synchronous
+    replication. Because in quorum-based synchronous replication, the
+    transaction can resume as soon as received the specified number of
+    acknowledgement from synchronous standby servers without distinction
+    of standby servers. On the other hand in priority-based synchronous
+    replication, the standby server that the primary server must wait for
+    is fixed until a synchronous standby fails. Therefore, if a server on
+    low-performance machine a has high priority and is chosen as a
+    synchronous standby server it can reduce performance for database
+    applications.

This description looks misleading. A quorum-based sync rep is basically
more efficient when there are multiple standbys in s_s_names and you want
to replicate the transactions to some of them synchronously. I think that
this assumption should be documented explicitly. So I modified this
description. Please see the modified version in the attached patch.
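
The assumption stated above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not PostgreSQL code: every listed standby is connected, each has a fixed acknowledgement latency, and the array is given in s_s_names order. The function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Compare two doubles for qsort(), ascending. */
static int cmp_double_asc(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;

    return (x > y) - (x < y);
}

/*
 * Priority-based (FIRST n): the primary waits for the first n standbys in
 * list order, so the commit latency is the slowest of those n.
 */
double priority_commit_latency(const double *ack_ms, int len, int n)
{
    assert(n >= 1 && n <= len);

    double worst = ack_ms[0];

    for (int i = 1; i < n; i++)
        if (ack_ms[i] > worst)
            worst = ack_ms[i];
    return worst;
}

/*
 * Quorum-based (ANY n): the primary waits for any n standbys, so the commit
 * latency is the n-th fastest acknowledgement among all listed standbys.
 */
double quorum_commit_latency(const double *ack_ms, int len, int n)
{
    assert(n >= 1 && n <= len);

    double *sorted = malloc(len * sizeof(double));

    assert(sorted != NULL);
    memcpy(sorted, ack_ms, len * sizeof(double));
    qsort(sorted, len, sizeof(double), cmp_double_asc);

    double result = sorted[n - 1];

    free(sorted);
    return result;
}
```

With three standbys at 5 ms, 80 ms and 7 ms (in list order) and n = 2, the priority-based model waits for the 80 ms standby, while the quorum-based model waits only for the 7 ms one. With a single listed standby the two models give the same latency, which is why the "multiple standbys" assumption matters.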

+ /*
+ * Update priority of this WalSender, but note that in
+ * quroum-based sync replication, the value of
+ * sync_standby_priority has no effect.
+ */

This is not true because even quorum-based sync rep uses the priority
value to check whether the standby is async or sync. So I just remove this.
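
To make the point concrete, here is a hedged sketch of how a sync_state label could be derived from the priority value. It is not the actual walsender code; the enum, the function name and the "chosen" flag are hypothetical, and the labels are the ones that appear in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the configured replication method. */
typedef enum { METHOD_PRIORITY, METHOD_QUORUM } SyncMethod;

/*
 * Priority 0 means the standby is not listed in s_s_names, so it is async
 * regardless of the method. This is why the priority value still matters
 * in quorum-based sync replication.
 */
const char *sync_state_label(int priority, bool chosen, SyncMethod method)
{
    assert(priority >= 0);

    if (priority == 0)
        return "async";
    if (!chosen)
        return "potential";     /* listed, but not currently waited for */
    return (method == METHOD_QUORUM) ? "quorum" : "sync";
}
```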

+ * In quorum-based sync replication we select the quorum sync
+ * standby without theirs priority. The all running active standbys
+ * are considered as a candidate for quorum sync standbys

Same as above.

Also I removed some descriptions that I thought unnecessary to add.

Regards,

--
Fujii Masao

Attachments:

quorum_repl_doc_improve_v2.patchapplication/octet-stream; name=quorum_repl_doc_improve_v2.patchDownload

*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1084,1091 **** primary_slot_name = 'node_a_slot'
      In the case that <varname>synchronous_commit</> is set to
      <literal>remote_apply</>, the standby sends reply messages when the commit
      record is replayed, making the transaction visible.
!     If the standby is chosen as a synchronous standby, from a priority
!     list of <varname>synchronous_standby_names</> on the primary, the reply
      messages from that standby will be considered along with those from other
      synchronous standbys to decide when to release transactions waiting for
      confirmation that the commit record has been received. These parameters
--- 1084,1091 ----
      In the case that <varname>synchronous_commit</> is set to
      <literal>remote_apply</>, the standby sends reply messages when the commit
      record is replayed, making the transaction visible.
!     If the standby is chosen as a synchronous standby, according to the setting
!     of <varname>synchronous_standby_names</> on the primary, the reply
      messages from that standby will be considered along with those from other
      synchronous standbys to decide when to release transactions waiting for
      confirmation that the commit record has been received. These parameters
***************
*** 1228,1233 **** synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
--- 1228,1247 ----
      the rate of generation of WAL data.
     </para>
  
+    <para>
+     A quorum-based synchronous replication is basically more efficient than
+     a priority-based one when you specify multiple standbys in
+     <varname>synchronous_standby_names</> and want to replicate
+     the transactions to some of them synchronously. In this case,
+     the transactions in a priority-based synchronous replication must wait for
+     reply from the slowest standby in synchronous standbys chosen based on
+     their priorities, and which may increase the transaction latencies.
+     On the other hand, using a quorum-based synchronous replication may
+     improve those latencies because it makes the transactions wait only for
+     replies from the requested number of faster standbys in all the listed
+     standbys, i.e., such slow standby doesn't block the transactions.
+    </para>
+ 
     </sect3>
  
     <sect3 id="synchronous-replication-ha">
***************
*** 1246,1254 **** synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
      The best solution for high availability is to ensure you keep as many
      synchronous standbys as requested. This can be achieved by naming multiple
      potential synchronous standbys using <varname>synchronous_standby_names</>.
!     The standbys whose names appear earlier in the list will be used as
!     synchronous standbys. Standbys listed after these will take over
!     the role of synchronous standby if one of current ones should fail.
     </para>
  
     <para>
--- 1260,1279 ----
      The best solution for high availability is to ensure you keep as many
      synchronous standbys as requested. This can be achieved by naming multiple
      potential synchronous standbys using <varname>synchronous_standby_names</>.
!    </para>
! 
!    <para>
!     In a priority-based synchronous replication, the standbys whose names
!     appear earlier in the list will be used as synchronous standbys.
!     Standbys listed after these will take over the role of synchronous standby
!     if one of current ones should fail.
!    </para>
! 
!    <para>
!     In a quorum-based synchronous replication, all the standbys appearing
!     in the list will be used as candidates for synchronous standbys.
!     Even if one of them should fail, the other standbys will keep performing
!     the role of candidates of synchronous standby.
     </para>
  
     <para>
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 53,58 ****
--- 53,62 ----
   * in the list. All the standbys appearing in the list are considered as
   * candidates for quorum synchronous standbys.
   *
+  * If neither FIRST nor ANY is specified, FIRST is used as the method.
+  * This is for backward compatibility with 9.6 or before where only a
+  * priority-based sync replication was supported.
+  *
   * Before the standbys chosen from synchronous_standby_names can
   * become the synchronous standbys they must have caught up with
   * the primary; that may take some time. Once caught up,
***************
*** 629,634 **** SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
--- 633,639 ----
  		i++;
  	}
  
+ 	/* Sort each array in descending order */
  	qsort(write_array, len, sizeof(XLogRecPtr), cmp_lsn);
  	qsort(flush_array, len, sizeof(XLogRecPtr), cmp_lsn);
  	qsort(apply_array, len, sizeof(XLogRecPtr), cmp_lsn);
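
The comment added above says that each array is sorted in descending order. After such a sort, the N-th element is the newest position that at least N standbys have reached. A minimal sketch of that idea follows; `nth_latest_lsn` and `cmp_lsn_desc` are hypothetical names and the `XLogRecPtr` typedef is only a stand-in, so this is not the code from syncrep.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position type */

/* Compare two LSNs for qsort() so that larger (newer) values come first. */
static int cmp_lsn_desc(const void *a, const void *b)
{
    XLogRecPtr lsn1 = *(const XLogRecPtr *) a;
    XLogRecPtr lsn2 = *(const XLogRecPtr *) b;

    if (lsn1 > lsn2)
        return -1;
    if (lsn1 < lsn2)
        return 1;
    return 0;
}

/*
 * Sort the reported positions in descending order (in place) and return the
 * nth one (1-based). At least nth standbys have reached that position.
 */
XLogRecPtr nth_latest_lsn(XLogRecPtr *lsns, int len, int nth)
{
    assert(nth >= 1 && nth <= len);

    qsort(lsns, len, sizeof(XLogRecPtr), cmp_lsn_desc);
    return lsns[nth - 1];
}
```

For example, with flush positions 0/700140, 0/700080, 0/6FFF3E and 0/700100 and nth = 2, the result is 0/700100: two standbys have flushed at least that far.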

#91

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Fujii Masao (#90)

Re: Quorum commit for multiple synchronous replication.

On Tue, Apr 18, 2017 at 3:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Apr 12, 2017 at 2:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 4:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 10:51 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them before the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
priority-based sync replication is chosen to keep backward
compatibility. However, some hackers argued for changing this decision
so that quorum commit is chosen, because they think that most users
prefer a quorum.

(2)
There are still many source comments and pieces of documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

I agree to report NULL as the priority. I'll send a patch for this as well.

Regards,

Attached are two draft patches. The first makes
pg_stat_replication.sync_priority report NULL under quorum-based sync
replication. To avoid extra changes, I have not so far changed the code
that sets the standby priority. The other improves the comments and
documentation. If there is anything more that we need to mention in the
documentation, please give me feedback.

Attached is the modified version of the doc improvement patch.
Barring any objection, I will commit this version.

Thank you for updating the patch.

+    In term of performance there is difference between two synchronous
+    replication method. Generally quorum-based synchronous replication
+    tends to be higher performance than priority-based synchronous
+    replication. Because in quorum-based synchronous replication, the
+    transaction can resume as soon as received the specified number of
+    acknowledgement from synchronous standby servers without distinction
+    of standby servers. On the other hand in priority-based synchronous
+    replication, the standby server that the primary server must wait for
+    is fixed until a synchronous standby fails. Therefore, if a server on
+    low-performance machine a has high priority and is chosen as a
+    synchronous standby server it can reduce performance for database
+    applications.

This description looks misleading. A quorum-based sync rep is basically
more efficient when there are multiple standbys in s_s_names and you want
to replicate the transactions to some of them synchronously. I think that
this assumption should be documented explicitly. So I modified this
description. Please see the modified version in the attached patch.

You're right. The modified version looks good to me, thanks.

+ /*
+ * Update priority of this WalSender, but note that in
+ * quroum-based sync replication, the value of
+ * sync_standby_priority has no effect.
+ */

This is not true because even quorum-based sync rep uses the priority
value to check whether the standby is async or sync. So I just remove this.

+ * In quorum-based sync replication we select the quorum sync
+ * standby without theirs priority. The all running active standbys
+ * are considered as a candidate for quorum sync standbys

Same as above.

Also I removed some descriptions that I thought unnecessary to add.

Regards,

--
Fujii Masao

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#92

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 8 years ago

In reply to: Masahiko Sawada (#91)

Re: Quorum commit for multiple synchronous replication.

At Tue, 18 Apr 2017 14:58:50 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBqSjUGx0LCDrjEDLB-yx2EvgLMdT8Nz4ZR_xpxrbMU+Q@mail.gmail.com>

On Tue, Apr 18, 2017 at 3:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Apr 12, 2017 at 2:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 4:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 10:51 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them before the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
priority-based sync replication is chosen to keep backward
compatibility. However, some hackers argued for changing this decision
so that quorum commit is chosen, because they think that most users
prefer a quorum.

(2)
There are still many source comments and pieces of documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

I agree to report NULL as the priority. I'll send a patch for this as well.

Regards,

Attached are two draft patches. The first makes
pg_stat_replication.sync_priority report NULL under quorum-based sync
replication. To avoid extra changes, I have not so far changed the code
that sets the standby priority. The other improves the comments and
documentation. If there is anything more that we need to mention in the
documentation, please give me feedback.

Attached is the modified version of the doc improvement patch.
Barring any objection, I will commit this version.

Thank you for updating the patch.

+    In term of performance there is difference between two synchronous
+    replication method. Generally quorum-based synchronous replication
+    tends to be higher performance than priority-based synchronous
+    replication. Because in quorum-based synchronous replication, the
+    transaction can resume as soon as received the specified number of
+    acknowledgement from synchronous standby servers without distinction
+    of standby servers. On the other hand in priority-based synchronous
+    replication, the standby server that the primary server must wait for
+    is fixed until a synchronous standby fails. Therefore, if a server on
+    low-performance machine a has high priority and is chosen as a
+    synchronous standby server it can reduce performance for database
+    applications.

This description looks misleading. A quorum-based sync rep is basically
more efficient when there are multiple standbys in s_s_names and you want
to replicate the transactions to some of them synchronously. I think that
this assumption should be documented explicitly. So I modified this
description. Please see the modified version in the attached patch.

You're right. The modified version looks good to me, thanks.

It looks better to me, too. But (though I'm not sure, of course)
the sentences seem to need improvement.

| <para>
| Quorum-based synchronous replication is basically more
| efficient than priority-based one when you specify multiple
| standbys in <varname>synchronous_standby_names</> and want
| to synchronously replicate transactions to two or more of
| them. In the priority-based case, the replication master
| must wait for a reply from the slowest standby in the
| required number of standbys in priority order, which may
| be slower than the rest. On the other hand, quorum-based
| synchronous replication may reduce the latency because it
| allows transactions to wait only for replies from a
| required number of fastest standbys in all the listed
| standbys, i.e., such slow standby doesn't block
| transactions.
| </para>

I'm not sure that this is actually an improvement..

+ /*
+ * Update priority of this WalSender, but note that in
+ * quroum-based sync replication, the value of
+ * sync_standby_priority has no effect.
+ */

This is not true because even quorum-based sync rep uses the priority
value to check whether the standby is async or sync. So I just remove this.

+ * In quorum-based sync replication we select the quorum sync
+ * standby without theirs priority. The all running active standbys
+ * are considered as a candidate for quorum sync standbys

Same as above.

Also I removed some descriptions that I thought unnecessary to add.

Regards,

--
Fujii Masao

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Kyotaro Horiguchi
NTT Open Source Software Center


#93

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Kyotaro HORIGUCHI (#92)

Re: Quorum commit for multiple synchronous replication.

On Tue, Apr 18, 2017 at 6:40 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Tue, 18 Apr 2017 14:58:50 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBqSjUGx0LCDrjEDLB-yx2EvgLMdT8Nz4ZR_xpxrbMU+Q@mail.gmail.com>

On Tue, Apr 18, 2017 at 3:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Apr 12, 2017 at 2:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 4:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 10:51 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer a quorum.

(2)
There are still many source comments and much documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.
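
As a rough illustration of item (1), the sketch below models how the
replication method follows from the value of synchronous_standby_names:
a bare list or a leading number falls back to the priority method, and
only an explicit ANY selects quorum. The function name parse_sync_names
and the regular expression are hypothetical and far simpler than the
real grammar (no quoted names, for instance).

```python
import re

# Hypothetical, simplified pattern: optional FIRST/ANY, a count, and a
# parenthesized list. The real parser handles more cases than this.
_PATTERN = re.compile(
    r"^\s*(?:(FIRST|ANY)\s+)?(\d+)\s*\((.*)\)\s*$", re.IGNORECASE
)


def parse_sync_names(value):
    """Return (method, num_sync, standby_names) for an s_s_names string."""
    match = _PATTERN.match(value)
    if match:
        keyword, num_sync, names = match.groups()
        # Only an explicit ANY selects quorum; a missing keyword keeps the
        # backward-compatible priority method.
        is_any = keyword is not None and keyword.upper() == "ANY"
        method = "quorum" if is_any else "priority"
        count = int(num_sync)
    else:
        # A plain comma-separated list behaves like FIRST 1 (...).
        method, count, names = "priority", 1, value
    standbys = [name.strip() for name in names.split(",") if name.strip()]
    return method, count, standbys


no_keyword = parse_sync_names("2 (s1, s2, s3)")
explicit_any = parse_sync_names("ANY 2 (s1, s2, s3)")
plain_list = parse_sync_names("s1, s2")
```

Changing the default discussed in (1) would amount to flipping the
branch taken when no keyword is given.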

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

I agree to report NULL as the priority. I'll send a patch for this as well.

Regards,

Attached are two draft patches. The first makes
pg_stat_replication.sync_priority report NULL in quorum-based sync
replication. To avoid extra changes, I have not so far changed the code
that sets the standby priority. The other improves the comments and
documentation. If there is anything more we need to mention in the
documentation, please give me feedback.

Attached is the modified version of the doc improvement patch.
Barring any objection, I will commit this version.

Thank you for updating the patch.
+    In term of performance there is difference between two synchronous
+    replication method. Generally quorum-based synchronous replication
+    tends to be higher performance than priority-based synchronous
+    replication. Because in quorum-based synchronous replication, the
+    transaction can resume as soon as received the specified number of
+    acknowledgement from synchronous standby servers without distinction
+    of standby servers. On the other hand in priority-based synchronous
+    replication, the standby server that the primary server must wait for
+    is fixed until a synchronous standby fails. Therefore, if a server on
+    low-performance machine a has high priority and is chosen as a
+    synchronous standby server it can reduce performance for database
+    applications.
This description looks misleading. A quorum-based sync rep is basically
more efficient when there are multiple standbys in s_s_names and you want
to replicate the transactions to some of them synchronously. I think that
this assumption should be documented explicitly. So I modified this
description. Please see the modified version in the attached patch.
You're right. The modified version looks good to me, thanks.
It looks better to me, too. But (even I'm not sure, of course)
the sentences seem to need improvement.

| <para>
| Quorum-based synchronous replication is basically more
| efficient than priority-based one when you specify multiple
| standbys in <varname>synchronous_standby_names</> and want
| to synchronously replicate transactions to two or more of
| them. In the priority-based case, the replication master
| must wait for a reply from the slowest standby in the
| required number of standbys in priority order, which may
| be slower than the rest.

I understood Fujii-san's point to be that quorum-based sync
replication can be more efficient when we want to replicate the
transaction to only "part of" the standbys listed in s_s_names. So I
guess it's not a good idea to say "two or more of them", which can also
mean all of the standbys.

| On the other hand, quorum-based
| synchronous replication may reduce the latency because it
| allows transactions to wait only for replies from a
| required number of fastest standbys in all the listed
| standbys, i.e., such slow standby doesn't block
| transactions.
| </para>

I'm not sure that this is actually an improvement..

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#94

Fujii Masao

masao.fujii@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#93)

Re: Quorum commit for multiple synchronous replication.

On Tue, Apr 18, 2017 at 7:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Apr 18, 2017 at 6:40 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Tue, 18 Apr 2017 14:58:50 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBqSjUGx0LCDrjEDLB-yx2EvgLMdT8Nz4ZR_xpxrbMU+Q@mail.gmail.com>
On Tue, Apr 18, 2017 at 3:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Apr 12, 2017 at 2:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 4:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 6, 2017 at 10:51 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them until the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
a priority-based sync replication is chosen for keeping backward
compatibility. However some hackers argued to change this decision
so that a quorum commit is chosen because they think that most users
prefer a quorum.

(2)
There are still many source comments and much documentation that
we need to update, for example, in high-availability.sgml. We need to
check and update them thoroughly.

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Fujii,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Thanks for the notice!

Regarding the item (2), Sawada-san told me that he will work on it after
this CommitFest finishes. So we would receive the patch for the item from
him next week. If there will be no patch even after the end of next week
(i.e., April 14th), I will. Let's wait for Sawada-san's action at first.

Sounds reasonable; I will look for your update on 14Apr or earlier.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that timing. So we need more time that users can try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

BTW, IMO (3) should be fixed so that pg_stat_replication reports NULL
as the priority if quorum-based sync rep is chosen. It's less confusing.

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

I agree to report NULL as the priority. I'll send a patch for this as well.

Regards,

Attached are two draft patches. The first makes
pg_stat_replication.sync_priority report NULL in quorum-based sync
replication. To avoid extra changes, I have not so far changed the code
that sets the standby priority. The other improves the comments and
documentation. If there is anything more we need to mention in the
documentation, please give me feedback.

Attached is the modified version of the doc improvement patch.
Barring any objection, I will commit this version.

Thank you for updating the patch.
+    In term of performance there is difference between two synchronous
+    replication method. Generally quorum-based synchronous replication
+    tends to be higher performance than priority-based synchronous
+    replication. Because in quorum-based synchronous replication, the
+    transaction can resume as soon as received the specified number of
+    acknowledgement from synchronous standby servers without distinction
+    of standby servers. On the other hand in priority-based synchronous
+    replication, the standby server that the primary server must wait for
+    is fixed until a synchronous standby fails. Therefore, if a server on
+    low-performance machine a has high priority and is chosen as a
+    synchronous standby server it can reduce performance for database
+    applications.
This description looks misleading. A quorum-based sync rep is basically
more efficient when there are multiple standbys in s_s_names and you want
to replicate the transactions to some of them synchronously. I think that
this assumption should be documented explicitly. So I modified this
description. Please see the modified version in the attached patch.
You're right. The modified version looks good to me, thanks.
It looks better to me, too. But (even I'm not sure, of course)
the sentences seem to need improvement.

| <para>
| Quorum-based synchronous replication is basically more
| efficient than priority-based one when you specify multiple
| standbys in <varname>synchronous_standby_names</> and want
| to synchronously replicate transactions to two or more of
| them. In the priority-based case, the replication master
| must wait for a reply from the slowest standby in the
| required number of standbys in priority order, which may
| be slower than the rest.
I understood Fujii-san's point to be that quorum-based sync
replication can be more efficient when we want to replicate the
transaction to only "part of" the standbys listed in s_s_names.

Yes.

Anyway, I pushed the patch except this paragraph.
Regarding this paragraph, the patch for better descriptions is welcome.

Regards,

--
Fujii Masao


#95

Noah Misch

noah@leadboat.com

over 8 years ago

In reply to: Fujii Masao (#89)

Re: Quorum commit for multiple synchronous replication.

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.

As I told firstly this is not a bug. There are some proposals for better design
of priority column in pg_stat_replication, but we've not reached the consensus
yet. So I think that it's better to move this open item to "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?


#96

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Noah Misch (#95)

Re: Quorum commit for multiple synchronous replication.

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.

As I told firstly this is not a bug. There are some proposals for better design
of priority column in pg_stat_replication, but we've not reached the consensus
yet. So I think that it's better to move this open item to "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

FWIW, the reason for the current behavior is that it is useful for
users who want to switch from ANY to FIRST: they can see which
standbys would become sync or potential.
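
To make that argument concrete, here is a minimal sketch of the
behavior being described, in which a listed standby's priority is its
position in the list under both methods and only the reported state
differs. The function classify and the standby names are made up for
illustration; the real server also looks at walsender state and WAL
positions, which this model ignores.

```python
def classify(method, num_sync, listed, connected):
    """Return {standby: (priority, sync_state)} for each connected standby.

    Simplified model: priority is the 1-based position in the list,
    and 0 marks a standby that is not listed (async).
    """
    states = {}
    remaining_sync = num_sync
    for position, name in enumerate(listed, start=1):
        if name not in connected:
            continue
        if method == "ANY":
            state = "quorum"
        elif remaining_sync > 0:
            state = "sync"
            remaining_sync -= 1
        else:
            state = "potential"
        states[name] = (position, state)
    for name in connected:
        if name not in states:
            states[name] = (0, "async")
    return states


listed = ["s1", "s2", "s3"]
connected = ["s1", "s2", "s3", "s4"]
under_any = classify("ANY", 2, listed, connected)
under_first = classify("FIRST", 2, listed, connected)
```

Because the priorities in under_any match those in under_first, a user
running ANY can already tell that s1 and s2 would become sync and s3
potential after a switch to FIRST. Reporting a constant 1 or NULL would
drop that information.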

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#97

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#96)

Re: Quorum commit for multiple synchronous replication.

On Wed, Apr 19, 2017 at 1:52 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.

As I told firstly this is not a bug. There are some proposals for better design
of priority column in pg_stat_replication, but we've not reached the consensus
yet. So I think that it's better to move this open item to "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

Just adding that I am the only one advocating for switching the
priority number to NULL for async standbys, and that this proposal is
visibly outvoted as it breaks backward-compatibility with the
0-priority setting.
--
Michael


#98

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 8 years ago

In reply to: Fujii Masao (#94)

Re: Quorum commit for multiple synchronous replication.

At Wed, 19 Apr 2017 03:03:38 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwE95S5GM9UZh0F3ef2D3iEwJ59skh=EwW5HmDJPe2aXog@mail.gmail.com>

On Tue, Apr 18, 2017 at 7:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Apr 18, 2017 at 6:40 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Tue, 18 Apr 2017 14:58:50 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBqSjUGx0LCDrjEDLB-yx2EvgLMdT8Nz4ZR_xpxrbMU+Q@mail.gmail.com>

On Tue, Apr 18, 2017 at 3:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Apr 12, 2017 at 2:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
This description looks misleading. A quorum-based sync rep is basically
more efficient when there are multiple standbys in s_s_names and you want
to replicate the transactions to some of them synchronously. I think that
this assumption should be documented explicitly. So I modified this
description. Please see the modified version in the attached patch.

You're right. The modified version looks good to me, thanks.

+     A quorum-based synchronous replication is basically more efficient than
+     a priority-based one when you specify multiple standbys in
+     <varname>synchronous_standby_names</> and want to replicate
+     the transactions to some of them synchronously. In this case,
+     the transactions in a priority-based synchronous replication must wait for
+     reply from the slowest standby in synchronous standbys chosen based on
+     their priorities, and which may increase the transaction latencies.
+     On the other hand, using a quorum-based synchronous replication may
+     improve those latencies because it makes the transactions wait only for
+     replies from the requested number of faster standbys in all the listed
+     standbys, i.e., such slow standby doesn't block the transactions.

It looks better to me, too. But (even I'm not sure, of course)
the sentences seem to need improvement.

| <para>
| Quorum-based synchronous replication is basically more
| efficient than priority-based one when you specify multiple
| standbys in <varname>synchronous_standby_names</> and want
| to synchronously replicate transactions to two or more of
| them. In the priority-based case, the replication master
| must wait for a reply from the slowest standby in the
| required number of standbys in priority order, which may
| be slower than the rest.

I understood Fujii-san's point to be that quorum-based sync
replication can be more efficient when we want to replicate the
transaction to only "part of" the standbys listed in s_s_names.

Yes.

Yes. Did I write something that contradicts that?

Anyway, I pushed the patch except this paragraph.
Regarding this paragraph, the patch for better descriptions is welcome.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#99

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 8 years ago

In reply to: Kyotaro HORIGUCHI (#98)

Re: Quorum commit for multiple synchronous replication.

Ok, I got the point.

At Wed, 19 Apr 2017 17:39:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170419.173901.16598616.horiguchi.kyotaro@lab.ntt.co.jp>

| <para>
| Quorum-based synchronous replication is basically more
| efficient than priority-based one when you specify multiple
| standbys in <varname>synchronous_standby_names</> and want
| to synchronously replicate transactions to two or more of
| them.

"Some" means "not all".

| In the priority-based case, the replication master
| must wait for a reply from the slowest standby in the
| required number of standbys in priority order, which may
| be slower than the rest.

Quorum-based synchronous replication is expected to be more
efficient than the priority-based one when your master doesn't need
to be in sync with all of the standbys nominated in
<varname>synchronous_standby_names</>. While a quorum-based
replication master waits only for the specified number of fastest
standbys, a priority-based replication master must wait for the
standbys at the top of the list, which may be slower than the
rest.
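
The difference can be shown with a toy latency model. This is only a
sketch under the assumption that every listed standby is connected and
healthy; the function names and the acknowledgement latencies are
invented for the example.

```python
def priority_commit_latency(ack_latency_ms, listed, num_sync):
    """FIRST num_sync (...): wait for the first num_sync standbys in list order."""
    chosen = listed[:num_sync]
    return max(ack_latency_ms[name] for name in chosen)


def quorum_commit_latency(ack_latency_ms, listed, num_sync):
    """ANY num_sync (...): wait for the num_sync fastest listed standbys."""
    ordered = sorted(ack_latency_ms[name] for name in listed)
    return ordered[num_sync - 1]


# Hypothetical acknowledgement latencies in milliseconds; s1 is the slow
# standby that happens to sit at the top of the list.
ack_latency_ms = {"s1": 40.0, "s2": 5.0, "s3": 7.0}
listed = ["s1", "s2", "s3"]

priority_wait = priority_commit_latency(ack_latency_ms, listed, 2)
quorum_wait = quorum_commit_latency(ack_latency_ms, listed, 2)

# When every listed standby must acknowledge, the two methods wait equally long.
priority_all = priority_commit_latency(ack_latency_ms, listed, 3)
quorum_all = quorum_commit_latency(ack_latency_ms, listed, 3)
```

With two of three standbys required, the priority method waits for the
slow s1 while the quorum method does not. Requiring all three removes
the difference, which is why the wording needs to say "not all".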

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#100

Noah Misch

noah@leadboat.com

over 8 years ago

In reply to: Masahiko Sawada (#96)

Re: Quorum commit for multiple synchronous replication.

On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.

As I told firstly this is not a bug. There are some proposals for better design
of priority column in pg_stat_replication, but we've not reached the consensus
yet. So I think that it's better to move this open item to "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

Okay, but ...

FWIW, the reason for the current behavior is that it is useful for
users who want to switch from ANY to FIRST: they can see which
standbys would become sync or potential.

... does this mean you personally want to keep the current behavior? If not,
has some other person stated a wish to keep the current behavior?


#101

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Noah Misch (#100)

Re: Quorum commit for multiple synchronous replication.

On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.

As I told firstly this is not a bug. There are some proposals for better design
of priority column in pg_stat_replication, but we've not reached the consensus
yet. So I think that it's better to move this open item to "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

Okay, but ...

FWIW, the reason for the current behavior is that it is useful for
users who want to switch from ANY to FIRST: they can see which
standbys would become sync or potential.

... does this mean you personally want to keep the current behavior? If not,
has some other person stated a wish to keep the current behavior?

No, I want to change the current behavior. IMO it's better to set
priority 1 for all standbys in the quorum set. I guess no one
supports the current behavior any longer.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#102

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 8 years ago

In reply to: Masahiko Sawada (#101)

Re: Quorum commit for multiple synchronous replication.

At Fri, 21 Apr 2017 13:20:05 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCU+ch4b2O0iW-b_BnUs7oMcT8pcwM690XVu134k=cA+Q@mail.gmail.com>

On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

On Sun, Apr 16, 2017 at 1:19 PM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Apr 14, 2017 at 11:58:23PM -0400, Noah Misch wrote:

On Wed, Apr 05, 2017 at 09:51:02PM -0400, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

(3)
The priority value is assigned to each standby listed in s_s_names
even in quorum commit though those priority values are not used at all.
Users can see those priority values in pg_stat_replication.
Isn't this confusing? If yes, it might be better to always assign 1 as
the priority, for example.

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

Since you do want (3) to change, please own it like any other open item,
including the mandatory status updates.

Likewise.

As I told firstly this is not a bug. There are some proposals for better design
of priority column in pg_stat_replication, but we've not reached the consensus
yet. So I think that it's better to move this open item to "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

Okay, but ...

FWIW, the reason for the current behavior is that it is useful for
users who want to switch from ANY to FIRST: they can see which
standbys would become sync or potential.

... does this mean you personally want to keep the current behavior? If not,
has some other person stated a wish to keep the current behavior?

No, I want to change the current behavior. IMO it's better to set
priority 1 for all standbys in the quorum set. I guess no one
supports the current behavior any longer.

+1 for the latter. For the former, I'd like the field to distinguish
standbys that are in sync from those that are not, if we can
afford the additional complexity.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#103

Noah Misch

noah@leadboat.com

over 8 years ago

In reply to: Masahiko Sawada (#101)

Re: Quorum commit for multiple synchronous replication.

On Fri, Apr 21, 2017 at 01:20:05PM +0900, Masahiko Sawada wrote:

On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

As I said at the start, this is not a bug. There are some proposals for a better design
of the priority column in pg_stat_replication, but we have not reached consensus
yet. So I think it's better to move this open item to the "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

Okay, but ...

FWIW the reason for the current behavior is that it would be useful for a
user who wants to switch from ANY to FIRST. They can see which
standbys would become sync or potential.

... does this mean you personally want to keep the current behavior? If not,
has some other person stated a wish to keep the current behavior?

No, I want to change the current behavior. IMO it's better to set
priority 1 for all standbys in the quorum set. I guess nobody
supports the current behavior any longer.

In that case, this open item is not eligible for section "Design Decisions to
Recheck Mid-Beta". That section is for items where we'll probably change
nothing, but we plan to recheck later just in case. Here, we expect to change
the behavior; the open question is which replacement behavior to prefer.

Fujii, as the owner of this open item, you are responsible for moderating the
debate until there's adequate consensus to make a particular change or to keep
the current behavior after all. Please proceed to do that. Beta testers
deserve a UI they may like, not a UI you already plan to change later.

Thanks,
nm


#104

Noah Misch

noah@leadboat.com

over 8 years ago

In reply to: Noah Misch (#103)

Re: Quorum commit for multiple synchronous replication.

On Thu, Apr 20, 2017 at 11:34:34PM -0700, Noah Misch wrote:

On Fri, Apr 21, 2017 at 01:20:05PM +0900, Masahiko Sawada wrote:

On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

As I said at the start, this is not a bug. There are some proposals for a better design
of the priority column in pg_stat_replication, but we have not reached consensus
yet. So I think it's better to move this open item to the "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

Okay, but ...

FWIW the reason for the current behavior is that it would be useful for a
user who wants to switch from ANY to FIRST. They can see which
standbys would become sync or potential.

... does this mean you personally want to keep the current behavior? If not,
has some other person stated a wish to keep the current behavior?

No, I want to change the current behavior. IMO it's better to set
priority 1 for all standbys in the quorum set. I guess nobody
supports the current behavior any longer.

In that case, this open item is not eligible for section "Design Decisions to
Recheck Mid-Beta". That section is for items where we'll probably change
nothing, but we plan to recheck later just in case. Here, we expect to change
the behavior; the open question is which replacement behavior to prefer.

Fujii, as the owner of this open item, you are responsible for moderating the
debate until there's adequate consensus to make a particular change or to keep
the current behavior after all. Please proceed to do that. Beta testers
deserve a UI they may like, not a UI you already plan to change later.

Please observe the policy on open item ownership[1] and send a status update
within three calendar days of this message. Include a date for your
subsequent status update.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


#105

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Kyotaro HORIGUCHI (#99)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Thu, Apr 20, 2017 at 9:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Ok, I got the point.

At Wed, 19 Apr 2017 17:39:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170419.173901.16598616.horiguchi.kyotaro@lab.ntt.co.jp>

| <para>
| Quorum-based synchronous replication is basically more
| efficient than priority-based one when you specify multiple
| standbys in <varname>synchronous_standby_names</> and want
| to synchronously replicate transactions to two or more of
| them.

"Some" means "not all".

| In the priority-based case, the replication master
| must wait for a reply from the slowest standby in the
| required number of standbys in priority order, which may
| be slower than the rest.

Quorum-based synchronous replication is expected to be more
efficient than a priority-based one when your master doesn't need
to be in sync with all of the standbys nominated by
<varname>synchronous_standby_names</>. While a quorum-based
replication master waits only for the specified number of fastest
standbys, a priority-based replication master must wait for
standbys at the top of the list, which may be slower than the
rest.

This description looks good to me. I've updated the patch based on
this description and attached it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

quorum_repl_doc_improve_v3.patch (application/octet-stream)

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 9e2be5f..9a3a498 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1228,6 +1228,16 @@ synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
     the rate of generation of WAL data.
    </para>
 
+   <para>
+    A quorum-based synchronous replication is basically more efficient than
+    a priority-based one when you specify multiple standbys in
+    <varname>synchronous_standby_names</> and your master doesn't need to
+    be in synchronous with all of the nominated standbys by it. While quorum-based
+    synchronous replication master waits only for a specified number of fastest
+    standbys, priority-based synchronous replication master must wait for standbys
+    chosen based on their priorities, which may be slower than others.
+   </para>
+
    </sect3>
 
    <sect3 id="synchronous-replication-ha">

#106

Fujii Masao

masao.fujii@gmail.com

over 8 years ago

In reply to: Noah Misch (#104)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Mon, Apr 24, 2017 at 9:02 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 20, 2017 at 11:34:34PM -0700, Noah Misch wrote:

On Fri, Apr 21, 2017 at 01:20:05PM +0900, Masahiko Sawada wrote:

On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

As I said at the start, this is not a bug. There are some proposals for a better design
of the priority column in pg_stat_replication, but we have not reached consensus
yet. So I think it's better to move this open item to the "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

Okay, but ...

FWIW the reason for the current behavior is that it would be useful for a
user who wants to switch from ANY to FIRST. They can see which
standbys would become sync or potential.

... does this mean you personally want to keep the current behavior? If not,
has some other person stated a wish to keep the current behavior?

No, I want to change the current behavior. IMO it's better to set
priority 1 for all standbys in the quorum set. I guess nobody
supports the current behavior any longer.

In that case, this open item is not eligible for section "Design Decisions to
Recheck Mid-Beta". That section is for items where we'll probably change
nothing, but we plan to recheck later just in case. Here, we expect to change
the behavior; the open question is which replacement behavior to prefer.

Fujii, as the owner of this open item, you are responsible for moderating the
debate until there's adequate consensus to make a particular change or to keep
the current behavior after all. Please proceed to do that. Beta testers
deserve a UI they may like, not a UI you already plan to change later.

Please observe the policy on open item ownership[1] and send a status update
within three calendar days of this message. Include a date for your
subsequent status update.

Okay, so our consensus is to always set the priorities of sync standbys
to 1 in quorum-based syncrep case. Attached patch does this change.
Barring any objection, I will commit this.

I will commit something to close this open item by April 28th at the latest
(IOW before my vacation starts).
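
For readers who want the rule in isolation, here is a minimal standalone sketch of the reporting rule described above. It is not the code from the attached patch: the enum, the function name reported_priority, and the exact case-sensitive name matching are simplifications assumed for illustration, and wildcard entries are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the two sync rep methods discussed here. */
typedef enum { METHOD_PRIORITY, METHOD_QUORUM } SyncMethod;

/*
 * Return the priority that would be reported for the standby `name`:
 *   0  if the standby is not listed (it is asynchronous),
 *   its 1-based position in the list for the priority method,
 *   1  for every listed standby in the quorum method.
 */
int
reported_priority(SyncMethod method, const char *const *listed,
                  size_t nlisted, const char *name)
{
    assert(listed != NULL && name != NULL);

    for (size_t i = 0; i < nlisted; i++)
    {
        if (strcmp(listed[i], name) == 0)
            return (method == METHOD_PRIORITY) ? (int) (i + 1) : 1;
    }

    /* Not found in the list: treated as an async standby. */
    return 0;
}
```

With the list (standby1, standby2), this sketch reports 2 for standby2 under the priority method and 1 under the quorum method, which matches the change to the regression test in the attached patch.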

Regards,

--
Fujii Masao

Attachments:

sync_priority.patch (application/octet-stream)

*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 951,957 **** SyncRepGetStandbyPriority(void)
  		standby_name += strlen(standby_name) + 1;
  	}
  
! 	return (found ? priority : 0);
  }
  
  /*
--- 951,964 ----
  		standby_name += strlen(standby_name) + 1;
  	}
  
! 	if (!found)
! 		return 0;
! 
! 	/*
! 	 * In quorum-based sync replication, all the standbys in the list
! 	 * have the same priority, one.
! 	 */
! 	return (SyncRepConfig->syncrep_method == SYNC_REP_PRIORITY) ? priority : 1;
  }
  
  /*
*** a/src/test/recovery/t/007_sync_rep.pl
--- b/src/test/recovery/t/007_sync_rep.pl
***************
*** 186,192 **** standby4|0|async),
  # for sync standbys in a quorum-based sync replication.
  test_sync_state(
  $node_master, qq(standby1|1|quorum
! standby2|2|quorum
  standby4|0|async),
  '2 quorum and 1 async',
  'ANY 2(standby1, standby2)');
--- 186,192 ----
  # for sync standbys in a quorum-based sync replication.
  test_sync_state(
  $node_master, qq(standby1|1|quorum
! standby2|1|quorum
  standby4|0|async),
  '2 quorum and 1 async',
  'ANY 2(standby1, standby2)');

#107

Fujii Masao

masao.fujii@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#105)

Re: Quorum commit for multiple synchronous replication.

On Mon, Apr 24, 2017 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 20, 2017 at 9:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Ok, I got the point.

At Wed, 19 Apr 2017 17:39:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170419.173901.16598616.horiguchi.kyotaro@lab.ntt.co.jp>

| <para>
| Quorum-based synchronous replication is basically more
| efficient than priority-based one when you specify multiple
| standbys in <varname>synchronous_standby_names</> and want
| to synchronously replicate transactions to two or more of
| them.

"Some" means "not all".

| In the priority-based case, the replication master
| must wait for a reply from the slowest standby in the
| required number of standbys in priority order, which may
| be slower than the rest.

Quorum-based synchronous replication is expected to be more
efficient than a priority-based one when your master doesn't need
to be in sync with all of the standbys nominated by
<varname>synchronous_standby_names</>.

This description may be invalid in the case where the requested number
of sync standbys is smaller than the number of standbys "nominated" by
s_s_names. For example, imagine the case where there are five
standbys nominated by s_s_names, the requested number of sync standbys
is 2, and only two sync standbys are running. In this case, the master
needs to wait for those two standbys whatever the sync rep method is.
I think that we should rewrite that to something like "quorum-based
synchronous replication is more efficient when the requested number
of synchronous standbys is smaller than the number of potential
synchronous standbys running".
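
To make the comparison concrete, the following is a rough model of the commit wait under each method. It is only a sketch under stated assumptions: the Standby struct, the per-standby ack_ms values, and both function names are hypothetical, and real ACK timing is of course not a fixed number per standby.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one standby listed in s_s_names. */
typedef struct
{
    int connected;  /* nonzero if the standby is currently running */
    int ack_ms;     /* assumed time until its ACK reaches the master */
} Standby;

/*
 * Priority method (FIRST n): the master waits for the first n connected
 * standbys in list order, so the wait is the slowest ACK among them.
 * Returns -1 if fewer than n listed standbys are connected.
 */
int
priority_wait_ms(const Standby *s, size_t len, size_t n)
{
    int    worst = 0;
    size_t chosen = 0;

    assert(s != NULL && n > 0);
    for (size_t i = 0; i < len && chosen < n; i++)
    {
        if (!s[i].connected)
            continue;
        if (s[i].ack_ms > worst)
            worst = s[i].ack_ms;
        chosen++;
    }
    return (chosen == n) ? worst : -1;
}

/*
 * Quorum method (ANY n): the master waits for the n fastest ACKs among
 * all connected listed standbys, so the wait is the n-th smallest ack_ms.
 * Returns -1 if fewer than n listed standbys are connected.
 */
int
quorum_wait_ms(const Standby *s, size_t len, size_t n)
{
    size_t connected = 0;

    assert(s != NULL && n > 0);
    for (size_t i = 0; i < len; i++)
        if (s[i].connected)
            connected++;
    if (connected < n)
        return -1;

    for (size_t i = 0; i < len; i++)
    {
        size_t rank = 0;    /* number of connected standbys faster than i */

        if (!s[i].connected)
            continue;
        for (size_t j = 0; j < len; j++)
        {
            if (!s[j].connected)
                continue;
            /* Ties are broken by list position so ranks are unique. */
            if (s[j].ack_ms < s[i].ack_ms ||
                (s[j].ack_ms == s[i].ack_ms && j < i))
                rank++;
        }
        if (rank == n - 1)
            return s[i].ack_ms;
    }
    return -1;              /* not reached when connected >= n */
}
```

With five listed standbys whose assumed ACK times are 50, 5, 8, 3 and 20 ms, all running, and n = 2, this model gives a wait of 50 ms for the priority method and 5 ms for the quorum method. If only the first and the last of them are running, both methods wait 50 ms, which is the case described above where the method makes no difference.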

While quorum-based

replication master waits only for a specified number of fastest
standbys, a priority-based replication master must wait for
standbys at the top of the list, which may be slower than the
rest.

This description looks good to me. I've updated the patch based on
this description and attached it.

But I still think that the original description that I used in my patch is
better than this....

Regards,

--
Fujii Masao


#108

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Fujii Masao (#106)

Re: Quorum commit for multiple synchronous replication.

On Tue, Apr 25, 2017 at 12:56 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Mon, Apr 24, 2017 at 9:02 AM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 20, 2017 at 11:34:34PM -0700, Noah Misch wrote:

On Fri, Apr 21, 2017 at 01:20:05PM +0900, Masahiko Sawada wrote:

On Fri, Apr 21, 2017 at 12:02 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Apr 19, 2017 at 01:52:53PM +0900, Masahiko Sawada wrote:

On Wed, Apr 19, 2017 at 12:34 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Apr 16, 2017 at 07:25:28PM +0900, Fujii Masao wrote:

As I said at the start, this is not a bug. There are some proposals for a better design
of the priority column in pg_stat_replication, but we have not reached consensus
yet. So I think it's better to move this open item to the "Design Decisions to
Recheck Mid-Beta" section so that we can hear more opinions.

I'm reading that some people want to report NULL priority, some people want to
report a constant 1 priority, and nobody wants the current behavior. Is that
an accurate summary?

Yes, I think that's correct.

Okay, but ...

FWIW the reason for the current behavior is that it would be useful for a
user who wants to switch from ANY to FIRST. They can see which
standbys would become sync or potential.

... does this mean you personally want to keep the current behavior? If not,
has some other person stated a wish to keep the current behavior?

No, I want to change the current behavior. IMO it's better to set
priority 1 for all standbys in the quorum set. I guess nobody
supports the current behavior any longer.

In that case, this open item is not eligible for section "Design Decisions to
Recheck Mid-Beta". That section is for items where we'll probably change
nothing, but we plan to recheck later just in case. Here, we expect to change
the behavior; the open question is which replacement behavior to prefer.

Fujii, as the owner of this open item, you are responsible for moderating the
debate until there's adequate consensus to make a particular change or to keep
the current behavior after all. Please proceed to do that. Beta testers
deserve a UI they may like, not a UI you already plan to change later.

Please observe the policy on open item ownership[1] and send a status update
within three calendar days of this message. Include a date for your
subsequent status update.

Okay, so our consensus is to always set the priorities of sync standbys
to 1 in quorum-based syncrep case. Attached patch does this change.
Barring any objection, I will commit this.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#109

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 8 years ago

In reply to: Fujii Masao (#107)

Re: Quorum commit for multiple synchronous replication.

At Tue, 25 Apr 2017 01:13:12 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwFZHQXfu04d+FwOOgFzvXdRoRvPrU6jFQJRF2BPLkADsQ@mail.gmail.com>

On Mon, Apr 24, 2017 at 2:55 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Apr 20, 2017 at 9:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Ok, I got the point.

At Wed, 19 Apr 2017 17:39:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170419.173901.16598616.horiguchi.kyotaro@lab.ntt.co.jp>

| <para>
| Quorum-based synchronous replication is basically more
| efficient than priority-based one when you specify multiple
| standbys in <varname>synchronous_standby_names</> and want
| to synchronously replicate transactions to two or more of
| them.

"Some" means "not all".

| In the priority-based case, the replication master
| must wait for a reply from the slowest standby in the
| required number of standbys in priority order, which may
| be slower than the rest.

Quorum-based synchronous replication is expected to be more
efficient than a priority-based one when your master doesn't need
to be in sync with all of the standbys nominated by
<varname>synchronous_standby_names</>.

This description may be invalid in the case where the requested number
of sync standbys is smaller than the number of standbys "nominated" by
s_s_names. For example, imagine the case where there are five
standbys nominated by s_s_names, the requested number of sync standbys
is 2, and only two sync standbys are running. In this case, the master
needs to wait for those two standbys whatever the sync rep method is.

Hmm. The 'nominated' standbys are the standbys whose names are
listed in s_s_names. "Your master doesn't need to be in sync
with all of" means "the number of sync standbys is smaller than the
number of ...". So they seem to be the same... to me.

I think that we should rewrite that to something like "quorum-based
synchronous replication is more efficient when the requested number
of synchronous standbys is smaller than the number of potential
synchronous standbys running".

As for this phrase, "potential sync standbys" means "nominated
standbys".

While quorum-based

replication master waits only for a specified number of fastest
standbys, a priority-based replication master must wait for
standbys at the top of the list, which may be slower than the
rest.

This description looks good to me. I've updated the patch based on
this description and attached it.

But I still think that the original description that I used in my patch is
better than this....

I'm not good at composition, so I cannot insist on my
proposal. For the convenience of others, here is the proposal
from Fujii-san.

+     A quorum-based synchronous replication is basically more efficient than
+     a priority-based one when you specify multiple standbys in
+     <varname>synchronous_standby_names</> and want to replicate
+     the transactions to some of them synchronously. In this case,
+     the transactions in a priority-based synchronous replication must wait for
+     reply from the slowest standby in synchronous standbys chosen based on
+     their priorities, and which may increase the transaction latencies.
+     On the other hand, using a quorum-based synchronous replication may
+     improve those latencies because it makes the transactions wait only for
+     replies from the requested number of faster standbys in all the listed
+     standbys, i.e., such slow standby doesn't block the transactions.

--
Kyotaro Horiguchi
NTT Open Source Software Center


#110

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 8 years ago

In reply to: Masahiko Sawada (#108)

Re: Quorum commit for multiple synchronous replication.

At Tue, 25 Apr 2017 09:22:59 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAG88zYUwhV9L5muNX-qPSB+AgzerFDD0JDDVoM25gKKw@mail.gmail.com>

Please observe the policy on open item ownership[1] and send a status update
within three calendar days of this message. Include a date for your
subsequent status update.

Okay, so our consensus is to always set the priorities of sync standbys
to 1 in quorum-based syncrep case. Attached patch does this change.
Barring any objection, I will commit this.

+1

Ok, +1 from me.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#111

Amit Kapila

amit.kapila16@gmail.com

over 8 years ago

In reply to: Kyotaro HORIGUCHI (#109)

Re: Quorum commit for multiple synchronous replication.

On Tue, Apr 25, 2017 at 2:09 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I'm not good at composition, so I cannot insist on my
proposal. For the convenience of others, here is the proposal
from Fujii-san.

Do you see any problem with the below proposal? To me, this sounds reasonable.

+     A quorum-based synchronous replication is basically more efficient than
+     a priority-based one when you specify multiple standbys in
+     <varname>synchronous_standby_names</> and want to replicate
+     the transactions to some of them synchronously. In this case,
+     the transactions in a priority-based synchronous replication must wait for
+     reply from the slowest standby in synchronous standbys chosen based on
+     their priorities, and which may increase the transaction latencies.
+     On the other hand, using a quorum-based synchronous replication may
+     improve those latencies because it makes the transactions wait only for
+     replies from the requested number of faster standbys in all the listed
+     standbys, i.e., such slow standby doesn't block the transactions.

Can we make a few modifications, like:
improve those latencies --> reduce those latencies
such slow standby --> a slow standby

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#112

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Amit Kapila (#111)

Re: Quorum commit for multiple synchronous replication.

On Tue, Apr 25, 2017 at 8:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Apr 25, 2017 at 2:09 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I'm not good at composition, so I cannot insist on my
proposal. For the convenience of others, here is the proposal
from Fujii-san.

Do you see any problem with the below proposal?
To me, this sounds reasonable.

I agree.

+     A quorum-based synchronous replication is basically more efficient than
+     a priority-based one when you specify multiple standbys in
+     <varname>synchronous_standby_names</> and want to replicate
+     the transactions to some of them synchronously. In this case,
+     the transactions in a priority-based synchronous replication must wait for
+     reply from the slowest standby in synchronous standbys chosen based on
+     their priorities, and which may increase the transaction latencies.
+     On the other hand, using a quorum-based synchronous replication may
+     improve those latencies because it makes the transactions wait only for
+     replies from the requested number of faster standbys in all the listed
+     standbys, i.e., such slow standby doesn't block the transactions.

Can we make a few modifications, like:
improve those latencies --> reduce those latencies
such slow standby --> a slow standby

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#113

Fujii Masao

masao.fujii@gmail.com

over 8 years ago

In reply to: Kyotaro HORIGUCHI (#110)

Re: Quorum commit for multiple synchronous replication.

On Tue, Apr 25, 2017 at 5:41 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Tue, 25 Apr 2017 09:22:59 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoAG88zYUwhV9L5muNX-qPSB+AgzerFDD0JDDVoM25gKKw@mail.gmail.com>

Please observe the policy on open item ownership[1] and send a status update
within three calendar days of this message. Include a date for your
subsequent status update.

Okay, so our consensus is to always set the priorities of sync standbys
to 1 in quorum-based syncrep case. Attached patch does this change.
Barring any objection, I will commit this.

+1

Ok, +1 from me.

Pushed. Thanks!

Regards,

--
Fujii Masao


#114

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

over 8 years ago

In reply to: Masahiko Sawada (#112)

Re: Quorum commit for multiple synchronous replication.

At Tue, 25 Apr 2017 21:21:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoBqpMzQ3hnLjOrAj1PX__Bqo9XWUhSX9hzAewdbQP9QKg@mail.gmail.com>

On Tue, Apr 25, 2017 at 8:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Apr 25, 2017 at 2:09 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I'm not good at composition, so I cannot insist on my
proposal. For the convenience of others, here is the proposal
from Fujii-san.

Do you see any problem with the below proposal?
To me, this sounds reasonable.

I agree.

Ok, I give up:p Thanks for shoving me.

+     A quorum-based synchronous replication is basically more efficient than
+     a priority-based one when you specify multiple standbys in
+     <varname>synchronous_standby_names</> and want to replicate
+     the transactions to some of them synchronously. In this case,
+     the transactions in a priority-based synchronous replication must wait for
+     reply from the slowest standby in synchronous standbys chosen based on
+     their priorities, and which may increase the transaction latencies.
+     On the other hand, using a quorum-based synchronous replication may
+     improve those latencies because it makes the transactions wait only for
+     replies from the requested number of faster standbys in all the listed
+     standbys, i.e., such slow standby doesn't block the transactions.

Can we make a few modifications, like:
improve those latencies --> reduce those latencies
such slow standby --> a slow standby

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Kyotaro Horiguchi
NTT Open Source Software Center


#115

Noah Misch

noah@leadboat.com

over 8 years ago

In reply to: Petr Jelinek (#77)

Re: Quorum commit for multiple synchronous replication.

On Thu, Apr 06, 2017 at 08:55:37AM +0200, Petr Jelinek wrote:

On 06/04/17 03:51, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them before the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
priority-based sync replication is chosen to keep backward
compatibility. However, some hackers argued for changing this decision
so that quorum commit is chosen, because they think that most users
prefer quorum.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that point. So we need more time for users to try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

I was one of the people who said in original thread that I think the
default behavior should change to quorum and I am still of that opinion.

This item appears under "decisions to recheck mid-beta". If anyone is going
to push for a change here, now is the time.
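
For reference, the default being rechecked can be sketched as a small classifier over the s_s_names value. This is an illustrative simplification, not the actual s_s_names parser: the enum and the names has_keyword and method_for are assumed for the example, and it only looks for a leading ANY keyword followed by whitespace.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two sync rep methods. */
typedef enum { METHOD_PRIORITY, METHOD_QUORUM } SyncMethod;

/*
 * Case-insensitive check that `s` starts with the upper-case keyword `kw`
 * and that the keyword is followed by whitespace.
 */
int
has_keyword(const char *s, const char *kw)
{
    size_t i = 0;

    for (; kw[i] != '\0'; i++)
    {
        /* Also stops safely if `s` is shorter than `kw`. */
        if (toupper((unsigned char) s[i]) != kw[i])
            return 0;
    }
    return isspace((unsigned char) s[i]) != 0;
}

/*
 * Classify an s_s_names value. Only an explicit ANY selects quorum commit;
 * FIRST, the older "N (...)" form, and a plain list all fall through to
 * the priority method, which is the backward-compatible default.
 */
SyncMethod
method_for(const char *value)
{
    assert(value != NULL);

    while (isspace((unsigned char) *value))
        value++;
    if (has_keyword(value, "ANY"))
        return METHOD_QUORUM;
    return METHOD_PRIORITY;
}
```

Under this sketch, 'ANY 2 (s1, s2, s3)' is classified as quorum, while 'FIRST 2 (s1, s2, s3)', '2 (s1, s2, s3)' and 's1, s2' are all classified as priority. Changing the default, as proposed in item (1), would amount to changing only the final fall-through case.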


#116

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Noah Misch (#115)

Re: Quorum commit for multiple synchronous replication.

On Fri, Jul 28, 2017 at 2:24 PM, Noah Misch <noah@leadboat.com> wrote:

On Thu, Apr 06, 2017 at 08:55:37AM +0200, Petr Jelinek wrote:

On 06/04/17 03:51, Noah Misch wrote:

On Thu, Apr 06, 2017 at 12:48:56AM +0900, Fujii Masao wrote:

On Wed, Apr 5, 2017 at 3:45 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Dec 19, 2016 at 09:49:58PM +0900, Fujii Masao wrote:

Regarding this feature, there are some loose ends. We should work on
and complete them before the release.

(1)
Which synchronous replication method, priority or quorum, should be
chosen when neither FIRST nor ANY is specified in s_s_names? Right now,
priority-based sync replication is chosen to keep backward
compatibility. However, some hackers argued for changing this decision
so that quorum commit is chosen, because they think that most users
prefer quorum.

The items (1) and (3) are not bugs. So I don't think that they need to be
resolved before the beta release. After the feature freeze, many users
will try and play with many new features including quorum-based syncrep.
Then if many of them complain about (1) and (3), we can change the code
at that point. So we need more time for users to try the feature.

I've moved (1) to a new section for things to revisit during beta. If someone
feels strongly that the current behavior is Wrong and must change, speak up as
soon as you reach that conclusion. Absent such arguments, the behavior won't
change.

I was one of the people who said in the original thread that I think the
default behavior should change to quorum, and I am still of that opinion.

This item appears under "decisions to recheck mid-beta". If anyone is going
to push for a change here, now is the time.

It has been 1 week since the previous mail. I thought that there were
others who argued for changing the behavior of the old-style setting so
that quorum commit is chosen. If nobody is going to push for a change,
can we live with the current behavior?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#117

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#116)

Re: Quorum commit for multiple synchronous replication.

On Fri, Aug 4, 2017 at 8:19 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 28, 2017 at 2:24 PM, Noah Misch <noah@leadboat.com> wrote:

This item appears under "decisions to recheck mid-beta". If anyone is going
to push for a change here, now is the time.

It has been 1 week since the previous mail. I thought that there were
others who argued for changing the behavior of the old-style setting so
that quorum commit is chosen. If nobody is going to push for a change,
can we live with the current behavior?

FWIW, I still see no harm in keeping backward compatibility here, so I
am in favor of the status quo.
--
Michael

#118

Josh Berkus

josh@berkus.org

over 8 years ago

In reply to: Michael Paquier (#117)

Re: Quorum commit for multiple synchronous replication.

On 08/09/2017 10:49 PM, Michael Paquier wrote:

On Fri, Aug 4, 2017 at 8:19 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 28, 2017 at 2:24 PM, Noah Misch <noah@leadboat.com> wrote:

This item appears under "decisions to recheck mid-beta". If anyone is going
to push for a change here, now is the time.

It has been 1 week since the previous mail. I thought that there were
others who argued for changing the behavior of the old-style setting so
that quorum commit is chosen. If nobody is going to push for a change,
can we live with the current behavior?

FWIW, I still see no harm in keeping backward compatibility here, so I
am in favor of the status quo.

I am vaguely in favor of making quorum the default over "ordered".
However, given that anybody using sync commit without
understanding/customizing the setup is going to be sorry regardless,
keeping backwards compatibility is acceptable.

--
Josh Berkus
Containers & Databases Oh My!

#119

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Josh Berkus (#118)

Re: Quorum commit for multiple synchronous replication.

On Fri, Aug 11, 2017 at 1:40 AM, Josh Berkus <josh@berkus.org> wrote:

On 08/09/2017 10:49 PM, Michael Paquier wrote:

On Fri, Aug 4, 2017 at 8:19 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jul 28, 2017 at 2:24 PM, Noah Misch <noah@leadboat.com> wrote:

This item appears under "decisions to recheck mid-beta". If anyone is going
to push for a change here, now is the time.

It has been 1 week since the previous mail. I thought that there were
others who argued for changing the behavior of the old-style setting so
that quorum commit is chosen. If nobody is going to push for a change,
can we live with the current behavior?

FWIW, I still see no harm in keeping backward compatibility here, so I
am in favor of the status quo.

I am vaguely in favor of making quorum the default over "ordered".
However, given that anybody using sync commit without
understanding/customizing the setup is going to be sorry regardless,
keeping backwards compatibility is acceptable.

Thank you for the comment.

FWIW, in my opinion, if the current behavior of 'N(a,b)' could confuse
users and we want to break backward compatibility, I'd rather remove
that style in PostgreSQL 10 and raise a syntax error to the user for
more safety. Also, since the syntax 'a, b' might be opaque for new
users who don't know the history of the s_s_names syntax, we could
unify the syntax to '[ANY|FIRST] N (a, b, ...)' while keeping
the '*'.
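
For reference, the forms of s_s_names under discussion would look
roughly like this in postgresql.conf. The standby names a, b and c are
placeholders, and the comments reflect the behavior described in this
thread, where the forms without a keyword are treated as priority-based.

```
# Explicit priority method: wait for the first 2 connected standbys in list order.
synchronous_standby_names = 'FIRST 2 (a, b, c)'

# Explicit quorum method: wait for any 2 of the listed standbys.
synchronous_standby_names = 'ANY 2 (a, b, c)'

# Older forms without a keyword, currently treated as priority-based.
synchronous_standby_names = '2 (a, b, c)'   # behaves like FIRST 2 (a, b, c)
synchronous_standby_names = 'a, b, c'       # behaves like FIRST 1 (a, b, c)
```

Only one of these lines would be active at a time; they are listed
together here purely for comparison.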

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#120

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#119)

Re: Quorum commit for multiple synchronous replication.

On Wed, Aug 16, 2017 at 4:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

FWIW, in my opinion, if the current behavior of 'N(a,b)' could confuse
users and we want to break backward compatibility, I'd rather remove
that style in PostgreSQL 10 and raise a syntax error to the user for
more safety. Also, since the syntax 'a, b' might be opaque for new
users who don't know the history of the s_s_names syntax, we could
unify the syntax to '[ANY|FIRST] N (a, b, ...)' while keeping
the '*'.

I find the removal of a syntax in release N for something introduced
in release (N - 1) a bit hard to swallow from the user's perspective.
What about just issuing a warning instead, saying that the use of
ANY/FIRST is recommended? It costs nothing in maintenance to keep it
around.
--
Michael

#121

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Michael Paquier (#120)

1 attachment(s)

Re: Quorum commit for multiple synchronous replication.

On Wed, Aug 16, 2017 at 4:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Aug 16, 2017 at 4:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

FWIW, in my opinion, if the current behavior of 'N(a,b)' could confuse
users and we want to break backward compatibility, I'd rather remove
that style in PostgreSQL 10 and raise a syntax error to the user for
more safety. Also, since the syntax 'a, b' might be opaque for new
users who don't know the history of the s_s_names syntax, we could
unify the syntax to '[ANY|FIRST] N (a, b, ...)' while keeping
the '*'.

I find the removal of a syntax in release N for something introduced
in release (N - 1) a bit hard to swallow from the user's perspective.
What about just issuing a warning instead, saying that the use of
ANY/FIRST is recommended? It costs nothing in maintenance to keep it
around.

Yeah, I think that would be better. If we decide to not make quorum
commit the default we can issue a warning in docs. Attached a draft
patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

warning_s_s_names.patch (application/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2b6255e..d8a3014 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3172,6 +3172,12 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
         as a synchronous standby.
        </para>
        <para>
+        Since the behavior of both the first syntax without
+        <literal>FIRST</literal> and the third syntax could be changed in a
+        future release, the use of <literal>FIRST</literal> and <literal>ANY</literal>
+        explicitly is recommended.
+       </para>
+       <para>
         The special entry <literal>*</> matches any standby name.
        </para>
        <para>

#122

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#121)

Re: Quorum commit for multiple synchronous replication.

On Thu, Aug 17, 2017 at 2:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Aug 16, 2017 at 4:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Aug 16, 2017 at 4:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

FWIW, in my opinion, if the current behavior of 'N(a,b)' could confuse
users and we want to break backward compatibility, I'd rather remove
that style in PostgreSQL 10 and raise a syntax error to the user for
more safety. Also, since the syntax 'a, b' might be opaque for new
users who don't know the history of the s_s_names syntax, we could
unify the syntax to '[ANY|FIRST] N (a, b, ...)' while keeping
the '*'.

I find the removal of a syntax in release N for something introduced
in release (N - 1) a bit hard to swallow from the user's perspective.
What about just issuing a warning instead, saying that the use of
ANY/FIRST is recommended? It costs nothing in maintenance to keep it
around.

Yeah, I think that would be better. If we decide to not make quorum
commit the default we can issue a warning in docs. Attached a draft
patch.

I had in mind an ereport(WARNING) in create_syncrep_config. Extra
thoughts/opinions welcome.
--
Michael

#123

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Michael Paquier (#122)

Re: Quorum commit for multiple synchronous replication.

On Thu, Aug 17, 2017 at 1:13 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

I had in mind an ereport(WARNING) in create_syncrep_config. Extra
thoughts/opinions welcome.

I think for v10 we should just document the behavior we've got; I
think it's too late to be whacking things around now.

For v11, we could emit a warning if we plan to deprecate and
eventually remove the syntax without ANY/FIRST, but let's not do:

WARNING: what you did is ok, but you might have wanted to do something else

First of all, whether or not that can properly be called a warning is
highly debatable. Also, if you do that sort of thing to your spouse
and/or children, they call it "nagging". I don't think users will
like it any more than family members do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#124

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Robert Haas (#123)

Re: Quorum commit for multiple synchronous replication.

On Sat, Aug 19, 2017 at 12:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 17, 2017 at 1:13 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

I had in mind an ereport(WARNING) in create_syncrep_config. Extra
thoughts/opinions welcome.

I think for v10 we should just document the behavior we've got; I
think it's too late to be whacking things around now.

For v11, we could emit a warning if we plan to deprecate and
eventually remove the syntax without ANY/FIRST, but let's not do:

WARNING: what you did is ok, but you might have wanted to do something else

First of all, whether or not that can properly be called a warning is
highly debatable. Also, if you do that sort of thing to your spouse
and/or children, they call it "nagging". I don't think users will
like it any more than family members do.

It seems to me that we should discuss whether we want to keep syntax
such as 'a,b' and 'N(a,b)' before thinking about whether to make
quorum commit the default behavior of the 'N(a,b)' syntax. If
we plan to remove such syntax in a future release, we can live with the
current code and should document it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#125

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#124)

Re: Quorum commit for multiple synchronous replication.

On Wed, Aug 23, 2017 at 3:04 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

It seems to me that we should discuss whether we want to keep syntax
such as 'a,b' and 'N(a,b)' before thinking about whether to make
quorum commit the default behavior of the 'N(a,b)' syntax. If
we plan to remove such syntax in a future release, we can live with the
current code and should document it.

The parsing code in repl_gram.y costs essentially nothing to maintain,
so let me suggest that we just live with what we have and do nothing.
Keeping things as they are is not bad either. By changing the default,
people may have their failover flows silently trapped. So if we change
the default we will perhaps make some users happy, but I think that we
are also going to make some people angry. Silent failover issues are
no fun to debug.
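
To illustrate the failover concern, here is a minimal sketch in C. It
is not PostgreSQL's actual code: the Standby struct and both function
names are hypothetical, and it assumes a simplified model in which each
listed standby reports a single confirmed WAL position. It compares how
far commits can be released under the priority rule and under the
quorum rule for the same list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one standby listed in s_s_names. */
typedef struct Standby
{
    const char *name;       /* standby name as listed in s_s_names */
    bool        connected;  /* currently streaming from the primary? */
    uint64_t    flush_lsn;  /* last WAL position it has confirmed */
} Standby;

/*
 * Priority rule, 'FIRST n (...)': the first n connected standbys in
 * list order are the synchronous ones, and a commit is released only
 * after all of them confirm it. The released position is therefore the
 * minimum of their confirmed positions. Returns 0 if n is 0 or fewer
 * than n standbys are connected.
 */
uint64_t
released_lsn_first(const Standby *s, size_t count, size_t n)
{
    uint64_t    min_lsn = UINT64_MAX;
    size_t      chosen = 0;

    for (size_t i = 0; i < count && chosen < n; i++)
    {
        if (!s[i].connected)
            continue;
        if (s[i].flush_lsn < min_lsn)
            min_lsn = s[i].flush_lsn;
        chosen++;
    }
    return (n > 0 && chosen == n) ? min_lsn : 0;
}

/*
 * Quorum rule, 'ANY n (...)': a commit is released as soon as any n
 * connected standbys confirm it, so the released position is the n-th
 * largest confirmed position. Returns 0 if n is 0 or fewer than n
 * standbys are connected.
 */
uint64_t
released_lsn_any(const Standby *s, size_t count, size_t n)
{
    uint64_t    best = 0;

    if (n == 0)
        return 0;

    for (size_t i = 0; i < count; i++)
    {
        size_t      at_least = 0;   /* connected standbys at or beyond s[i] */

        if (!s[i].connected)
            continue;
        for (size_t j = 0; j < count; j++)
        {
            if (s[j].connected && s[j].flush_lsn >= s[i].flush_lsn)
                at_least++;
        }
        if (at_least >= n && s[i].flush_lsn > best)
            best = s[i].flush_lsn;
    }
    return best;
}
```

With standbys a, b and c at positions 90, 100 and 100 and n = 2, the
priority rule releases commits up to 90 while the quorum rule releases
commits up to 100. A failover script that always promotes a would
therefore be missing acknowledged commits if the same setting silently
started to mean quorum.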

At the end of the day, we could just add one sentence in the docs
saying the use of ANY and FIRST is encouraged over the past grammar
because they are clearer to understand.
--
Michael

#126

Josh Berkus

josh@berkus.org

over 8 years ago

In reply to: Masahiko Sawada (#124)

Re: Quorum commit for multiple synchronous replication.

On 08/22/2017 11:04 PM, Masahiko Sawada wrote:

WARNING: what you did is ok, but you might have wanted to do something else

First of all, whether or not that can properly be called a warning is
highly debatable. Also, if you do that sort of thing to your spouse
and/or children, they call it "nagging". I don't think users will
like it any more than family members do.

Realistically, we'll support the backwards-compatible syntax for 3-5
years. Which is fine.

I suggest that we just gradually deprecate the old syntax from the docs,
and then around Postgres 16 eliminate it. I posit that that's better
than changing the meaning of the old syntax out from under people.

--
Josh Berkus
Containers & Databases Oh My!

#127

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Josh Berkus (#126)

Re: Quorum commit for multiple synchronous replication.

On Thu, Aug 24, 2017 at 3:11 AM, Josh Berkus <josh@berkus.org> wrote:

On 08/22/2017 11:04 PM, Masahiko Sawada wrote:

WARNING: what you did is ok, but you might have wanted to do something else

First of all, whether or not that can properly be called a warning is
highly debatable. Also, if you do that sort of thing to your spouse
and/or children, they call it "nagging". I don't think users will
like it any more than family members do.

Realistically, we'll support the backwards-compatible syntax for 3-5
years. Which is fine.

I suggest that we just gradually deprecate the old syntax from the docs,
and then around Postgres 16 eliminate it. I posit that that's better
than changing the meaning of the old syntax out from under people.

It seems to me that nobody is strongly voting for making
quorum commit the default. Some folks suggest keeping backward
compatibility in PG10 and gradually deprecating the old syntax. And only
a warning in the docs is possible in PG10.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#128

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#127)

Re: Quorum commit for multiple synchronous replication.

On Thu, Aug 24, 2017 at 4:27 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Aug 24, 2017 at 3:11 AM, Josh Berkus <josh@berkus.org> wrote:

On 08/22/2017 11:04 PM, Masahiko Sawada wrote:

WARNING: what you did is ok, but you might have wanted to do something else

First of all, whether or not that can properly be called a warning is
highly debatable. Also, if you do that sort of thing to your spouse
and/or children, they call it "nagging". I don't think users will
like it any more than family members do.

Realistically, we'll support the backwards-compatible syntax for 3-5
years. Which is fine.

I suggest that we just gradually deprecate the old syntax from the docs,
and then around Postgres 16 eliminate it. I posit that that's better
than changing the meaning of the old syntax out from under people.

It seems to me that nobody is strongly voting for making
quorum commit the default. Some folks suggest keeping backward
compatibility in PG10 and gradually deprecating the old syntax. And only
a warning in the docs is possible in PG10.

According to the discussion so far, it seems to me that keeping
backward compatibility and adding a warning to the docs that the old
syntax could be changed or removed in a future release is the most
acceptable way for PG10. There is no objection to that so far and I
already posted a patch to add a warning to the docs [1]. I'll wait for
the committer's decision.

[1]: /messages/by-id/CAD21AoAe+oGSFi3bjZ+fW6Q=TK7avPdDCZLEr02zM_c-U0JsRA@mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
