Additional options for Sync Replication

Started by Simon Riggs almost 15 years ago (36 messages)
#1 Simon Riggs
simon@2ndQuadrant.com
1 attachment(s)

Proposed changes: Make synchronous_replication into an enum, so we
can now also say synchronous_replication = recv, flush or apply as
well as on or off.
synchronous_replication = on is the same as "flush"

Benefit: Allows 2 additional wait modes for sync rep: wait for
receive and wait for apply.
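
For example, with these values a single transaction could ask to wait
until the standby has applied its changes with something like

    SET LOCAL synchronous_replication = apply;

while leaving the server-wide default at plain on or off.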

Rationale:

I was hoping to fine-tune Sync Rep after feedback during beta, but my
understanding of the current consensus is that beta will be too late
for user-visible changes. So I'm proposing this change now, before
beta rather than during it.

Since replies from the standby now arrive when data is received, not
just when it is fsynced, we may as well do something useful with them
in this release.

So I reintroduce changes made in the original patch submission that
were split out for piece-by-piece commit. This was also part of the
originally agreed functionality for Sync Rep. So I don't expect any
procedural difficulties in accepting this patch. Heikki was right to
request removal of that code while we got things correct. Adding it
back now is much simpler.

The internal changes are minor, simply changing things so we have 3
queues rather than 1. The user backend path is identical in length, and
the sync standby path is very slightly longer. Replies from the standby
continue to be sent every 100ms until all flushed WAL has been applied,
after which they fall back to wal_receiver_status_interval; again,
a minor change.
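
In outline, each committing backend simply maps its requested wait mode
onto one of the three queues and joins that queue. A rough sketch, using
the names from the attached patch (the helper function name here is
illustrative only, not part of the patch):

/* Sketch: pick which wait queue a committing backend should join */
static int
SyncRepQueueForMode(SyncRepWaitType mode)
{
	/* SYNC_REP_NO_WAIT never reaches here: the backend returns before queueing */
	switch (mode)
	{
		case SYNC_REP_WAIT_FOR_WRITE:
			return 0;	/* released when standby reports receive/write */
		case SYNC_REP_WAIT_FOR_APPLY:
			return 2;	/* released when standby reports apply */
		case SYNC_REP_WAIT_FOR_FLUSH:
		default:
			return 1;	/* released when standby reports flush (= "on") */
	}
}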

The parameter is an enum because several people, Josh among them, asked
for a very simple "on"/"off" interface, while others requested multiple
flavours. This gives us both.

I expect "recv" mode to be even more useful in next release, with
WALWriter active during recovery.

Patch feedback please.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

syncrep_queues.v1.cpatch (application/octet-stream)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2034,2040 **** SET ENABLE_SEQSCAN TO OFF;
  
       <variablelist>
       <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
!       <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
        <indexterm>
         <primary><varname>synchronous_replication</> configuration parameter</primary>
        </indexterm>
--- 2034,2040 ----
  
       <variablelist>
       <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
!       <term><varname>synchronous_replication</varname> (<type>enum</type>)</term>
        <indexterm>
         <primary><varname>synchronous_replication</> configuration parameter</primary>
        </indexterm>
***************
*** 2063,2068 **** SET ENABLE_SEQSCAN TO OFF;
--- 2063,2081 ----
          <command>SET LOCAL synchronous_replication TO OFF</> within the
          transaction.
         </para>
+        <para>
+         In addition to <literal>on</> and <literal>off</> it is possible
+         to specify the exact level of durability required with corresponding
+         duration of wait.
+         The allowed values of <varname>synchronous_replication</> are
+         <literal>off</> do not wait for a response from standby
+         <literal>on</> commit waits for sync to disk on standby,
+         <literal>recv</> commit waits for standby to receive into memory (no sync),
+         <literal>sync</> commit waits for sync to disk on standby,
+         <literal>apply</> commit waits for standby to apply changes so that they
+         will be visible to queries on standby server.
+       </para>
+ 
        </listitem>
       </varlistentry>
  
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 63,69 ****
  #include "utils/ps_status.h"
  
  /* User-settable parameters for sync rep */
! bool	synchronous_replication = false;		/* Only set in user backends */
  char 	*SyncRepStandbyNames;
  
  #define SyncStandbysDefined() \
--- 63,69 ----
  #include "utils/ps_status.h"
  
  /* User-settable parameters for sync rep */
! SyncRepWaitType	synchronous_replication = SYNC_REP_NO_WAIT;	/* Only set in user backends */
  char 	*SyncRepStandbyNames;
  
  #define SyncStandbysDefined() \
***************
*** 71,77 **** char 	*SyncRepStandbyNames;
  
  static bool announce_next_takeover = true;
  
! static void SyncRepQueueInsert(void);
  static void SyncRepCancelWait(void);
  
  static int SyncRepGetStandbyPriority(void);
--- 71,77 ----
  
  static bool announce_next_takeover = true;
  
! static void SyncRepQueueInsert(int queue);
  static void SyncRepCancelWait(void);
  
  static int SyncRepGetStandbyPriority(void);
***************
*** 99,113 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  {
  	char 		*new_status = NULL;
  	const char *old_status;
  
  	/*
  	 * Fast exit if user has not requested sync replication, or
  	 * there are no sync replication standby names defined.
  	 * Note that those standbys don't need to be connected.
  	 */
! 	if (!SyncRepRequested() || !SyncStandbysDefined())
  		return;
  
  	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
  	Assert(WalSndCtl != NULL);
  
--- 99,130 ----
  {
  	char 		*new_status = NULL;
  	const char *old_status;
+ 	int			queue = 1;
  
  	/*
  	 * Fast exit if user has not requested sync replication, or
  	 * there are no sync replication standby names defined.
  	 * Note that those standbys don't need to be connected.
  	 */
! 	if (!SyncStandbysDefined())
  		return;
  
+ 	switch (synchronous_replication)
+ 	{
+ 		case SYNC_REP_NOT_WAITING:
+ 			return;
+ 
+ 		case SYNC_REP_WAIT_FOR_WRITE:
+ 			queue = 0;
+ 
+ 		case SYNC_REP_WAIT_FOR_APPLY:
+ 			queue = 2;
+ 
+ 		default:
+ 		case SYNC_REP_WAIT_FOR_FLUSH:
+ 			queue = 1;
+ 	}
+ 
  	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
  	Assert(WalSndCtl != NULL);
  
***************
*** 126,132 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  	 * so its likely to be a low cost check.
  	 */
  	if (!WalSndCtl->sync_standbys_defined ||
! 		XLByteLE(XactCommitLSN, WalSndCtl->lsn))
  	{
  		LWLockRelease(SyncRepLock);
  		return;
--- 143,149 ----
  	 * so its likely to be a low cost check.
  	 */
  	if (!WalSndCtl->sync_standbys_defined ||
! 		XLByteLE(XactCommitLSN, WalSndCtl->lsn[queue]))
  	{
  		LWLockRelease(SyncRepLock);
  		return;
***************
*** 138,144 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  	 */
  	MyProc->waitLSN = XactCommitLSN;
  	MyProc->syncRepState = SYNC_REP_WAITING;
! 	SyncRepQueueInsert();
  	Assert(SyncRepQueueIsOrderedByLSN());
  	LWLockRelease(SyncRepLock);
  
--- 155,161 ----
  	 */
  	MyProc->waitLSN = XactCommitLSN;
  	MyProc->syncRepState = SYNC_REP_WAITING;
! 	SyncRepQueueInsert(queue);
  	Assert(SyncRepQueueIsOrderedByLSN());
  	LWLockRelease(SyncRepLock);
  
***************
*** 273,284 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
   * here out of order, so start at tail and work back to insertion point.
   */
  static void
! SyncRepQueueInsert(void)
  {
  	PGPROC	*proc;
  
! 	proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
! 								   &(WalSndCtl->SyncRepQueue),
  								   offsetof(PGPROC, syncRepLinks));
  
  	while (proc)
--- 290,301 ----
   * here out of order, so start at tail and work back to insertion point.
   */
  static void
! SyncRepQueueInsert(int queue)
  {
  	PGPROC	*proc;
  
! 	proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[queue]),
! 								   &(WalSndCtl->SyncRepQueue[queue]),
  								   offsetof(PGPROC, syncRepLinks));
  
  	while (proc)
***************
*** 290,296 **** SyncRepQueueInsert(void)
  		if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
  			break;
  
! 		proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
  									   &(proc->syncRepLinks),
  									   offsetof(PGPROC, syncRepLinks));
  	}
--- 307,313 ----
  		if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
  			break;
  
! 		proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[queue]),
  									   &(proc->syncRepLinks),
  									   offsetof(PGPROC, syncRepLinks));
  	}
***************
*** 298,304 **** SyncRepQueueInsert(void)
  	if (proc)
  		SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
  	else
! 		SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue), &(MyProc->syncRepLinks));
  }
  
  /*
--- 315,321 ----
  	if (proc)
  		SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
  	else
! 		SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue[queue]), &(MyProc->syncRepLinks));
  }
  
  /*
***************
*** 421,442 **** SyncRepReleaseWaiters(void)
  		return;
  	}
  
! 	if (XLByteLT(walsndctl->lsn, MyWalSnd->flush))
! 	{
! 		/*
! 		 * Set the lsn first so that when we wake backends they will
! 		 * release up to this location.
! 		 */
! 		walsndctl->lsn = MyWalSnd->flush;
! 		numprocs = SyncRepWakeQueue(false);
! 	}
  
! 	LWLockRelease(SyncRepLock);
  
! 	elog(DEBUG3, "released %d procs up to %X/%X",
! 					numprocs,
! 					MyWalSnd->flush.xlogid,
! 					MyWalSnd->flush.xrecoff);
  
  	/*
  	 * If we are managing the highest priority standby, though we weren't
--- 438,457 ----
  		return;
  	}
  
! 	/*
! 	 * Set the lsn first so that when we wake backends they will
! 	 * release up to this location.
! 	 */
! 	if (XLByteLT(walsndctl->lsn[0], MyWalSnd->write))
! 		walsndctl->lsn[0] = MyWalSnd->write;
! 	if (XLByteLT(walsndctl->lsn[1], MyWalSnd->flush))
! 		walsndctl->lsn[1] = MyWalSnd->flush;
! 	if (XLByteLT(walsndctl->lsn[2], MyWalSnd->apply))
! 		walsndctl->lsn[2] = MyWalSnd->apply;
  
! 	numprocs = SyncRepWakeQueue(false);
  
! 	LWLockRelease(SyncRepLock);
  
  	/*
  	 * If we are managing the highest priority standby, though we weren't
***************
*** 515,563 **** SyncRepWakeQueue(bool all)
  	PGPROC	*proc = NULL;
  	PGPROC	*thisproc = NULL;
  	int		numprocs = 0;
  
  	Assert(SyncRepQueueIsOrderedByLSN());
  
! 	proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! 								   &(WalSndCtl->SyncRepQueue),
! 								   offsetof(PGPROC, syncRepLinks));
! 
! 	while (proc)
  	{
! 		/*
! 		 * Assume the queue is ordered by LSN
! 		 */
! 		if (!all && XLByteLT(walsndctl->lsn, proc->waitLSN))
! 			return numprocs;
! 
! 		/*
! 		 * Move to next proc, so we can delete thisproc from the queue.
! 		 * thisproc is valid, proc may be NULL after this.
! 		 */
! 		thisproc = proc;
! 		proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! 									   &(proc->syncRepLinks),
  									   offsetof(PGPROC, syncRepLinks));
  
! 		/*
! 		 * Set state to complete; see SyncRepWaitForLSN() for discussion
! 		 * of the various states.
! 		 */
! 		thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
! 
! 		/*
! 		 * Remove thisproc from queue.
! 		 */
! 		SHMQueueDelete(&(thisproc->syncRepLinks));
! 
! 		/*
! 		 * Wake only when we have set state and removed from queue.
! 		 */
! 		Assert(SHMQueueIsDetached(&(thisproc->syncRepLinks)));
! 		Assert(thisproc->syncRepState == SYNC_REP_WAIT_COMPLETE);
! 		SetLatch(&(thisproc->waitLatch));
! 
! 		numprocs++;
  	}
  
  	return numprocs;
--- 530,582 ----
  	PGPROC	*proc = NULL;
  	PGPROC	*thisproc = NULL;
  	int		numprocs = 0;
+ 	int		i;
  
  	Assert(SyncRepQueueIsOrderedByLSN());
  
! 	for (i = 0; i < MAX_SYNC_REP_QUEUES; i++)
  	{
! 		proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[i]),
! 									   &(WalSndCtl->SyncRepQueue[i]),
  									   offsetof(PGPROC, syncRepLinks));
  
! 		while (proc)
! 		{
! 			/*
! 			 * Assume the queue is ordered by LSN
! 			 */
! 			if (!all && XLByteLT(walsndctl->lsn[i], proc->waitLSN))
! 				return numprocs;
! 
! 			/*
! 			 * Move to next proc, so we can delete thisproc from the queue.
! 			 * thisproc is valid, proc may be NULL after this.
! 			 */
! 			thisproc = proc;
! 			proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[i]),
! 										   &(proc->syncRepLinks),
! 										   offsetof(PGPROC, syncRepLinks));
! 
! 			/*
! 			 * Set state to complete; see SyncRepWaitForLSN() for discussion
! 			 * of the various states.
! 			 */
! 			thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
! 
! 			/*
! 			 * Remove thisproc from queue.
! 			 */
! 			SHMQueueDelete(&(thisproc->syncRepLinks));
! 
! 			/*
! 			 * Wake only when we have set state and removed from queue.
! 			 */
! 			Assert(SHMQueueIsDetached(&(thisproc->syncRepLinks)));
! 			Assert(thisproc->syncRepState == SYNC_REP_WAIT_COMPLETE);
! 			SetLatch(&(thisproc->waitLatch));
! 
! 			numprocs++;
! 		}
  	}
  
  	return numprocs;
***************
*** 606,633 **** SyncRepQueueIsOrderedByLSN(void)
  {
  	PGPROC	*proc = NULL;
  	XLogRecPtr lastLSN;
  
! 	lastLSN.xlogid = 0;
! 	lastLSN.xrecoff = 0;
! 
! 	proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! 								   &(WalSndCtl->SyncRepQueue),
! 								   offsetof(PGPROC, syncRepLinks));
! 
! 	while (proc)
  	{
! 		/*
! 		 * Check the queue is ordered by LSN and that multiple
! 		 * procs don't have matching LSNs
! 		 */
! 		if (XLByteLE(proc->waitLSN, lastLSN))
! 			return false;
  
! 		lastLSN = proc->waitLSN;
! 
! 		proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! 									   &(proc->syncRepLinks),
  									   offsetof(PGPROC, syncRepLinks));
  	}
  
  	return true;
--- 625,656 ----
  {
  	PGPROC	*proc = NULL;
  	XLogRecPtr lastLSN;
+ 	int		i;
  
! 	for (i = 0; i < MAX_SYNC_REP_QUEUES; i++)
  	{
! 		lastLSN.xlogid = 0;
! 		lastLSN.xrecoff = 0;
  
! 		proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[i]),
! 									   &(WalSndCtl->SyncRepQueue[i]),
  									   offsetof(PGPROC, syncRepLinks));
+ 
+ 		while (proc)
+ 		{
+ 			/*
+ 			 * Check the queue is ordered by LSN and that multiple
+ 			 * procs don't have matching LSNs
+ 			 */
+ 			if (XLByteLE(proc->waitLSN, lastLSN))
+ 				return false;
+ 
+ 			lastLSN = proc->waitLSN;
+ 
+ 			proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[i]),
+ 										   &(proc->syncRepLinks),
+ 										   offsetof(PGPROC, syncRepLinks));
+ 		}
  	}
  
  	return true;
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 610,623 **** XLogWalRcvSendReply(void)
  	/*
  	 * We can compare the write and flush positions to the last message we
  	 * sent without taking any lock, but the apply position requires a spin
! 	 * lock, so we don't check that unless something else has changed or 10
! 	 * seconds have passed.  This means that the apply log position will
! 	 * appear, from the master's point of view, to lag slightly, but since
! 	 * this is only for reporting purposes and only on idle systems, that's
! 	 * probably OK.
  	 */
  	if (XLByteEQ(reply_message.write, LogstreamResult.Write)
  		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
  		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
  			wal_receiver_status_interval * 1000))
  		return;
--- 610,621 ----
  	/*
  	 * We can compare the write and flush positions to the last message we
  	 * sent without taking any lock, but the apply position requires a spin
! 	 * lock, so we just check whether the last message showed that all
! 	 * flushed WAL data has been applied.
  	 */
  	if (XLByteEQ(reply_message.write, LogstreamResult.Write)
  		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
+ 		&& XLByteEQ(reply_message.flush, reply_message.apply)
  		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
  			wal_receiver_status_interval * 1000))
  		return;
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 1247,1253 **** WalSndShmemInit(void)
  		/* First time through, so initialize */
  		MemSet(WalSndCtl, 0, WalSndShmemSize());
  
! 		SHMQueueInit(&(WalSndCtl->SyncRepQueue));
  
  		for (i = 0; i < max_wal_senders; i++)
  		{
--- 1247,1256 ----
  		/* First time through, so initialize */
  		MemSet(WalSndCtl, 0, WalSndShmemSize());
  
! 		for (i = 0; i < MAX_SYNC_REP_QUEUES; i++)
! 		{
! 			SHMQueueInit(&(WalSndCtl->SyncRepQueue[i]));
! 		}
  
  		for (i = 0; i < max_wal_senders; i++)
  		{
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 755,768 **** static struct config_bool ConfigureNamesBool[] =
  		true, NULL, NULL
  	},
  	{
- 		{"synchronous_replication", PGC_USERSET, WAL_REPLICATION,
- 			gettext_noop("Requests synchronous replication."),
- 			NULL
- 		},
- 		&synchronous_replication,
- 		false, NULL, NULL
- 	},
- 	{
  		{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
  			gettext_noop("Continues processing past damaged page headers."),
  			gettext_noop("Detection of a damaged page header normally causes PostgreSQL to "
--- 755,760 ----
***************
*** 2818,2823 **** static struct config_enum ConfigureNamesEnum[] =
--- 2810,2825 ----
  	},
  
  	{
+ 		{"synchronous_replication", PGC_USERSET, WAL_REPLICATION,
+ 			gettext_noop("Requests synchronous replication."),
+ 			NULL
+ 		},
+ 		&synchronous_replication,
+ 		SYNC_REP_NO_WAIT, synchronous_replication_options,
+ 		NULL, NULL
+ 	},
+ 
+ 	{
  		{"default_transaction_isolation", PGC_USERSET, CLIENT_CONN_STATEMENT,
  			gettext_noop("Sets the transaction isolation level of each new transaction."),
  			NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 186,192 ****
  
  # - Replication - User Settings
  
! #synchronous_replication = off		# does commit wait for reply from standby
  
  # - Streaming Replication - Server Settings
  
--- 186,196 ----
  
  # - Replication - User Settings
  
! #synchronous_replication = off	# commit wait for reply from standby
! 								# valid values: on, off, recv, sync, apply
! 								# recv = in memory on standby
! 								# sync = flushed to disk on standby (= on)
! 								# apply = change applied and ready for query
  
  # - Streaming Replication - Server Settings
  
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 20,34 ****
  #include "utils/guc.h"
  
  #define SyncRepRequested() \
! 	(synchronous_replication && max_wal_senders > 0)
  
  /* syncRepState */
  #define SYNC_REP_NOT_WAITING		0
  #define SYNC_REP_WAITING			1
  #define SYNC_REP_WAIT_COMPLETE		2
  
  /* user-settable parameters for synchronous replication */
! extern bool synchronous_replication;
  extern char *SyncRepStandbyNames;
  
  /* called by user backend */
--- 20,65 ----
  #include "utils/guc.h"
  
  #define SyncRepRequested() \
! 	(synchronous_replication >= 1 && max_wal_senders > 0)
  
  /* syncRepState */
  #define SYNC_REP_NOT_WAITING		0
  #define SYNC_REP_WAITING			1
  #define SYNC_REP_WAIT_COMPLETE		2
  
+ typedef enum SyncRepWaitType
+ {
+ 	SYNC_REP_NO_WAIT,			/* async replication */
+ 	SYNC_REP_WAIT_FOR_WRITE,	/* sync rep, wait for write of data on standby */
+ 	SYNC_REP_WAIT_FOR_FLUSH,	/* sync rep, wait for fsync of data on standby */
+ 	SYNC_REP_WAIT_FOR_APPLY,	/* sync rep, wait for apply of data on standby */
+ } SyncRepWaitType;
+ 
+ /*
+  * Although only "on", "off", "async", "write", "fsync" and "apply" are documented, we
+  * accept all the likely variants of "on" and "off".
+  */
+ static const struct config_enum_entry synchronous_replication_options[] = {
+ 	{"async", SYNC_REP_NO_WAIT, false},
+ 	{"recv", SYNC_REP_WAIT_FOR_WRITE, false},
+ 	{"sync", SYNC_REP_WAIT_FOR_FLUSH, false},
+ 	{"apply", SYNC_REP_WAIT_FOR_APPLY, false},
+ 	{"on", SYNC_REP_WAIT_FOR_FLUSH, false},
+ 	{"off", SYNC_REP_NO_WAIT, false},
+ 	{"true", SYNC_REP_WAIT_FOR_FLUSH, true},
+ 	{"false", SYNC_REP_NO_WAIT, true},
+ 	{"yes", SYNC_REP_WAIT_FOR_FLUSH, true},
+ 	{"no", SYNC_REP_NO_WAIT, true},
+ 	{"1", SYNC_REP_WAIT_FOR_FLUSH, true},
+ 	{"0", SYNC_REP_NO_WAIT, true},
+ 	{NULL, 0, false}
+ };
+ 
+ /* sync replication has separate queues for write, fsync and apply */
+ #define MAX_SYNC_REP_QUEUES			3
+ 
  /* user-settable parameters for synchronous replication */
! extern SyncRepWaitType synchronous_replication;
  extern char *SyncRepStandbyNames;
  
  /* called by user backend */
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 68,82 **** extern WalSnd *MyWalSnd;
  typedef struct
  {
  	/*
! 	 * Synchronous replication queue. Protected by SyncRepLock.
  	 */
! 	SHM_QUEUE SyncRepQueue;
  
  	/*
  	 * Current location of the head of the queue. All waiters should have
  	 * a waitLSN that follows this value. Protected by SyncRepLock.
  	 */
! 	XLogRecPtr	lsn;
  
  	/*
  	 * Are any sync standbys defined?  Waiting backends can't reload the
--- 68,82 ----
  typedef struct
  {
  	/*
! 	 * Synchronous replication queues. Protected by SyncRepLock.
  	 */
! 	SHM_QUEUE SyncRepQueue[MAX_SYNC_REP_QUEUES];
  
  	/*
  	 * Current location of the head of the queue. All waiters should have
  	 * a waitLSN that follows this value. Protected by SyncRepLock.
  	 */
! 	XLogRecPtr	lsn[MAX_SYNC_REP_QUEUES];
  
  	/*
  	 * Are any sync standbys defined?  Waiting backends can't reload the
#2 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#1)
Re: Additional options for Sync Replication

On Sun, Mar 27, 2011 at 5:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I was hoping to fine-tune Sync Rep after feedback during beta, but my
understanding of the current consensus is that beta will be too late
for user-visible changes. So I'm proposing this change now, before
beta rather than during it.

This is completely inappropriate. The deadline for new feature
patches has long since passed. It was January 15th. The discussion
you are referring to had to do with fixing the behavior of features we
already have, not adding new ones.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Greg Stark
gsstark@mit.edu
In reply to: Simon Riggs (#1)
Re: Additional options for Sync Replication

On Sun, Mar 27, 2011 at 10:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I was hoping to fine-tune Sync Rep after feedback during beta, but my
understanding of the current consensus is that beta will be too late
for user-visible changes. So I'm proposing this change now, before
beta rather than during it.

For what it's worth I think this is a simplification. We have:

1) Development when new features are added

2) Feature freeze - when those features are tweaked and fixed based on
our own testing but no new features added

3) Beta - when features are tweaked and fixed in response to user
suggestions but no new features added

4) Release - when only bugs are fixed

So the question becomes, what is a new feature versus a behaviour
change to an existing feature or a bug-fix. These are subjective
questions with a lot of room for ambiguity. Is this a new feature? Is
this existing code broken or unusable in some way that we don't want
to release and have to support?

We're not in a vacuum here and we all see Tom reworking major portions
of the collation patch while you're being told you can't squeak this
relatively small feature in. But the collation behaviour is something
we'll have to deal with and support for years and will have trouble
changing in subsequent versions without compatibility issues. This
behaviour is clearly something that can be added onto in subsequent
versions and the existing feature set is usable as it is.

Also, for what it's worth I prefer thinking of
synchronous_commit/synchronous_replication as one big multi-way
variable:

synchronous_commit = memory | disk | replica-memory | replica-disk |
replica-visible
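
So a single transaction could then say something like

    SET LOCAL synchronous_commit = 'replica-disk';

and get exactly the durability level it wants.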

--
greg

#4 Simon Riggs
simon@2ndquadrant.com
In reply to: Robert Haas (#2)
Re: Additional options for Sync Replication

On Sun, Mar 27, 2011 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 27, 2011 at 5:45 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I was hoping to fine-tune Sync Rep after feedback during beta, but my
understanding of the current consensus is that beta will be too late
for user-visible changes. So I'm proposing this change now, before
beta rather than during it.

This is completely inappropriate.  The deadline for new feature
patches has long since passed.  It was January 15th.  The discussion
you are referring to had to do with fixing the behavior of features we
already have, not adding new ones.

You skipped the bit that said this code was part of the original patch
submission, so this code met your deadline. That was stated clearly in
the next paragraph of my email.

It's also a minor change and one that others had agreed we want.

You have no basis on which to prevent this.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#5 Simon Riggs
simon@2ndquadrant.com
In reply to: Greg Stark (#3)
Re: Additional options for Sync Replication

On Sun, Mar 27, 2011 at 11:55 PM, Greg Stark <gsstark@mit.edu> wrote:

Also, for what it's worth I prefer thinking of
synchronous_commit/synchronous_replication as one big multi-way
variable:

synchronous_commit  = memory | disk | replica-memory | replica-disk |
replica-visible

That's close enough to my thinking for me to say yes to.

I'd prefer to call it "commit_durability" though.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#4)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 10:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

You have no basis on which to prevent this.

It's also already on the Open Items list, put there by you.

Why is this even a discussion point?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#6)
Re: Additional options for Sync Replication

On 28.03.2011 13:14, Simon Riggs wrote:

On Mon, Mar 28, 2011 at 10:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

You have no basis on which to prevent this.

It's also already on the Open Items list, put there by you.

Why is this even a discussion point?

FWIW, I agree this is an additional feature that we shouldn't be messing
with at this point in the release cycle.

The 'apply' mode would be quite interesting, it would make it easier to
build load-balancing clusters. But the patch isn't up to the task on
that yet - the 'apply' status report is only sent after
wal_receiver_status_interval fills up, so you get long delays.

PS. you're missing some break's in the switch-statement in
SyncRepWaitForLSN(), effectively making the patch do nothing.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#6)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 6:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, Mar 28, 2011 at 10:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

You have no basis on which to prevent this.

It's also already on the Open Items list, put there by you.

Huh? There is an open item about whether we should merge
synchronous_commit and synchronous_replication into a single GUC,
which might be a better interface since it would typically give the
user one less thing to have to fiddle with, but there is certainly no
open item for adding additional sync rep modes. Merging those two
GUCs is a reasonable thing to consider even at this late date, because
once we cut beta - and certainly once we cut final - we'll be stuck
supporting whatever interface we release for 5+ years. There is no
similar risk for this patch - the only risk of not committing this
feature now is that we won't have this feature in 9.1. But if we
accept that as valid justification for committing it, then everything
is fair game, and that is not going to lead to a timely release.

Why is this even a discussion point?

Adding more new code is likely to add more new bugs, and we're trying
not to do that right now, so we can make a release.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Simon Riggs
simon@2ndquadrant.com
In reply to: Heikki Linnakangas (#7)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 12:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 28.03.2011 13:14, Simon Riggs wrote:

On Mon, Mar 28, 2011 at 10:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

You have no basis on which to prevent this.

It's also already on the Open Items list, put there by you.

Why is this even a discussion point?

FWIW, I agree this is an additional feature that we shouldn't be messing
with at this point in the release cycle.

There is an open item about what the UI is for sync commit/sync rep,
which is the subject of this patch.

The 'apply' mode would be quite interesting, it would make it easier to
build load-balancing clusters. But the patch isn't up to the task on that
yet - the 'apply' status report is only sent after
wal_receiver_status_interval fills up, so you get long delays.

Yes it's up to the task, you misread it. It will continue sending
replies while apply <> flush and then it will fall back to the
behaviour you mention.

PS. you're missing some break's in the switch-statement in
SyncRepWaitForLSN(), effectively making the patch do nothing.

OK, thanks.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#5)
1 attachment(s)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 10:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Sun, Mar 27, 2011 at 11:55 PM, Greg Stark <gsstark@mit.edu> wrote:

Also, for what it's worth I prefer thinking of
synchronous_commit/synchronous_replication as one big multi-way
variable:

synchronous_commit  = memory | disk | replica-memory | replica-disk |
replica-visible

That's close enough to my thinking for me to say yes to.

Patch to implement this new UI attached, exactly as you suggest above.

Words can be changed easily without changing the music.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

syncrep_ui.v2.cpatch (application/octet-stream)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1505,1512 **** SET ENABLE_SEQSCAN TO OFF;
        </indexterm>
        <listitem>
         <para>
!         Specifies whether transaction commit will wait for WAL records
!         to be written to disk before the command returns a <quote>success</>
          indication to the client.  The default, and safe, setting is
          <literal>on</>.  When <literal>off</>, there can be a delay between
          when success is reported to the client and when the transaction is
--- 1505,1512 ----
        </indexterm>
        <listitem>
         <para>
!         Specifies the level of durability for transaction commit, defining
!         what actions must take place before the command returns a <quote>success</>
          indication to the client.  The default, and safe, setting is
          <literal>on</>.  When <literal>off</>, there can be a delay between
          when success is reported to the client and when the transaction is
***************
*** 1523,1528 **** SET ENABLE_SEQSCAN TO OFF;
--- 1523,1558 ----
          discussion see <xref linkend="wal-async-commit">.
         </para>
         <para>
+         In addition to <literal>on</> and <literal>off</> it is possible
+         to specify one of five different levels of durability.
+         The allowed values of <varname>synchronous_replication</> are
+         <literal>off</> do not wait for a local flush to disk.
+         <literal>on</> wait for flush to disk, as desribed above.
+         <literal>memory</> is a synonym for <literal>off</>.
+         <literal>disk</> is a synonym for <literal>on</>.
+         <literal>replica-memory</> commit waits for standby to receive into memory (no sync),
+         <literal>replica-disk</> commit waits for sync to disk on standby,
+         <literal>replica-visible</> commit waits for standby to apply changes so that they
+         will be visible to queries on standby server.
+        </para>
+        <para>
+         Specifying <literal>replica-memory</>, <literal>replica-disk</> or
+         <literal>replica-visible</> means that there will be a delay while
+         the clients waits for confirmation of successful replication. That
+         delay will increase depending upon the physical distance and network
+         activity between primary and standby. The commit wait will last until a
+         reply from the current synchronous standby indicates the requested
+         level of confirmation. <literal>replica-disk</> is safest, while
+         <literal>replica-memory</> can be a little faster.
+         <literal>replica-visible</> allows a more consistent cluster
+         from the application perspective and is useful for load balanced
+         clsuters.
+         The replica settings of this parameter have no effect if
+         <xref linkend="guc-synchronous-standby-names"> is empty or
+         <xref linkend="guc-max-wal-senders"> is zero. In that case the
+         parameter is taken to mean the same as <literal>disk</>.
+        </para>
+        <para>
          This parameter can be changed at any time; the behavior for any
          one transaction is determined by the setting in effect when it
          commits.  It is therefore possible, and useful, to have some
***************
*** 2033,2071 **** SET ENABLE_SEQSCAN TO OFF;
       </para>
  
       <variablelist>
-      <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
-       <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
-       <indexterm>
-        <primary><varname>synchronous_replication</> configuration parameter</primary>
-       </indexterm>
-       <listitem>
-        <para>
-         Specifies whether transaction commit will wait for WAL records
-         to be replicated before the command returns a <quote>success</>
-         indication to the client.  The default setting is <literal>off</>.
-         When <literal>on</>, there will be a delay while the client waits
-         for confirmation of successful replication. That delay will
-         increase depending upon the physical distance and network activity
-         between primary and standby. The commit wait will last until a
-         reply from the current synchronous standby indicates it has written
-         the commit record of the transaction to durable storage.  This
-         parameter has no effect if
-         <xref linkend="guc-synchronous-standby-names"> is empty or
-         <xref linkend="guc-max-wal-senders"> is zero.
-        </para>
-        <para>
-         This parameter can be changed at any time; the
-         behavior for any one transaction is determined by the setting in
-         effect when it commits.  It is therefore possible, and useful, to have
-         some transactions replicate synchronously and others asynchronously.
-         For example, to make a single multistatement transaction commit
-         asynchronously when the default is synchronous replication, issue
-         <command>SET LOCAL synchronous_replication TO OFF</> within the
-         transaction.
-        </para>
-       </listitem>
-      </varlistentry>
- 
       <varlistentry id="guc-synchronous-standby-names" xreflabel="synchronous_standby_names">
        <term><varname>synchronous_standby_names</varname> (<type>string</type>)</term>
        <indexterm>
--- 2063,2068 ----
***************
*** 2100,2106 **** SET ENABLE_SEQSCAN TO OFF;
          If a standby is removed from the list of servers then it will stop
          being the synchronous standby, allowing another to take its place.
          If the list is empty, synchronous replication will not be
!         possible, whatever the setting of <varname>synchronous_replication</>.
          Standbys may also be added to the list without restarting the server.
         </para>
        </listitem>
--- 2097,2103 ----
          If a standby is removed from the list of servers then it will stop
          being the synchronous standby, allowing another to take its place.
          If the list is empty, synchronous replication will not be
!         possible, whatever the setting of <varname>synchronous_commit</>.
          Standbys may also be added to the list without restarting the server.
         </para>
        </listitem>
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 68,75 **** bool		XactReadOnly;
  bool		DefaultXactDeferrable = false;
  bool		XactDeferrable;
  
- bool		XactSyncCommit = true;
- 
  int			CommitDelay = 0;	/* precommit delay in microseconds */
  int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
  
--- 68,73 ----
***************
*** 1056,1062 **** RecordTransactionCommit(void)
  	 * if all to-be-deleted tables are temporary though, since they are lost
  	 * anyway if we crash.)
  	 */
! 	if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
  	{
  		/*
  		 * Synchronous commit case:
--- 1054,1061 ----
  	 * if all to-be-deleted tables are temporary though, since they are lost
  	 * anyway if we crash.)
  	 */
! 	if ((wrote_xlog && synchronous_commit >= SYNC_COMMIT_LOCAL_FLUSH) ||
! 		forceSyncCommit || nrels > 0)
  	{
  		/*
  		 * Synchronous commit case:
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 63,69 ****
  #include "utils/ps_status.h"
  
  /* User-settable parameters for sync rep */
! bool	synchronous_replication = false;		/* Only set in user backends */
  char 	*SyncRepStandbyNames;
  
  #define SyncStandbysDefined() \
--- 63,69 ----
  #include "utils/ps_status.h"
  
  /* User-settable parameters for sync rep */
! int	synchronous_commit = SYNC_COMMIT_LOCAL_FLUSH;	/* Only set in user backends */
  char 	*SyncRepStandbyNames;
  
  #define SyncStandbysDefined() \
***************
*** 71,77 **** char 	*SyncRepStandbyNames;
  
  static bool announce_next_takeover = true;
  
! static void SyncRepQueueInsert(void);
  static void SyncRepCancelWait(void);
  
  static int SyncRepGetStandbyPriority(void);
--- 71,77 ----
  
  static bool announce_next_takeover = true;
  
! static void SyncRepQueueInsert(int queue);
  static void SyncRepCancelWait(void);
  
  static int SyncRepGetStandbyPriority(void);
***************
*** 105,111 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  	 * there are no sync replication standby names defined.
  	 * Note that those standbys don't need to be connected.
  	 */
! 	if (!SyncRepRequested() || !SyncStandbysDefined())
  		return;
  
  	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
--- 105,112 ----
  	 * there are no sync replication standby names defined.
  	 * Note that those standbys don't need to be connected.
  	 */
! 	if (!SyncStandbysDefined() || max_wal_senders == 0 ||
! 		synchronous_commit < SYNC_COMMIT_REMOTE_WRITE)
  		return;
  
  	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
***************
*** 126,132 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  	 * so its likely to be a low cost check.
  	 */
  	if (!WalSndCtl->sync_standbys_defined ||
! 		XLByteLE(XactCommitLSN, WalSndCtl->lsn))
  	{
  		LWLockRelease(SyncRepLock);
  		return;
--- 127,133 ----
  	 * so its likely to be a low cost check.
  	 */
  	if (!WalSndCtl->sync_standbys_defined ||
! 		XLByteLE(XactCommitLSN, WalSndCtl->lsn[synchronous_commit]))
  	{
  		LWLockRelease(SyncRepLock);
  		return;
***************
*** 138,144 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  	 */
  	MyProc->waitLSN = XactCommitLSN;
  	MyProc->syncRepState = SYNC_REP_WAITING;
! 	SyncRepQueueInsert();
  	Assert(SyncRepQueueIsOrderedByLSN());
  	LWLockRelease(SyncRepLock);
  
--- 139,145 ----
  	 */
  	MyProc->waitLSN = XactCommitLSN;
  	MyProc->syncRepState = SYNC_REP_WAITING;
! 	SyncRepQueueInsert(synchronous_commit);
  	Assert(SyncRepQueueIsOrderedByLSN());
  	LWLockRelease(SyncRepLock);
  
***************
*** 273,284 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
   * here out of order, so start at tail and work back to insertion point.
   */
  static void
! SyncRepQueueInsert(void)
  {
  	PGPROC	*proc;
  
! 	proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
! 								   &(WalSndCtl->SyncRepQueue),
  								   offsetof(PGPROC, syncRepLinks));
  
  	while (proc)
--- 274,285 ----
   * here out of order, so start at tail and work back to insertion point.
   */
  static void
! SyncRepQueueInsert(int queue)
  {
  	PGPROC	*proc;
  
! 	proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[queue]),
! 								   &(WalSndCtl->SyncRepQueue[queue]),
  								   offsetof(PGPROC, syncRepLinks));
  
  	while (proc)
***************
*** 290,296 **** SyncRepQueueInsert(void)
  		if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
  			break;
  
! 		proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
  									   &(proc->syncRepLinks),
  									   offsetof(PGPROC, syncRepLinks));
  	}
--- 291,297 ----
  		if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
  			break;
  
! 		proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[queue]),
  									   &(proc->syncRepLinks),
  									   offsetof(PGPROC, syncRepLinks));
  	}
***************
*** 298,304 **** SyncRepQueueInsert(void)
  	if (proc)
  		SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
  	else
! 		SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue), &(MyProc->syncRepLinks));
  }
  
  /*
--- 299,305 ----
  	if (proc)
  		SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
  	else
! 		SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue[queue]), &(MyProc->syncRepLinks));
  }
  
  /*
***************
*** 421,442 **** SyncRepReleaseWaiters(void)
  		return;
  	}
  
! 	if (XLByteLT(walsndctl->lsn, MyWalSnd->flush))
! 	{
! 		/*
! 		 * Set the lsn first so that when we wake backends they will
! 		 * release up to this location.
! 		 */
! 		walsndctl->lsn = MyWalSnd->flush;
! 		numprocs = SyncRepWakeQueue(false);
! 	}
  
! 	LWLockRelease(SyncRepLock);
  
! 	elog(DEBUG3, "released %d procs up to %X/%X",
! 					numprocs,
! 					MyWalSnd->flush.xlogid,
! 					MyWalSnd->flush.xrecoff);
  
  	/*
  	 * If we are managing the highest priority standby, though we weren't
--- 422,441 ----
  		return;
  	}
  
! 	/*
! 	 * Set the lsn first so that when we wake backends they will
! 	 * release up to this location.
! 	 */
! 	if (XLByteLT(walsndctl->lsn[0], MyWalSnd->write))
! 		walsndctl->lsn[0] = MyWalSnd->write;
! 	if (XLByteLT(walsndctl->lsn[1], MyWalSnd->flush))
! 		walsndctl->lsn[1] = MyWalSnd->flush;
! 	if (XLByteLT(walsndctl->lsn[2], MyWalSnd->apply))
! 		walsndctl->lsn[2] = MyWalSnd->apply;
  
! 	numprocs = SyncRepWakeQueue(false);
  
! 	LWLockRelease(SyncRepLock);
  
  	/*
  	 * If we are managing the highest priority standby, though we weren't
***************
*** 515,563 **** SyncRepWakeQueue(bool all)
  	PGPROC	*proc = NULL;
  	PGPROC	*thisproc = NULL;
  	int		numprocs = 0;
  
  	Assert(SyncRepQueueIsOrderedByLSN());
  
! 	proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! 								   &(WalSndCtl->SyncRepQueue),
! 								   offsetof(PGPROC, syncRepLinks));
! 
! 	while (proc)
  	{
! 		/*
! 		 * Assume the queue is ordered by LSN
! 		 */
! 		if (!all && XLByteLT(walsndctl->lsn, proc->waitLSN))
! 			return numprocs;
! 
! 		/*
! 		 * Move to next proc, so we can delete thisproc from the queue.
! 		 * thisproc is valid, proc may be NULL after this.
! 		 */
! 		thisproc = proc;
! 		proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! 									   &(proc->syncRepLinks),
  									   offsetof(PGPROC, syncRepLinks));
  
! 		/*
! 		 * Set state to complete; see SyncRepWaitForLSN() for discussion
! 		 * of the various states.
! 		 */
! 		thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
! 
! 		/*
! 		 * Remove thisproc from queue.
! 		 */
! 		SHMQueueDelete(&(thisproc->syncRepLinks));
! 
! 		/*
! 		 * Wake only when we have set state and removed from queue.
! 		 */
! 		Assert(SHMQueueIsDetached(&(thisproc->syncRepLinks)));
! 		Assert(thisproc->syncRepState == SYNC_REP_WAIT_COMPLETE);
! 		SetLatch(&(thisproc->waitLatch));
! 
! 		numprocs++;
  	}
  
  	return numprocs;
--- 514,566 ----
  	PGPROC	*proc = NULL;
  	PGPROC	*thisproc = NULL;
  	int		numprocs = 0;
+ 	int		i;
  
  	Assert(SyncRepQueueIsOrderedByLSN());
  
! 	for (i = 0; i < MAX_SYNC_REP_QUEUES; i++)
  	{
! 		proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[i]),
! 									   &(WalSndCtl->SyncRepQueue[i]),
  									   offsetof(PGPROC, syncRepLinks));
  
! 		while (proc)
! 		{
! 			/*
! 			 * Assume the queue is ordered by LSN
! 			 */
! 			if (!all && XLByteLT(walsndctl->lsn[i], proc->waitLSN))
! 				return numprocs;
! 
! 			/*
! 			 * Move to next proc, so we can delete thisproc from the queue.
! 			 * thisproc is valid, proc may be NULL after this.
! 			 */
! 			thisproc = proc;
! 			proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[i]),
! 										   &(proc->syncRepLinks),
! 										   offsetof(PGPROC, syncRepLinks));
! 
! 			/*
! 			 * Set state to complete; see SyncRepWaitForLSN() for discussion
! 			 * of the various states.
! 			 */
! 			thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
! 
! 			/*
! 			 * Remove thisproc from queue.
! 			 */
! 			SHMQueueDelete(&(thisproc->syncRepLinks));
! 
! 			/*
! 			 * Wake only when we have set state and removed from queue.
! 			 */
! 			Assert(SHMQueueIsDetached(&(thisproc->syncRepLinks)));
! 			Assert(thisproc->syncRepState == SYNC_REP_WAIT_COMPLETE);
! 			SetLatch(&(thisproc->waitLatch));
! 
! 			numprocs++;
! 		}
  	}
  
  	return numprocs;
***************
*** 606,633 **** SyncRepQueueIsOrderedByLSN(void)
  {
  	PGPROC	*proc = NULL;
  	XLogRecPtr lastLSN;
  
! 	lastLSN.xlogid = 0;
! 	lastLSN.xrecoff = 0;
! 
! 	proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! 								   &(WalSndCtl->SyncRepQueue),
! 								   offsetof(PGPROC, syncRepLinks));
! 
! 	while (proc)
  	{
! 		/*
! 		 * Check the queue is ordered by LSN and that multiple
! 		 * procs don't have matching LSNs
! 		 */
! 		if (XLByteLE(proc->waitLSN, lastLSN))
! 			return false;
! 
! 		lastLSN = proc->waitLSN;
  
! 		proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! 									   &(proc->syncRepLinks),
  									   offsetof(PGPROC, syncRepLinks));
  	}
  
  	return true;
--- 609,640 ----
  {
  	PGPROC	*proc = NULL;
  	XLogRecPtr lastLSN;
+ 	int		i;
  
! 	for (i = 0; i < MAX_SYNC_REP_QUEUES; i++)
  	{
! 		lastLSN.xlogid = 0;
! 		lastLSN.xrecoff = 0;
  
! 		proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[i]),
! 									   &(WalSndCtl->SyncRepQueue[i]),
  									   offsetof(PGPROC, syncRepLinks));
+ 
+ 		while (proc)
+ 		{
+ 			/*
+ 			 * Check the queue is ordered by LSN and that multiple
+ 			 * procs don't have matching LSNs
+ 			 */
+ 			if (XLByteLE(proc->waitLSN, lastLSN))
+ 				return false;
+ 
+ 			lastLSN = proc->waitLSN;
+ 
+ 			proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[i]),
+ 										   &(proc->syncRepLinks),
+ 										   offsetof(PGPROC, syncRepLinks));
+ 		}
  	}
  
  	return true;
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 610,623 **** XLogWalRcvSendReply(void)
  	/*
  	 * We can compare the write and flush positions to the last message we
  	 * sent without taking any lock, but the apply position requires a spin
! 	 * lock, so we don't check that unless something else has changed or 10
! 	 * seconds have passed.  This means that the apply log position will
! 	 * appear, from the master's point of view, to lag slightly, but since
! 	 * this is only for reporting purposes and only on idle systems, that's
! 	 * probably OK.
  	 */
  	if (XLByteEQ(reply_message.write, LogstreamResult.Write)
  		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
  		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
  			wal_receiver_status_interval * 1000))
  		return;
--- 610,621 ----
  	/*
  	 * We can compare the write and flush positions to the last message we
  	 * sent without taking any lock, but the apply position requires a spin
! 	 * lock, so we just check whether the last message showed that all
! 	 * flushed WAL data has been applied.
  	 */
  	if (XLByteEQ(reply_message.write, LogstreamResult.Write)
  		&& XLByteEQ(reply_message.flush, LogstreamResult.Flush)
+ 		&& XLByteEQ(reply_message.flush, reply_message.apply)
  		&& !TimestampDifferenceExceeds(reply_message.sendTime, now,
  			wal_receiver_status_interval * 1000))
  		return;
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 1247,1253 **** WalSndShmemInit(void)
  		/* First time through, so initialize */
  		MemSet(WalSndCtl, 0, WalSndShmemSize());
  
! 		SHMQueueInit(&(WalSndCtl->SyncRepQueue));
  
  		for (i = 0; i < max_wal_senders; i++)
  		{
--- 1247,1256 ----
  		/* First time through, so initialize */
  		MemSet(WalSndCtl, 0, WalSndShmemSize());
  
! 		for (i = 0; i < MAX_SYNC_REP_QUEUES; i++)
! 		{
! 			SHMQueueInit(&(WalSndCtl->SyncRepQueue[i]));
! 		}
  
  		for (i = 0; i < max_wal_senders; i++)
  		{
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 747,768 **** static struct config_bool ConfigureNamesBool[] =
  		true, NULL, NULL
  	},
  	{
- 		{"synchronous_commit", PGC_USERSET, WAL_SETTINGS,
- 			gettext_noop("Sets immediate fsync at commit."),
- 			NULL
- 		},
- 		&XactSyncCommit,
- 		true, NULL, NULL
- 	},
- 	{
- 		{"synchronous_replication", PGC_USERSET, WAL_REPLICATION,
- 			gettext_noop("Requests synchronous replication."),
- 			NULL
- 		},
- 		&synchronous_replication,
- 		false, NULL, NULL
- 	},
- 	{
  		{"zero_damaged_pages", PGC_SUSET, DEVELOPER_OPTIONS,
  			gettext_noop("Continues processing past damaged page headers."),
  			gettext_noop("Detection of a damaged page header normally causes PostgreSQL to "
--- 747,752 ----
***************
*** 2818,2823 **** static struct config_enum ConfigureNamesEnum[] =
--- 2802,2817 ----
  	},
  
  	{
+ 		{"synchronous_commit", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Durability level requested at commit."),
+ 			NULL
+ 		},
+ 		&synchronous_commit,
+ 		SYNC_COMMIT_LOCAL_FLUSH, synchronous_commit_options,
+ 		NULL, NULL
+ 	},
+ 
+ 	{
  		{"default_transaction_isolation", PGC_USERSET, CLIENT_CONN_STATEMENT,
  			gettext_noop("Sets the transaction isolation level of each new transaction."),
  			NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 153,159 ****
  #wal_level = minimal			# minimal, archive, or hot_standby
  					# (change requires restart)
  #fsync = on				# turns forced synchronization on or off
! #synchronous_commit = on		# immediate fsync at commit
  #wal_sync_method = fsync		# the default is the first option
  					# supported by the operating system:
  					#   open_datasync
--- 153,167 ----
  #wal_level = minimal			# minimal, archive, or hot_standby
  					# (change requires restart)
  #fsync = on				# turns forced synchronization on or off
! 
! #synchronous_commit = on	# durability level requested
! 							# valid values: on, off, plus these values
! 							# memory = in local memory only, not on disk
! 							# disk = flushed to local disk only (same as "on")
! 							# replica-memory = in memory on standby
! 							# replica-disk = flushed to disk on standby
! 							# replica-visible = changes visible on standby
! 
  #wal_sync_method = fsync		# the default is the first option
  					# supported by the operating system:
  					#   open_datasync
***************
*** 184,193 ****
  #archive_timeout = 0		# force a logfile segment switch after this
  				# number of seconds; 0 disables
  
- # - Replication - User Settings
- 
- #synchronous_replication = off		# does commit wait for reply from standby
- 
  # - Streaming Replication - Server Settings
  
  #synchronous_standby_names = ''	# standby servers that provide sync rep
--- 192,197 ----
*** a/src/include/access/xact.h
--- b/src/include/access/xact.h
***************
*** 52,60 **** extern bool XactReadOnly;
  extern bool DefaultXactDeferrable;
  extern bool XactDeferrable;
  
- /* Asynchronous commits */
- extern bool XactSyncCommit;
- 
  /* Kluge for 2PC support */
  extern bool MyXactAccessedTempRel;
  
--- 52,57 ----
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 19,34 ****
  #include "storage/spin.h"
  #include "utils/guc.h"
  
- #define SyncRepRequested() \
- 	(synchronous_replication && max_wal_senders > 0)
- 
  /* syncRepState */
  #define SYNC_REP_NOT_WAITING		0
  #define SYNC_REP_WAITING			1
  #define SYNC_REP_WAIT_COMPLETE		2
  
  /* user-settable parameters for synchronous replication */
! extern bool synchronous_replication;
  extern char *SyncRepStandbyNames;
  
  /* called by user backend */
--- 19,63 ----
  #include "storage/spin.h"
  #include "utils/guc.h"
  
  /* syncRepState */
  #define SYNC_REP_NOT_WAITING		0
  #define SYNC_REP_WAITING			1
  #define SYNC_REP_WAIT_COMPLETE		2
  
+ typedef enum
+ {
+ 	SYNC_COMMIT_NO_LOCAL_FLUSH = -2,	/* same as old sync_commit = off */
+ 	SYNC_COMMIT_LOCAL_FLUSH,	/* default ACID behaviour */
+ 	SYNC_COMMIT_REMOTE_WRITE,	/* sync rep, wait for write of data on standby */
+ 	SYNC_COMMIT_REMOTE_FLUSH,	/* sync rep, wait for fsync of data on standby */
+ 	SYNC_COMMIT_REMOTE_APPLY	/* sync rep, wait for apply of data on standby */
+ } SyncCommitWaitType;
+ 
+ /*
+  * Accept documented options, plus undocumented variations of "on" and "off".
+  */
+ static const struct config_enum_entry synchronous_commit_options[] = {
+ 	{"memory", SYNC_COMMIT_NO_LOCAL_FLUSH, false},
+ 	{"disk", SYNC_COMMIT_LOCAL_FLUSH, false},
+ 	{"replica-memory", SYNC_COMMIT_REMOTE_WRITE, false},
+ 	{"replica-disk", SYNC_COMMIT_REMOTE_FLUSH, false},
+ 	{"replica-visible", SYNC_COMMIT_REMOTE_APPLY, false},
+ 	{"on", SYNC_COMMIT_LOCAL_FLUSH, false},
+ 	{"off", SYNC_COMMIT_NO_LOCAL_FLUSH, false},
+ 	{"true", SYNC_COMMIT_LOCAL_FLUSH, true},
+ 	{"false", SYNC_COMMIT_NO_LOCAL_FLUSH, true},
+ 	{"yes", SYNC_COMMIT_LOCAL_FLUSH, true},
+ 	{"no", SYNC_COMMIT_NO_LOCAL_FLUSH, true},
+ 	{"1", SYNC_COMMIT_LOCAL_FLUSH, true},
+ 	{"0", SYNC_COMMIT_NO_LOCAL_FLUSH, true},
+ 	{NULL, 0, false}
+ };
+ 
+ /* sync replication has separate queues for write, fsync and apply */
+ #define MAX_SYNC_REP_QUEUES			3
+ 
  /* user-settable parameters for synchronous replication */
! extern int synchronous_commit;
  extern char *SyncRepStandbyNames;
  
  /* called by user backend */
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 68,82 **** extern WalSnd *MyWalSnd;
  typedef struct
  {
  	/*
! 	 * Synchronous replication queue. Protected by SyncRepLock.
  	 */
! 	SHM_QUEUE SyncRepQueue;
  
  	/*
  	 * Current location of the head of the queue. All waiters should have
  	 * a waitLSN that follows this value. Protected by SyncRepLock.
  	 */
! 	XLogRecPtr	lsn;
  
  	/*
  	 * Are any sync standbys defined?  Waiting backends can't reload the
--- 68,82 ----
  typedef struct
  {
  	/*
! 	 * Synchronous replication queues. Protected by SyncRepLock.
  	 */
! 	SHM_QUEUE SyncRepQueue[MAX_SYNC_REP_QUEUES];
  
  	/*
  	 * Current location of the head of the queue. All waiters should have
  	 * a waitLSN that follows this value. Protected by SyncRepLock.
  	 */
! 	XLogRecPtr	lsn[MAX_SYNC_REP_QUEUES];
  
  	/*
  	 * Are any sync standbys defined?  Waiting backends can't reload the
#11 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#9)
Re: Additional options for Sync Replication

On 28.03.2011 15:34, Simon Riggs wrote:

On Mon, Mar 28, 2011 at 12:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 28.03.2011 13:14, Simon Riggs wrote:

On Mon, Mar 28, 2011 at 10:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

You have no basis on which to prevent this.

It's also already on the Open Items list, put there by you.

Why is this even a discussion point?

FWIW, I agree this is an additional feature that we shouldn't be messing
with at this point in the release cycle.

There is an open item about what the UI is for sync commit/sync rep,
which is the subject of this patch.

plus new functionality. For the UI part, you just need to change GUCs.

The 'apply' mode would be quite interesting, it would make it easier to
build load-balancing clusters. But the patch isn't up to the task on that
yet - the 'apply' status report is only sent after
wal_receiver_status_interval fills up, so you get long delays.

Yes it's up to the task, you misread it. It will continue sending
replies while apply <> flush and then it will fall back to the
behaviour you mention.

There's nothing to wake up the walreceiver after applying a commit record.

Oh, you're relying on the periodic wakeups specified by
NAPTIME_PER_CYCLE (100ms). That's still on average a 50ms delay to every
commit. We should try to eliminate these polling loops, not make them
more important. In fact, we should raise NAPTIME_PER_CYCLE to, say,
1000ms, to reduce spurious wakeups on an idle system.

Am I reading the patch correctly that if the standby hasn't applied all
WAL yet, you send a reply message at every wakeup, whether or not any
progress has been made since last time? So if you have a
long-running-transaction in the standby, for example, conflicting with
WAL recovery, the standby will keep sending a status report to the
master every 100ms.
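
To spell out the behaviour I'm reading from the patch, here is a rough
sketch of the walreceiver loop; the overall shape and the names
wait_for_wal_or_timeout(), status_interval_elapsed(), applyPtr and
flushPtr are illustrative only, not the actual code:

/* Illustrative sketch of the walreceiver main loop, not the real code. */
static XLogRecPtr applyPtr;     /* latest apply position (updated elsewhere) */
static XLogRecPtr flushPtr;     /* latest flushed position (updated elsewhere) */

static void
WalReceiverLoopSketch(void)
{
    for (;;)
    {
        /* Sleep at most NAPTIME_PER_CYCLE (100ms) waiting for new WAL. */
        wait_for_wal_or_timeout(NAPTIME_PER_CYCLE);

        /*
         * With the patch, a reply goes out on every wakeup for as long as
         * the apply position lags the flush position, whether or not any
         * WAL was replayed since the previous reply.  Only once apply has
         * caught up do replies fall back to wal_receiver_status_interval.
         */
        if (XLByteLT(applyPtr, flushPtr) || status_interval_elapsed())
            XLogWalRcvSendReply();
    }
}

So an "apply" waiter on the master is released within roughly one 100ms
nap of the commit record being replayed, but only because of the polling
traffic described above.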

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#8)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

but there is certainly no
open item for adding additional sync rep modes.

In your opinion.

We will have to live with the UI for a long time, yes. I'm trying to
get it right, whether that includes adding an obvious/easy omission or
renaming things to make better sense.

Your other changes make this sensible, and feedback I received after a
recent presentation tells me that people will expect it to work with
the two additional options.

I would prefer further feedback before solidifying this design, but if
we must solidify it now, then I prefer to do that with all 5 options.
As originally submitted for this commit fest.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#13Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#12)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 8:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, Mar 28, 2011 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

but there is certainly no
open item for adding additional sync rep modes.

In your opinion.

Well, as you just pointed out yourself a few minutes ago, I did write
the open item in question... I fancy I knew what I meant.

I would prefer further feedback before solidifying this design, but if
we must solidify it now, then I prefer to do that with all 5 options.
As originally submitted for this commit fest.

Even if it were true that these options were in the patch submitted
for this CommitFest, that wouldn't be a reason to commit them now,
because the CommitFest is over and has been for several weeks. But it
turns out they weren't.

http://archives.postgresql.org/message-id/1295127631.3282.100.camel@ebony

+/*
+ * Queuing code is written to allow later extension to multiple
+ * queues. Currently, we use just one queue (==FSYNC).
+ *
+ * XXX We later expect to have RECV, FSYNC and APPLY modes.
+ */

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#12)
Re: Additional options for Sync Replication

On 28.03.2011 15:54, Simon Riggs wrote:

On Mon, Mar 28, 2011 at 1:14 PM, Robert Haas<robertmhaas@gmail.com> wrote:

but there is certainly no
open item for adding additional sync rep modes.

In your opinion.

Huh? First you say that Robert added an open item about this to the
list, then he says that the open item wasn't about additional sync rep modes,
and you say "in your opinion".

We will have to live with the UI for a long time, yes. I'm trying to
get it right, whether that includes adding an obvious/easy omission or
renaming things to make better sense.

Your other changes make this sensible, and feedback I received after a
recent presentation tells me that people will expect it to work with
the two additional options.

I would prefer further feedback before solidifying this design, but if
we must solidify it now, then I prefer to do that with all 5 options.
As originally submitted for this commit fest.

It's too late to be doing this. The patch isn't ready to be committed,
and there's high potential to introduce new bugs or usability issues.
And regarding the UI, I'm not totally convinced that a four-state GUC
set in the master is the way to go. It would feel at least as logical to
control this in the standby.

I don't really want to get into that discussion, though. My point is
that if we wanted to still sneak this in, then we would have to have
those discussions. -1. Let's fix the existing issues, and release.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#11)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 1:51 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

The 'apply' mode would be quite interesting, it would make it easier to
build load-balancing clusters. But the patch isn't up to the task on that
yet - the 'apply' status report is only sent after
wal_receiver_status_interval fills up, so you get long delays.

Yes it's up to the task, you misread it. It will continue sending
replies while apply <> flush and then it will fall back to the
behaviour you mention.

There's nothing to wake up the walreceiver after applying a commit record.

Oh, you're relying on the periodic wakeups specified by NAPTIME_PER_CYCLE
(100ms). That's still on average a 50ms delay to every commit.

Rubbish. "Every commit", no way. How could you think that?

That delay affects only the last commit of a sequence of commits after
which the server goes quiet. And that only applies to people that have
specifically chosen to wait for the apply, which as you say means they
may have fairly long waits anyway.

We should try
to eliminate these polling loops, not make them more important. In fact, we
should raise NAPTIME_PER_CYCLE to, say, 1000ms, to reduce spurious wakeups
on an idle system.

I'm OK with changing the polling loops. Is there a big problem there?

I think it would take you about an hour to rewrite that, if you wanted
to do so. But it's not a necessary step to do that, especially when
we're discussing whether Latches are actually as portable as we think.
That sounds like more risk for a slight gain in performance.

Am I reading the patch correctly that if the standby hasn't applied all WAL
yet, you send a reply message at every wakeup, whether or not any progress
has been made since last time? So if you have a long-running-transaction in
the standby, for example, conflicting with WAL recovery, the standby will
keep sending a status report to the master every 100ms.

Sure.

First, is that a problem? Why? The WalReceiver isn't busy doing
anything else at that point, by definition. The WalSender isn't doing
anything then either, by definition. Both are used to higher send
rates.

Second, if that really is a problem, it sounds like a simple "if" test
added to the reply code.

Third, the conflict issues are much reduced as well in this release.
For this exact purpose.

So I don't see any blockers in what you say.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#14)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 2:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

It would feel at least as logical to control this in the standby.

Now you are being ridiculous. You've spoken strongly against this at
every single step of this journey.

We're well past the stage of putting anything in that could do that,
as well you know.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#16)
Re: Additional options for Sync Replication

On 28.03.2011 16:11, Simon Riggs wrote:

On Mon, Mar 28, 2011 at 2:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

It would feel at least as logical to control this in the standby.

Now you are being ridiculous. You've spoken strongly against this at
every single step of this journey.

I was thinking specifically about whether flush vs. write (vs. apply,
maybe) here. It would make sense to set that in the standby. You might
even want to set it differently on different standbys.

What I was strongly against is the action at a distance, where setting a
GUC in a standby suddenly makes the master to wait for acks from that
server. That's dangerous, but I don't see such danger in setting the
level of synchronicity in the standby, once you've decided that it's a
synchronous standby.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#18Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#17)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 10:11 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 28.03.2011 16:11, Simon Riggs wrote:

On Mon, Mar 28, 2011 at 2:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

 It would feel at least as logical to control this in the standby.

Now you are being ridiculous. You've spoken strongly against this at
every single step of this journey.

I was thinking specifically about whether flush vs. write (vs. apply, maybe)
here. It would make sense to set that in the standby. You might even want to
set it differently on different standbys.

What I was strongly against is the action at a distance, where setting a GUC
in a standby suddenly makes the master wait for acks from that server.
That's dangerous, but I don't see such danger in setting the level of
synchronicity in the standby, once you've decided that it's a synchronous
standby.

It might not be dangerous, but the standby currently sends write,
flush, and apply positions back separately, so the master must decide
which of those to pay attention to, unless we rework the whole design.
I actually think the current design is quite nice and am in no hurry
to rejigger that particular part of it.

However, I do think that we may need or want to rejigger the timing of
replies for performance. It might be, for example, that when waiting
for the "write", the master should somehow indicate to the standby the
earliest write-LSN that could release waiters, so that a standby can
send back a reply the instant it hits that LSN, without waiting to
finish reading everything that's buffered up as it does now. For
apply, I agree with your feeling that the startup process needs to
wake walreceiver via a latch when the apply position advances, but
here again it might be beneficial to send the first interesting LSN
from master to standby to cut down unnecessary wakeups.

We also need to consider the issue raised elsewhere - that a naive
implementation of this could allow the commit to become visible on the
standby before it becomes visible on the master. That would be
expensive to prevent, because you'd need an additional set of
master-standby interlocks, but I think at least one person was arguing
that it was necessary for correctness - my memory of the details is
fuzzy at the moment.
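
Going back to the latch idea: as a very rough sketch - purely
illustrative, since the latch and LSN fields used below don't exist in
WalRcvData today - the startup process could do something like this
after replaying a commit record:

/*
 * Hypothetical sketch: assumes WalRcvData grew a 'latch' owned by the
 * walreceiver and a 'lastAppliedRecPtr' field.  Neither exists in the
 * current tree.
 */
static void
WakeWalReceiverForApply(XLogRecPtr appliedUpto)
{
    /* use volatile pointer to prevent code rearrangement */
    volatile WalRcvData *walrcv = WalRcv;

    SpinLockAcquire(&walrcv->mutex);
    walrcv->lastAppliedRecPtr = appliedUpto;    /* hypothetical field */
    SpinLockRelease(&walrcv->mutex);

    /* wake the walreceiver so it can send an "apply" reply immediately */
    SetLatch(&walrcv->latch);                   /* hypothetical field */
}

That would get rid of the fixed 100ms polling for the apply case, at the
cost of one latch operation per applied commit record.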

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#17)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 3:11 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 28.03.2011 16:11, Simon Riggs wrote:

On Mon, Mar 28, 2011 at 2:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com>  wrote:

 It would feel at least as logical to control this in the standby.

Now you are being ridiculous. You've spoken strongly against this at
every single step of this journey.

I was thinking specifically about whether flush vs. write (vs. apply, maybe)
here. It would make sense to set that in the standby. You might even want to
set it differently on different standbys.

What I was strongly against is the action at a distance, where setting a GUC
in a standby suddenly makes the master wait for acks from that server.
That's dangerous, but I don't see such danger in setting the level of
synchronicity in the standby, once you've decided that it's a synchronous
standby.

The action at a distance thought still applies, since you would wait
more or less time depending upon how this parameter was set on the
standby.
I can't see how this situation differs. Your own argument still applies.

I would point out that I argued against you, but was persuaded towards
your approach. The code won't easily allow what you suggest. There
were multiple approaches to implementation, but there aren't any more.
So you can't say the UI is unclear; it's very clear how we implement
things from here - the way this patch implements it - on the master.

This is a simple patch, containing functionality already discussed and
agreed and for which code was submitted in this CF.

There's no reason to block this, only for us to decide the final
naming of parameter/s and options.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#20Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#18)
Re: Additional options for Sync Replication

Robert Haas <robertmhaas@gmail.com> wrote:

We also need to consider the issue raised elsewhere - that a naive
implementation of this could allow the commit to become visible on
the standby before it becomes visible on the master. That would
be expensive to prevent, because you'd need an additional set of
master-standby interlocks, but I think at least one person was
arguing that it was necessary for correctness - my memory of the
details is fuzzy at the moment.

I remember expressing concern about that; I don't know if anyone
else did. After some discussion, I was persuaded that the use cases
where it would matter are narrow enough that documentation of the
issue should be enough.

-Kevin

#21Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#19)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 10:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

This is a simple patch, containing functionality already discussed and
agreed and for which code was submitted in this CF.

These statements are simply not accurate. It isn't a simple patch,
the details of how the write and apply modes should work haven't been
fully discussed, and there is no version of your patch submitted for
the last CommitFest which contains any of this functionality.
Moreover, that CommitFest is OVER, so your repeated references to it
as "this CF" rather than "the last CF" do not match my view of how the
world works.

There's no reason to block this, only for us to decide the final
naming of parameter/s and options.

At the end of the day, we make these decisions by consensus, and three
people have said quite clearly that they believe it is too late to
apply something like this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Greg Stark
gsstark@mit.edu
In reply to: Robert Haas (#18)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 3:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

It might not be dangerous, but the standby currently sends write,
flush, and apply positions back separately, so the master must decide
which of those to pay attention to, unless we rework the whole design.
 I actually think the current design is quite nice and am in no hurry
to rejigger that particular part of it.

In particular what I like about the current design is that there's no
reason you shouldn't be able to change the commit durability setting
per-transaction. You might want to have logging records be
asynchronous, regular operations be synchronous on the master, and
operations involving money block until the slave has received them or
synced them or even applied them. Or you might want to mark just the
transactions affecting the data read by your load-balanced read-only
queries on the slaves as blocking until the slave has applied them,
so people don't see inconsistent old data making it look like the
transaction failed.
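
Sketching that per-transaction choice with the enum values from the patch
(illustrative only - commitRecPtr stands in for the commit record's LSN,
and this is not the actual commit path):

    switch (synchronous_commit)     /* can be SET LOCAL per transaction */
    {
        case SYNC_COMMIT_NO_LOCAL_FLUSH:
            /* asynchronous commit: don't wait for the local WAL flush */
            break;

        case SYNC_COMMIT_LOCAL_FLUSH:
            XLogFlush(commitRecPtr);    /* wait for the local fsync only */
            break;

        default:
            /* remote write/flush/apply: flush locally, then wait on standby */
            XLogFlush(commitRecPtr);
            SyncRepWaitForLSN(commitRecPtr);
            break;
    }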

--
greg

#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#21)
Re: Additional options for Sync Replication

On Mon, Mar 28, 2011 at 4:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 28, 2011 at 10:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

This is a simple patch, containing functionality already discussed and
agreed and for which code was submitted in this CF.

These statements are simply not accurate.

Then you are mistaken on every single point.

It isn't a simple patch,

Yes, it is. Most of the patch is docs and GUC changes. The current
patch was specifically designed by me to allow this to be added. One
queue is replaced easily by 3 queues, which essentially touches only 2
places in the current code, sleep and wakeup. And it touches them in
very simple ways. Variable to Array. Easy.
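
To illustrate the sleep-side change (a sketch only - the function name and
comments below are illustrative, not lifted from the patch):

/* Pick which of the three queues a committing backend should sleep on. */
static int
SyncRepWaitQueue(void)
{
    switch (synchronous_commit)
    {
        case SYNC_COMMIT_REMOTE_WRITE:
            return 0;       /* released when the standby has written */
        case SYNC_COMMIT_REMOTE_FLUSH:
            return 1;       /* released when the standby has fsynced */
        case SYNC_COMMIT_REMOTE_APPLY:
            return 2;       /* released when the standby has applied */
        default:
            return -1;      /* no sync rep wait at all */
    }
}

The wakeup side then walks SyncRepQueue[mode] and lsn[mode] rather than
the single queue and lsn, releasing any backend whose waitLSN has been
passed by the standby's reported position for that mode. Given how the
enum is laid out, the three remote modes even map directly onto queue
indexes 0, 1 and 2.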

the details of how the write and apply modes should work haven't been
fully discussed,

The original patch had all 3 modes side-by-side, all working identically.
There has never been any objection or discussion about that, and it's
been part of the design for months.
What different approach is there?

and there is no version of your patch submitted for
the last CommitFest which contains any of this functionality.

v9, submitted on 15 Jan contains this functionality.

Moreover, that CommitFest is OVER, so your repeated references to it
as "this CF" rather than "the last CF" do not match my view of how the
world works.

We are still applying patches, for various reasons. I must have missed
your objections to Tom's proposals to fix un-neat things about
collations, or his additions to the extensions features. Go say what
you've said here to him as well.

There's no reason to block this, only for us to decide the final
naming of parameter/s and options.

At the end of the day, we make these decisions by consensus, and three
people have said quite clearly that they believe it is too late to
apply something like this.

"Something like this". Well given that all of the facts on which
you've based your decision are wrong, I expect you to revise your
decision.

There is nothing about this patch that makes it possible or sensible
to exclude it. You wish to continue to discuss network timeouts. Why
are they "in" and this "out"? Both were submitted as part of my
original patch....

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#24Yeb Havinga
yebhavinga@gmail.com
In reply to: Greg Stark (#3)
Re: Additional options for Sync Replication

On Sun, Mar 27, 2011 at 10:45 PM, Simon Riggs<simon@2ndquadrant.com> wrote:

I was hoping to fine tune/tweak Sync Rep after feedback during beta,
but my understanding of current consensus is that that will be too
late to make user visible changes. So I'm proposing this change now,
before Beta, rather than during Beta.

For what it's worth I think this is a simplification. We have:

1) Development when new features are added

2) Feature freeze - when those features are tweaked and fixed based on
our own testing but no new features added

3) Beta - when features are tweaked and fixed in response to user
suggestions but no new features added

To put it briefly: when testing the latest syncrep patches I've
wasted time looking for where the recv|fsync|apply API was, only to find
out it was gone. *shrug* I didn't need it for my use case. But for others
it might well be frustrating to find out that what's currently called
syncrep cannot be configured in a way that suits them, and that they might
have wasted considerable time finding that out.

The DBA interface for recv|fsync|apply seems to be pretty stable, so
supporting that for years should be without risk. As for how it works under
the hood - the beta period seems like *the* opportunity to attract major
testing from all the people waiting to get their hands on syncrep.

An aspect of a good product is that it doesn't waste users' time, just as a
good piece of code needs little comment. Alarm bells should go off when
somebody is about to write a large piece of documentation describing what a
feature doesn't support. It would be better to just support it
(recv|fsync|apply), or no syncrep at all. Syncrep is incomplete without it.

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

#25Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Yeb Havinga (#24)
Re: Additional options for Sync Replication

Yeb Havinga <yebhavinga@gmail.com> writes:

The DBA interface for recv|fsync|apply seems to be pretty stable, so
supporting that for years should be without risk. As for how it works under
the hood - the beta period seems like *the* opportunity to attract major
testing from all the people waiting to get their hands on syncrep.

+1

It would be better to just support it (recv|fsync|apply),
or no syncrep at all. Syncrep is incomplete without it.

Agreed.

More than that, I think we should evaluate this patch on a cost/benefit
ratio, rather than trying to apply to it all those procedural fences
that we don't have, and that we don't have the size to benefit from.

This whole thread only managed to raise the cost of the feature, but
compared to its benefits, it's still a wash. Is there any good reason
I've missed for asking all our users to wait, at best, another year to
get the SyncRep waiting behavior that makes sense and has been agreed on
for a very long time already?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

PS: we already tweaked the UI in such a way that controlling this
feature from the standby makes no sense. What we talked about was
to set up on the master which durability level you need, and on each
standby which one you're able to provide. Then you mix & match.
That won't fly with the current way of setting up the sync standby list.

So current recv|fsync|apply patch is IMO finishing the 9.1 work.

#26Robert Haas
robertmhaas@gmail.com
In reply to: Dimitri Fontaine (#25)
Re: Additional options for Sync Replication

On Tue, Mar 29, 2011 at 3:49 AM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

It would be better to just support it (recv|fsync|apply),
or no syncrep at all. Syncrep is incomplete without it.

Agreed.

I have trouble viewing the idea that it would be better not to ship
sync rep at all than to add more features to it as a serious proposal.
Presumably, anyone who is sad that sync rep doesn't have all of the
options they might want would be even sadder to hear that we went
through a whole development cycle and ended up with nothing at all.
Even if we did agree to take this patch, there will certainly be more
features that someone might want and not have, such as the ability to
sync with multiple standbys at once.

More than that, I think we should evaluate this patch on a cost/benefit
ratio, rather than trying to apply to it all those procedural fences
that we don't have, and that we don't have the size to benefit from.

As a community, we've adopted a development plan that proceeds in
cycles. For the last several releases, we have had four CommitFests
in each release cycle, followed by a feature freeze and eventually by
beta and final release. It's certainly a valid question to ask how
well that procedure has served us. It does not seem likely to me that
we can continue to produce quality releases if we don't at some point
cut off the flow of new features into the source tree and work on
stabilizing the code we've already got, and I believe the point for
that was agreed by a large number of developers who sat in a room at
PGCon last year to be on or about February 15th. We ended up
extending that by a couple of weeks, to make sure that we had a
process that was FAIR: we didn't want patches that had been in the
pipeline for a very long time to get postponed to 9.2 because no
committer had had a chance to work on them yet. However, we also
bumped MANY patches to 9.2 because they weren't in sufficiently good
shape soon enough. If we accept this patch now because a bunch of
people say they really, really want it, isn't that unfair to the
people to whom we've already said "sorry, the deadline has passed"?

Of course, there is always going to be some gray area. I argued for
committing the replication_timeout patch because I believe the fact
that we haven't got that feature is almost a bug - it interferes
significantly with the usability of replication in general, and it
will be an even more serious problem with sync rep, where a hung
standby connection will not only mean that nothing is replicating but
also that no write transactions can be processed at all. However, you
could make the opposite argument - that it's really a new feature -
and therefore we ought not to commit it. So far no one has taken that
position, but it's certainly a reasonable argument. Likewise, there
is ongoing discussion on the collations thread about which of those
changes are necessary for this release, and which ones are things that
ought to be postponed to a future release. I haven't gotten too
involved in those discussions because I don't really understand the
underlying issues, but I think that's an important discussion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Robert Haas (#26)
Re: Additional options for Sync Replication

Robert Haas <robertmhaas@gmail.com> writes:

If we accept this patch now because a bunch of
people say they really, really want it, isn't that unfair to the
people to whom we've already said "sorry, the deadline has passed"?

No, because each time we're talking procedures we're forgetting about a
simple fact. Committers have direct responsibility for the code; that
is why pushing work from non-committers takes so much time. Committing
your own code, you don't have a steep learning curve, and you don't have
to understand the use case and get convinced.

So the rules are not the same for committer patches and contributor
patches, and there's no good in trying to have them the same or
pretending they are. In particular, only committers are able to finish
and polish the work between the last commit fest and beta, and then they
will be on the hook to get to release candidate and release.

But you know all that better than I do.

I don't want a release as soon as possible, I want the best we are able
to provide, and I think adding in the current $subject patch helps reach
this goal. <include "barring show stoppers" QA disclaimer>

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#26)
Re: Additional options for Sync Replication

On Tue, Mar 29, 2011 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

However, we also
bumped MANY patches to 9.2 because they weren't in sufficiently good
shape soon enough.  If we accept this patch now because a bunch of
people say they really, really want it, isn't that unfair to the
people to whom we've already said "sorry, the deadline has passed"?

Of course, there is always going to be some gray area.  I argued for
committing the replication_timeout patch because I believe the fact
that we haven't got that feature is almost a bug - it interferes
significantly with the usability of replication in general, and it
will be an even more serious problem with sync rep, where a hung
standby connection will not only mean that nothing is replicating but
also that no write transactions can be processed at all.  However, you
could make the opposite argument - that it's really a new feature -
and therefore we ought not to commit it.  So far no one has taken that
position, but it's certainly a reasonable argument.

The questions and issues you raise are real and I have no idea how to
judge them. All I know is that if we spent less time discussing
procedural issues, we'd get a lot more done and much more amicably
also.

I think "What makes PostgreSQL better?". I think about a rounded
feature set and treat that just as I would a bug. I know that's not
the same for everybody.

I'm happy there are people looking at replication timeouts also.
Regrettably I don't have enough time for everything I would like to
see.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#29Robert Haas
robertmhaas@gmail.com
In reply to: Dimitri Fontaine (#27)
Re: Additional options for Sync Replication

On Tue, Mar 29, 2011 at 10:48 AM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

So the rules are not the same for committer patches and contributor
patches, and there's no good in trying to have them the same or
pretending they are.  In particular, only committers are able to finish
and polish the work between the last commit fest and beta, and then they
will be on the hook to get to release candidate and release.

But you know all that better than I do.

Committers can and do get away with slipping things in later than
non-committers, and to some extent that's OK for the reasons you
mention. But Alvaro was very gracious in conceding that it was a bit
too late to push in his key lock patch, as was his employer, JD. They
didn't like it, but they accepted that it was necessary to move the
community, overall, forward, and to avoid a really long beta period
during which, really, nobody gets to do anything at all interesting.
We cannot have one standard for features that CommandPrompt really
wants committed and a different standard for features that 2ndQuadrant
or, say, EnterpriseDB, really want committed.

I completely disagree that committers are the only ones who can finish
and polish work between the last CommitFest and beta.  Fujii Masao,
Kevin Grittner, Yeb Havinga, and Yamamoto Takashi all come to mind as
people who have been very, very helpful in moving us toward beta
through careful testing and code review. I have no fear at all about
our ability to maintain SSI even though there is not one committer who
fully understands it all, because every bug report that comes in gets
a response within hours and a patch within days. The limiting factor
there has actually been how long it's taken someone to look and test
those patches, not how quickly they've been produced. I think the
reality is exactly the other way around: committers are not the people
who get the opportunity to fix other people's bugs; they are the
people who are *expected* to fix other people's bugs when no one else
will. If it's your perception that the (mostly quite minor) changes
that I've made to sync rep are somehow for purposes of
self-aggrandizement or a desire to micromanage everything that happens
in the backend, then I'm sorry for that. I'll readily admit that I
have strong opinions on lots of topics, especially but not only
PostgreSQL-related topics; but I would be way happier to have spent
the last couple of weeks developing new features than swatting bugs.
Had I done that, though, I think that not as many bugs would have
gotten swatted. So I did it. Whether that makes me a helpful
community guy who tries to ensure a quality release or a total jerk
who interjects his nose into other people's business is, of course, a
matter of opinion.

Even today, anyone who would like to write a patch to address more
than one of the open items is more than welcome to do so, and I would
really appreciate it, even if I or someone else ends up having to adjust
it a bit before committing. There are at least three issues on the
open items list that are obvious candidates for someone to pick up:

- fix attinhcount tracking
- Typed-tables patch broke pg_upgrade
- comments on SQL/MED objects

I volunteered to pick up the last one, but I'd be more than happy if
the person who reported the problem had already provided the patch.
Or if someone else wanted to write the patch. That would be awesome.
In my view, the question we should be asking ourselves here is not -
why are Tom and Robert getting to make all these commits? - but -
where is everybody else who should be helping out? If the answer is
"well we don't have time to work on this because we all have day jobs
we have to do to get paid", then I accept that. But that moves
getting to commit changes at a late date from the "privilege" bucket
into the "responsibility" bucket.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#29)
Re: Additional options for Sync Replication

On Tue, Mar 29, 2011 at 5:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 29, 2011 at 10:48 AM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

So the rules are not the same for committer patches and contributor
patches, and there's no good in trying to have them the same or
pretending they are.  In particular, only committers are able to finish
and polish the work between the last commit fest and beta, and then they
will be on the hook to get to release candidate and release.

But you know all that better than I do.

Committers can and do get away with slipping things in later than
non-committers, and to some extent that's OK for the reasons you
mention.  But Alvaro was very gracious in conceding that it was a bit
too late to push in his key lock patch, as was his employer, JD.  They
didn't like it, but they accepted that it was necessary to move the
community, overall, forward, and to avoid a really long beta period
during which, really, nobody gets to do anything at all interesting.
We cannot have one standard for features that CommandPrompt really
wants committed and a different standard for features that 2ndQuadrant
or, say, EnterpriseDB, really want committed.

I completely disagree that committers are the only ones who can finish
and polish work between the last CommitFest and beta.  Fujii Masao,
Kevin Grittner, Yeb Havinga, and Yamamoto Takashi all come to mind as
people who have been very, very helpful in moving us toward beta
through careful testing and code review.  I have no fear at all about
our ability to maintain SSI even though there is not one committer who
fully understands it all, because every bug report that comes in gets
a response within hours and a patch within days.  The limiting factor
there has actually been how long it's taken someone to look and test
those patches, not how quickly they've been produced.  I think the
reality is exactly the other way around: committers are not the people
who get the opportunity to fix other people's bugs; they are the
people who are *expected* to fix other people's bugs when no one else
will.  If it's your perception that the (mostly quite minor) changes
that I've made to sync rep are somehow for purposes of
self-aggrandizement or a desire to micromanage everything that happens
in the backend, then I'm sorry for that.  I'll readily admit that I
have strong opinions on lots of topics, especially but not only
PostgreSQL-related topics; but I would be way happier to have spent
the last couple of weeks developing new features than swatting bugs.
Had I done that, though, I think that not as many bugs would have
gotten swatted.  So I did it.  Whether that makes me a helpful
community guy who tries to ensure a quality release or a total jerk
who interjects his nose into other people's business is, of course, a
matter of opinion.

Even today, anyone who would like to write a patch to address more
than one of the open items is more than welcome to do so, and I would
really appreciate it, even if I or someone else ends up having to adjust
it a bit before committing.  There are at least three issues on the
open items list that are obvious candidates for someone to pick up:

- fix attinhcount tracking
- Typed-tables patch broke pg_upgrade
- comments on SQL/MED objects

I volunteered to pick up the last one, but I'd be more than happy if
the person who reported the problem had already provided the patch.
Or if someone else wanted to write the patch.  That would be awesome.
In my view, the question we should be asking ourselves here is not -
why are Tom and Robert getting to make all these commits? - but -
where is everybody else who should be helping out?  If the answer is
"well we don't have time to work on this because we all have day jobs
we have to do to get paid", then I accept that.  But that moves
getting to commit changes at a late date from the "privilege" bucket
into the "responsibility" bucket.

Robert,

Everybody wants us to be polite and respectful with each other.

Writing such long emails seems to be just filibustering to me. I doubt
anyone has read and considered every word, there are just too many. A
form of disrespect.

The main thing I note is that you could have reviewed my patch in the time
it's taken to discuss these procedural "issues". Why are they more
important?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#31Josh Berkus
josh@agliodbs.com
In reply to: Dimitri Fontaine (#27)
Re: Additional options for Sync Replication

On 3/29/11 7:48 AM, Dimitri Fontaine wrote:

I don't want a release as soon as possible, I want the best we are able
to provide, and I think adding in the current $subject patch helps reach
this goal. <include "barring show stoppers" QA disclaimer>

There will *always* be more work we can do on sync rep. If we hold the
release until we're done tinkering with it, we'll never release. Our
project has had a chronic issue with not being able to progress from
feature freeze to release in a timely fashion for years, and the sort of
argument expressed above isn't helping.

The relevant question is: is sync rep unusable by a large portion of our
users because this feature was stripped from what got committed, or is
it a value-add feature which makes synch rep nicer? If the former, it's
a bug and we should patch it; if the latter, it should wait until 9.2.

Writing such long emails seems to be just filibustering to me. I doubt
anyone has read and considered every word, there are just too many. A
form of disrespect.

Simon, Robert has been nothing but respectful to you. You can't accuse
him of being rude and hostile just because he disagrees with you.
Overselling his case, certainly. But very politely.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#32Christopher Browne
cbbrowne@gmail.com
In reply to: Josh Berkus (#31)
Re: Additional options for Sync Replication

On Tue, Mar 29, 2011 at 2:48 PM, Josh Berkus <josh@agliodbs.com> wrote:

Writing such long emails seems to be just filibustering to me. I doubt
anyone has read and considered every word, there are just too many. A
form of disrespect.

Simon, Robert has been nothing but respectful to you.  You can't accuse
him of being rude and hostile just because he disagrees with you.
Overselling his case, certainly.  But very politely.

If hostile's wanted, then we can come up with something hostile ;-).

Much better is to try to take a different perspective.

There are a number of features that were *turned down* for 9.1,
deferred for 9.2, because they weren't sufficiently ready at the
appointed time. There were 26 patches "Returned with Feedback"; there
are likely several of them which, if given the special handling given
to Synchronous Replication, could have been drawn into 9.1.

It's only fair to try to get the release out, so that these people
who have been waiting patiently have the opportunity to get their
submissions into 9.2. The longer they wait, the more reason they have
to get hostile, and with better reason...
--
http://linuxfinances.info/info/linuxdistributions.html

#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#31)
Re: Additional options for Sync Replication

On Tue, Mar 29, 2011 at 7:48 PM, Josh Berkus <josh@agliodbs.com> wrote:

On 3/29/11 7:48 AM, Dimitri Fontaine wrote:

I don't want a release as soon as possible, I want the best we are able
to provide, and I think adding in the current $subject patch helps reach
this goal.  <include "barring show stoppers" QA disclaimer>

There will *always* be more work we can do on sync rep.  If we hold the
release until we're done tinkering with it, we'll never release.  Our
project has had a chronic issue with not being able to progress from
feature freeze to release in a timely fashion for years, and the sort of
argument expressed above isn't helping.

The relevant question is: is sync rep unusable by a large portion of our
users because this feature was stripped from what got committed, or is
it a value-add feature which makes synch rep nicer?  If the former, it's
a bug and we should patch it; if the latter, it should wait until 9.2.

That's not relevant. Can something small be added for large benefit? Yes!
It's a complete misrepresentation to suggest this would hold up the
release in any way, and all the actual technical comments have been
easily dismissed. We've already used more time writing these emails
than it took me to write the patch (< 2 hours). This can still be
reviewed and committed easily.

This is not a new feature. It's something that's been in the design
for months, was submitted in the recent CF and is a minor change with
low risk.

Given we've all been here before, I've done my homework on what is
acceptable at this stage and this passes. There is nothing different
between this and the replication timeout patch. I'm not suggesting
that should be yanked as well, BTW.

And I'm not getting paid a single penny for this. Just me, trying to
add value for the PostgreSQL project.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#34Simon Riggs
simon@2ndQuadrant.com
In reply to: Josh Berkus (#31)
Re: Additional options for Sync Replication

On Tue, Mar 29, 2011 at 7:48 PM, Josh Berkus <josh@agliodbs.com> wrote:

Writing such long emails seems to be just filibustering to me. I doubt
anyone has read and considered every word, there are just too many. A
form of disrespect.

Simon, Robert has been nothing but respectful to you.  You can't accuse
him of being rude and hostile just because he disagrees with you.
Overselling his case, certainly.  But very politely.

I haven't accused Robert of being rude or hostile, when did I say that?
He seems perfectly polite to me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#35Fujii Masao
masao.fujii@gmail.com
In reply to: Robert Haas (#26)
Re: Additional options for Sync Replication

On Tue, Mar 29, 2011 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 29, 2011 at 3:49 AM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

It would be better to just support it (recv|fsync|apply),
or no syncrep at all. Syncrep is incomplete without it.

Agreed.

I have trouble viewing the idea that it would be better not to ship
sync rep at all than to add more features to it as a serious proposal.
 Presumably, anyone who is sad that sync rep doesn't have all of the
options they might want would be even sadder to hear that we went
through a whole development cycle and ended up with nothing at all.
Even if we did agree to take this patch, there will certainly be more
features that someone might want and not have, such as the ability to
sync with multiple standbys at once.

More than that, I think we should evaluate this patch on a cost/benefit
ratio, rather than trying to apply to it all those procedural fences
that we don't have, and that we don't have the size to benefit from.

As a community, we've adopted a development plan that proceeds in
cycles.  For the last several releases, we have had four CommitFests
in each release cycle, followed by a feature freeze and eventually by
beta and final release.  It's certainly a valid question to ask how
well that procedure has served us.  It does not seem likely to me that
we can continue to produce quality releases if we don't at some point
cut off the flow of new features into the source tree and work on
stabilizing the code we've already got, and I believe the point for
that was agreed by a large number of developers who sat in a room at
PGCon last year to be on or about February 15th.  We ended up
extending that by a couple of weeks, to make sure that we had a
process that was FAIR: we didn't want patches that had been in the
pipeline for a very long time to get postponed to 9.2 because no
committer had had a chance to work on them yet.  However, we also
bumped MANY patches to 9.2 because they weren't in sufficiently good
shape soon enough.  If we accept this patch now because a bunch of
people say they really, really want it, isn't that unfair to the
people to whom we've already said "sorry, the deadline has passed"?

Of course, there is always going to be some gray area.  I argued for
committing the replication_timeout patch because I believe the fact
that we haven't got that feature is almost a bug - it interferes
significantly with the usability of replication in general, and it
will be an even more serious problem with sync rep, where a hung
standby connection will not only mean that nothing is replicating but
also that no write transactions can be processed at all.  However, you
could make the opposite argument - that it's really a new feature -
and therefore we ought not to commit it.  So far no one has taken that
position, but it's certainly a reasonable argument.  Likewise, there
is ongoing discussion on the collations thread about which of those
changes are necessary for this release, and which ones are things that
ought to be postponed to a future release.  I haven't gotten too
involved in those discussions because I don't really understand the
underlying issues, but I think that's an important discussion.

I'm very excited about the new options, especially recv. But I agree with
Robert and Heikki because what the patch provides looks like a new
feature rather than a bug fix. And I think that we still need some
discussion of the design: how far must transactions wait for sync
rep in recv mode? In the patch, they wait for WAL to be written on
the standby, but I think that they should wait until the walreceiver has
received WAL instead. That would improve the performance of sync
rep. Anyway, I don't think now is the time to discuss such design
questions, except for bug fixes.

I like those additional options, but I believe that the sync rep we
worked out is still useful without them.

Replication timeout looks like a bug fix rather than a new feature.
Without it, a walsender might unexpectedly linger for a while when
the standby crashes or a network outage happens. As Robert said,
sync rep can get stuck for a long while because of such a lingering
walsender. What's worse is that when hot_standby_feedback is
enabled, such a lingering walsender would prevent the oldest xmin from
advancing and interfere with vacuuming on the master.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#35)
Re: Additional options for Sync Replication

On Wed, Mar 30, 2011 at 5:29 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I'm very excited about the new options, especially recv. But I agree with
Robert and Heikki because what the patch provides looks like a new
feature rather than a bug fix. And I think that we still need some
discussion of the design: how far must transactions wait for sync
rep in recv mode? In the patch, they wait for WAL to be written on
the standby, but I think that they should wait until the walreceiver has
received WAL instead. That would improve the performance of sync
rep. Anyway, I don't think now is the time to discuss such design
questions, except for bug fixes.

Not waiting for write would just be much less safe and would not have
any purpose as a sync rep option.

The difference in time would be very marginal also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services