Patch for fail-back without fresh backup
Hello,
We have already started a discussion on pgsql-hackers about the problem of
taking a fresh backup during the failback operation; here is the link to that
thread:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
When the master fails, the last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing the WAL to local storage. So the master contains some
file-system-level changes that the standby does not have, and at that point
the master's data directory is ahead of the standby's.
Subsequently, the standby is promoted as the new master. Later, when the old
master wants to become a standby of the new master, it can't simply rejoin
the setup, since the two servers are inconsistent; we need to take a fresh
backup from the new master. This can happen with both synchronous and
asynchronous replication.
A fresh backup is also needed in the case of a clean switchover, because in
the current HEAD the master does not wait for the standby to receive all WAL
up to the shutdown checkpoint record before shutting down the connection.
Fujii Masao has already submitted a patch to handle the clean switchover
case, but the problem remains for the failback case.
Taking a fresh backup is very time consuming when databases are very big,
say several TB, and when the servers are connected over a relatively slow
link. This would break the service level agreement of the disaster recovery
system, so there is a need to improve the disaster recovery process in
PostgreSQL. One way to achieve this is to maintain consistency between
master and standby, which avoids the need for a fresh backup.
So our proposal is that the master must not make any file-system-level
change without confirming that the corresponding WAL record has been
replicated to the standby.
There were many suggestions and objections on pgsql-hackers about this
problem. A brief summary:
1. The main objection, raised by Tom and others, was that we should not add
this feature and should go with the traditional way of taking a fresh backup
using rsync, because of the additional complexity of the patch and the
performance overhead during normal operations.
2. Tom and others were also worried about inconsistencies in the crashed
master and suggested that it is better to start with a fresh backup. Fujii
Masao and others countered that we already trust WAL recovery to clear all
such inconsistencies, and there is no reason why we can't do the same here.
3. Someone suggested using rsync with checksums, but many pages on the two
servers may differ in their binary values because of hint bits etc.
4. The major objection to the failback-without-fresh-backup idea was that it
may introduce performance overhead and add complexity to the code. Looking at
the patch, I must say it is not too complex. As for performance impact, I
tested the patch with pgbench, which shows a very small overhead. Please
refer to the test results included at the end of this mail.
*Proposal to solve the problem*
The proposal is based on the principle that the master must not make any
file-system-level change until the corresponding WAL record has been
replicated to the standby.
There are many places in the code that need to be handled to support the
proposed solution. The following cases explain why a fresh backup is needed
at failover time, and how our approach avoids that need.
1. We must not write any heap pages to disk before the WAL records
corresponding to those changes are received by the standby. Otherwise, if the
standby fails to receive the WAL corresponding to those heap pages, there
will be an inconsistency.
2. When a CHECKPOINT happens on the master, the master's control file is
updated with the last checkpoint record. Suppose failover happens and the
standby fails to receive the WAL record corresponding to the CHECKPOINT;
then master and standby have inconsistent copies of the control file, which
leads to a mismatch in the redo record, and recovery will not start
normally. To avoid this situation we must not update the master's control
file before the corresponding checkpoint WAL record is received by the
standby.
3. Similarly, when we truncate a physical file on the master and the standby
fails to receive the corresponding WAL, the file is truncated on the master
but still present on the standby, causing an inconsistency. To avoid this, we
must not truncate physical files on the master before the WAL record
corresponding to that operation is received by the standby.
4. The same applies to CLOG pages. If a CLOG page is written to disk and the
corresponding WAL record is not replicated to the standby, that leads to an
inconsistency. So we must not write CLOG pages (and possibly other SLRU
pages too) to disk before the corresponding WAL records are received by the
standby.
5. The same problem applies to commit hint bits, but it is more complicated
than the other cases because no WAL records are generated for them, so we
cannot simply wait for a corresponding WAL record to be replicated to the
standby. Instead, we delay setting the commit hint bits, similar to what is
done for asynchronous commits: we check whether the WAL corresponding to the
transaction commit has been received by the failback safe standby, and only
then allow hint bit updates.
*Patch explanation:*
The initial work on this patch was done by Pavan Deolasee. I tested it and
will make further enhancements based on the community feedback.
This patch is not complete yet, but I plan to complete it with the help of
this community. At this point, the primary purpose is to understand the
complexities and get some initial performance numbers to alleviate some of
the concerns raised by the community.
There are two GUC parameters that control this failsafe standby:
1. failback_safe_standby_name [ name of the failsafe standby ]: the master
will not make any file-system-level change before the corresponding WAL is
replicated to this failsafe standby.
2. failback_safe_standby_mode [ off/remote_write/remote_flush ]: specifies
whether the master should wait for the WAL to be written on the standby or
flushed on the standby. Set it to off when no failsafe standby is wanted.
This failsafe mode can be combined with both synchronous and asynchronous
streaming replication.
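Assuming the GUC names above, a minimal failsafe setup in postgresql.conf on
the master might look like this (the standby name is illustrative):

```
# 'standby1' must match the application_name used by the failsafe standby
failback_safe_standby_name = 'standby1'
failback_safe_standby_mode = remote_flush   # off | remote_write | remote_flush
```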
Most of the changes are in syncrep.c. This is a slight misnomer, because
that file deals with synchronous standbys and a failback safe standby could,
and most likely would, be an asynchronous standby. But keeping the changes
this way has kept the patch easy to read. Once we have acceptance on the
approach, the patch can be reorganized in a more logical way.
The patch adds a new state SYNC_REP_WAITING_FOR_FAILBACK_SAFETY to the sync
standby states. A backend that is waiting for a failback safe standby to
receive WAL records will wait in this state. The failback safe mechanism can
work in two different modes: wait for the WAL to be written on the failsafe
standby, or wait for it to be flushed there. These are represented by the
two new modes SYNC_REP_WAIT_FAILBACK_SAFE_WRITE and
SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH respectively.
Also, SyncRepWaitForLSN() has been changed to support a conditional wait, so
that we can delay hint bit updates on the master instead of blocking until
the failback safe standby has received the WAL.
*Benchmark tests*
*PostgreSQL version:* PostgreSQL 9.3beta1
*Usage:* To operate in failsafe mode you need to configure the following two
GUC parameters:
1. failback_safe_standby_name
2. failback_safe_standby_mode
*Performance impact:*
The tests were performed on servers with 32 GB RAM, with checkpoint_timeout
set to 10 minutes so that checkpoints occur during each 30-minute run. A
checkpoint flushes all dirty blocks to disk, and that is the code path we
primarily wanted to test.
pgbench settings:
Transaction type: TPC-B
Scaling factor: 100
Query mode: simple
Number of clients: 100
Number of threads: 1
Duration: 1800 s
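The settings above correspond roughly to the following pgbench invocation
(the exact command line is not in the mail, so this is a reconstruction;
"bench" is a placeholder database name):

```
pgbench -i -s 100 bench                       # initialize at scale factor 100
pgbench -c 100 -j 1 -T 1800 -M simple bench   # default TPC-B-like workload
```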
The following list shows the average TPS measured for each scenario. We
conducted three runs for each scenario:
1) Synchronous Replication - 947 tps
2) Synchronous Replication + Failsafe standby (off) - 934 tps
3) Synchronous Replication + Failsafe standby (remote_flush) - 931 tps
4) Asynchronous Replication - 1369 tps
5) Asynchronous Replication + Failsafe standby (off) - 1349 tps
6) Asynchronous Replication + Failsafe standby (remote_flush) - 1350 tps
From these numbers we can conclude the following:
1. Streaming replication + failback safe:
   a) On average, synchronous replication combined with a failsafe standby
   (remote_flush) causes a 1.68% performance overhead.
   b) On average, asynchronous streaming replication combined with a
   failsafe standby (remote_flush) causes a 1.38% performance degradation.
2. Streaming replication + failback safe (turned off):
   a) On average, synchronous replication combined with a failsafe standby
   (off) causes a 1.37% performance overhead.
   b) On average, asynchronous streaming replication combined with a
   failsafe standby (off) causes a 1.46% performance degradation.
So the patch shows a 1-2% performance overhead.
Please give your suggestions if there is a need to test other scenarios.
*Improvements (To-do):*
1. Currently the patch supports only one failback safe standby, which can be
either a synchronous or an asynchronous standby. We probably need to discuss
whether this should be extended to support multiple failsafe standbys.
2. The current design waits forever for the failback safe standby. Streaming
replication has the same limitation. We probably need to discuss whether
this needs to change.
There are a couple more places that probably need some attention; I have
marked them with XXX.
Thank you,
Samrat
Attachments:
failback_safe_standby.patch
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index cb95aa3..a5bfb9a 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -723,6 +723,8 @@ WriteTruncateXlogRec(int pageno)
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /* XXX Do we need wait for the failback safe standby ? */
}
/*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5a8f654..cb0ddea 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -700,6 +700,12 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /*
+ * Also wait for the failback safe standby to receive WAL up to
+ * max_lsn.
+ */
+ SyncRepWaitForLSN(max_lsn, true, true);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e975f8d..f7f95d7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1096,7 +1096,7 @@ EndPrepare(GlobalTransaction gxact)
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
- SyncRepWaitForLSN(gxact->prepare_lsn);
+ SyncRepWaitForLSN(gxact->prepare_lsn, false, true);
records.tail = records.head = NULL;
}
@@ -2063,7 +2063,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
/*
@@ -2143,5 +2143,5 @@ RecordTransactionAbortPrepared(TransactionId xid,
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 31e868d..359d9c9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1195,7 +1195,7 @@ RecordTransactionCommit(void)
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
- SyncRepWaitForLSN(XactLastRecEnd);
+ SyncRepWaitForLSN(XactLastRecEnd, false, true);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40b780c..69dda0d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7061,6 +7061,13 @@ CreateCheckPoint(int flags)
XLogFlush(recptr);
/*
+ * At this point, ensure that the failback safe standby has received the
+ * checkpoint WAL. Otherwise failure after the control file update will
+ * cause the master to start from a location not known to the standby
+ */
+ SyncRepWaitForLSN(recptr, true, true);
+
+ /*
* We mustn't write any new WAL after a shutdown checkpoint, or it will be
* overwritten at next startup. No-one should even try, this just allows
* sanity-checking. In the case of an end-of-recovery checkpoint, we want
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 971a149..2eea92d 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -288,6 +288,13 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * Also ensure that the WAL is received by the failback safe standby.
+ * Otherwise, we may have a situation where the heap is truncated, but
+ * the action never replayed on the standby
+ */
+ SyncRepWaitForLSN(lsn, true, true);
}
/* Do the real work */
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 5424281..727d107 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -59,17 +59,27 @@
/* User-settable parameters for sync rep */
char *SyncRepStandbyNames;
+/* User-settable parameter for failback safe standby */
+char *FailbackSafeStandbyName;
+
#define SyncStandbysDefined() \
(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
+#define FailbackSafeStandbyDefined() \
+ (FailbackSafeStandbyName != NULL && FailbackSafeStandbyName[0] != '\0')
+
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+static int FailbackSafeRepWaitMode = SYNC_REP_NO_WAIT;
+int failback_safety = FAILBACK_SAFETY_OFF;
+
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
static int SyncRepGetStandbyPriority(void);
+static bool SyncRepCheckIfFailbackSafe(void);
#ifdef USE_ASSERT_CHECKING
static bool SyncRepQueueIsOrderedByLSN(int mode);
@@ -82,28 +92,52 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
- * Wait for synchronous replication, if requested by user.
+ * Wait for synchronous/failback_safe replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
- * change that state to SYNC_REP_WAITING before adding ourselves
- * to the wait queue. During SyncRepWakeQueue() a WALSender changes
- * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
- * This backend then resets its state to SYNC_REP_NOT_WAITING.
+ * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_FAILBACK_SAFETY
+ * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
+ * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
+ * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
+ *
+ * ForFailbackSafety - if TRUE, we wait for the failback safe standby.
+ * Otherwise wait for the sync standby
+ *
+ * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
+ * the standby has already made progress up to the given XactCommitLSN
+ *
+ * Return TRUE if either the sync/failback_safe standby is not
+ * configured/turned off OR the standby has made enough progress
*/
-void
-SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+bool
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForFailbackSafety, bool Wait)
{
char *new_status = NULL;
const char *old_status;
- int mode = SyncRepWaitMode;
+ int mode = !ForFailbackSafety ? SyncRepWaitMode : FailbackSafeRepWaitMode;
+ bool ret;
/*
* Fast exit if user has not requested sync replication, or there are no
* sync replication standby names defined. Note that those standbys don't
* need to be connected.
*/
- if (!SyncRepRequested() || !SyncStandbysDefined())
- return;
+ if ((!SyncRepRequested() || !SyncStandbysDefined()) &&
+ !ForFailbackSafety)
+ return true;
+
+ /*
+ * If the caller has specified ForFailbackSafety, but failback_safe_standby
+ * is not specified or its turned off, exit.
+ *
+ * We would like to allow the failback safe mechanism even for cascaded
+ * standbys as well. But we can't really wait for the standby to catch
+ * up until we reach a consistent state since the standbys won't be
+ * even able to connect without us reaching in that state (XXX Confirm)
+ */
+ if ((!FailbackSafeRepRequested() || !FailbackSafeStandbyDefined() ||
+ !reachedConsistency) && ForFailbackSafety)
+ return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
@@ -119,19 +153,28 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
*/
- if (!WalSndCtl->sync_standbys_defined ||
+ if ((!ForFailbackSafety && !WalSndCtl->sync_standbys_defined) ||
+ (ForFailbackSafety && !WalSndCtl->failback_safe_standby_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
- return;
+ return true;
}
/*
+ * Exit if we are told not to block on the standby.
+ */
+ if (!Wait)
+ {
+ LWLockRelease(SyncRepLock);
+ return false;
+ }
+
+ /*
* Set our waitLSN so WALSender will know when to wake us, and add
* ourselves to the queue.
*/
MyProc->waitLSN = XactCommitLSN;
- MyProc->syncRepState = SYNC_REP_WAITING;
+ MyProc->syncRepState = !ForFailbackSafety ?
+ SYNC_REP_WAITING :
+ SYNC_REP_WAITING_FOR_FAILBACK_SAFETY;
SyncRepQueueInsert(mode);
Assert(SyncRepQueueIsOrderedByLSN(mode));
LWLockRelease(SyncRepLock);
@@ -150,6 +193,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
/*
* Wait for specified LSN to be confirmed.
*
@@ -179,14 +223,18 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* contained memory barriers.
*/
syncRepState = MyProc->syncRepState;
- if (syncRepState == SYNC_REP_WAITING)
+ if (syncRepState == SYNC_REP_WAITING ||
+ syncRepState == SYNC_REP_WAITING_FOR_FAILBACK_SAFETY)
{
LWLockAcquire(SyncRepLock, LW_SHARED);
syncRepState = MyProc->syncRepState;
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
@@ -263,6 +311,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
@@ -339,7 +389,7 @@ void
SyncRepInitConfig(void)
{
int priority;
-
+ bool is_failback_safe;
/*
* Determine if we are a potential sync standby and remember the result
* for handling replies from standby.
@@ -354,6 +404,18 @@ SyncRepInitConfig(void)
(errmsg("standby \"%s\" now has synchronous standby priority %u",
application_name, priority)));
}
+
+ is_failback_safe = SyncRepCheckIfFailbackSafe();
+ if (MyWalSnd->is_failback_safe != is_failback_safe)
+ {
+ LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ MyWalSnd->is_failback_safe = is_failback_safe;
+ LWLockRelease(SyncRepLock);
+ ereport(DEBUG1,
+ (errmsg("standby \"%s\" is a failback safe standby",
+ application_name)));
+
+ }
}
/*
@@ -368,8 +430,11 @@ SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
volatile WalSnd *syncWalSnd = NULL;
+ volatile WalSnd *failbackSafeWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numwrite_fbs = 0;
+ int numflush_fbs = 0;
int priority = 0;
int i;
@@ -379,7 +444,7 @@ SyncRepReleaseWaiters(void)
* up, still running base backup or the current flush position is still
* invalid, then leave quickly also.
*/
- if (MyWalSnd->sync_standby_priority == 0 ||
+ if ((MyWalSnd->sync_standby_priority == 0 && !MyWalSnd->is_failback_safe) ||
MyWalSnd->state < WALSNDSTATE_STREAMING ||
XLogRecPtrIsInvalid(MyWalSnd->flush))
return;
@@ -398,58 +463,90 @@ SyncRepReleaseWaiters(void)
volatile WalSnd *walsnd = &walsndctl->walsnds[i];
if (walsnd->pid != 0 &&
- walsnd->state == WALSNDSTATE_STREAMING &&
- walsnd->sync_standby_priority > 0 &&
- (priority == 0 ||
- priority > walsnd->sync_standby_priority) &&
- !XLogRecPtrIsInvalid(walsnd->flush))
+ walsnd->state == WALSNDSTATE_STREAMING)
{
- priority = walsnd->sync_standby_priority;
- syncWalSnd = walsnd;
+ if (walsnd->sync_standby_priority > 0 &&
+ (priority == 0 ||
+ priority > walsnd->sync_standby_priority) &&
+ !XLogRecPtrIsInvalid(walsnd->flush))
+ {
+ priority = walsnd->sync_standby_priority;
+ syncWalSnd = walsnd;
+ }
+
+ if (walsnd->is_failback_safe)
+ failbackSafeWalSnd = walsnd;
}
}
/*
* We should have found ourselves at least.
*/
- Assert(syncWalSnd);
+ Assert(syncWalSnd || failbackSafeWalSnd);
/*
* If we aren't managing the highest priority standby then just leave.
*/
if (syncWalSnd != MyWalSnd)
{
- LWLockRelease(SyncRepLock);
announce_next_takeover = true;
- return;
+ if (!failbackSafeWalSnd)
+ {
+ LWLockRelease(SyncRepLock);
+ return;
+ }
}
-
/*
* Set the lsn first so that when we wake backends they will release up to
* this location.
*/
- if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
+ if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write &&
+ syncWalSnd == MyWalSnd)
{
walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
- if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
+ if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush &&
+ syncWalSnd == MyWalSnd)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
+ if (walsndctl->lsn[SYNC_REP_WAIT_FAILBACK_SAFE_WRITE] < MyWalSnd->write &&
+ failbackSafeWalSnd == MyWalSnd)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_FAILBACK_SAFE_WRITE] = MyWalSnd->write;
+ numwrite_fbs = SyncRepWakeQueue(false, SYNC_REP_WAIT_FAILBACK_SAFE_WRITE);
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH] < MyWalSnd->flush &&
+ failbackSafeWalSnd == MyWalSnd)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH] = MyWalSnd->flush;
+ numflush_fbs = SyncRepWakeQueue(false, SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH);
+ }
LWLockRelease(SyncRepLock);
- elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
- numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
- numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+ if (syncWalSnd == MyWalSnd)
+ {
+ elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
+ numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
+ numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+ }
+
+ if (failbackSafeWalSnd == MyWalSnd)
+ {
+ elog(DEBUG3, "released %d procs up to write for failback safety %X/%X,"
+ " %d procs up to flush for failback safety %X/%X",
+ numwrite_fbs, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
+ numflush_fbs, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+ }
/*
* If we are managing the highest priority standby, though we weren't
* prior to this, then announce we are now the sync standby.
*/
- if (announce_next_takeover)
+ if ((announce_next_takeover) && (syncWalSnd == MyWalSnd))
{
announce_next_takeover = false;
ereport(LOG,
@@ -515,6 +612,22 @@ SyncRepGetStandbyPriority(void)
return (found ? priority : 0);
}
+
+/*
+ * Check if we are a failback safe standby
+ *
+ * Compare the parameter FailbackSafeStandbyName against the application_name
+ * for this WALSender, or allow any name if we find a wildcard "*".
+ */
+static bool
+SyncRepCheckIfFailbackSafe(void)
+{
+ if (pg_strcasecmp(FailbackSafeStandbyName, application_name) == 0 ||
+ pg_strcasecmp(FailbackSafeStandbyName, "*") == 0)
+ return true;
+ else
+ return false;
+}
/*
* Walk the specified queue from head. Set the state of any backends that
* need to be woken, remove them from the queue, and then wake them.
@@ -588,8 +701,10 @@ void
SyncRepUpdateSyncStandbysDefined(void)
{
bool sync_standbys_defined = SyncStandbysDefined();
+ bool failback_safe_standby_defined = FailbackSafeStandbyDefined();
- if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+ if ((sync_standbys_defined != WalSndCtl->sync_standbys_defined) ||
+ (failback_safe_standby_defined != WalSndCtl->failback_safe_standby_defined))
{
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
@@ -600,10 +715,14 @@ SyncRepUpdateSyncStandbysDefined(void)
*/
if (!sync_standbys_defined)
{
- int i;
+ SyncRepWakeQueue(true, SYNC_REP_WAIT_WRITE);
+ SyncRepWakeQueue(true, SYNC_REP_WAIT_FLUSH);
+ }
- for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
- SyncRepWakeQueue(true, i);
+ if (!failback_safe_standby_defined)
+ {
+ SyncRepWakeQueue(true, SYNC_REP_WAIT_FAILBACK_SAFE_WRITE);
+ SyncRepWakeQueue(true, SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH);
}
/*
@@ -614,6 +733,7 @@ SyncRepUpdateSyncStandbysDefined(void)
* the queue (and never wake up). This prevents that.
*/
WalSndCtl->sync_standbys_defined = sync_standbys_defined;
+ WalSndCtl->failback_safe_standby_defined = failback_safe_standby_defined;
LWLockRelease(SyncRepLock);
}
@@ -709,3 +829,20 @@ assign_synchronous_commit(int newval, void *extra)
break;
}
}
+
+void
+assign_failback_safety(int newval, void *extra)
+{
+ switch (newval)
+ {
+ case FAILBACK_SAFETY_REMOTE_WRITE:
+ FailbackSafeRepWaitMode = SYNC_REP_WAIT_FAILBACK_SAFE_WRITE;
+ break;
+ case FAILBACK_SAFETY_REMOTE_FLUSH:
+ FailbackSafeRepWaitMode = SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH;
+ break;
+ default:
+ FailbackSafeRepWaitMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 43eb7d5..8fa4aed 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -41,6 +41,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -1978,6 +1979,13 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
XLogFlush(recptr);
/*
+ * If failback safe standby is defined, also ensure that the WAL is
+ * received by the standby before we write to the disk
+ */
+ if (buf->flags & BM_PERMANENT)
+ SyncRepWaitForLSN(recptr, true, true);
+
+ /*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
* we have the io_in_progress lock.
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 2c7d9f3..be961fd 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -722,6 +722,11 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* As always, WAL must hit the disk before the data update does */
XLogFlush(lsn);
+
+ /*
+ * XXX Should we also wait for the failback safe standby to receive the
+ * WAL ?
+ */
}
errno = 0;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..612a6a2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -379,6 +379,20 @@ static const struct config_enum_entry synchronous_commit_options[] = {
};
/*
+ * Although only "off", "remote_write", and "remote_flush" are documented, we
+ * accept all the likely variants of "off".
+ */
+static const struct config_enum_entry failback_safety_options[] = {
+ {"remote_write", FAILBACK_SAFETY_REMOTE_WRITE, false},
+ {"remote_flush", FAILBACK_SAFETY_REMOTE_FLUSH, false},
+ {"off", FAILBACK_SAFETY_OFF, false},
+ {"false", FAILBACK_SAFETY_OFF, true},
+ {"no", FAILBACK_SAFETY_OFF, true},
+ {"0", FAILBACK_SAFETY_OFF, true},
+ {NULL, 0, false}
+};
+
+/*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
@@ -3067,6 +3081,16 @@ static struct config_string ConfigureNamesString[] =
},
{
+ {"failback_safe_standby_name", PGC_SIGHUP, REPLICATION_MASTER,
+ gettext_noop("Name of potential failback safe standby."),
+ NULL
+ },
+ &FailbackSafeStandbyName,
+ "",
+ NULL, NULL, NULL
+ },
+
+ {
{"default_text_search_config", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets default text search configuration."),
NULL
@@ -3252,6 +3276,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"failback_safe_standby_mode", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Turns failback safety on/off and sets the level"),
+ NULL
+ },
+ &failback_safety,
+ FAILBACK_SAFETY_OFF, failback_safety_options,
+ NULL, assign_failback_safety, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..c1def9e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -212,6 +212,11 @@
#wal_keep_segments = 0 # in logfile segments, 16MB each; 0 disables
#wal_sender_timeout = 60s # in milliseconds; 0 disables
+#failback_safe_standby_mode = off # failback safety level
+ # off, remote_write or remote_flush
+#failback_safe_standby_name = '' # standby server that is guaranteed to be
+ # failback safe
+
# - Master Server -
# These settings are ignored on a standby server.
@@ -219,6 +224,7 @@
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index ab4020a..14ccfe5 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -62,6 +62,7 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
@@ -118,6 +119,15 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If failback safe standby is configured, we should also check
+ * if the commit WAL record has made to the standby before allowing
+ * hint bit updates. We should not wait for the standby to receive the
+ * WAL since its OK to delay hint bit updates
+ */
+ if (!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
}
tuple->t_infomask |= infomask;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ac23ea6..35a5212 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -19,23 +19,41 @@
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+#define FailbackSafeRepRequested() \
+ (max_wal_senders > 0 && failback_safety > FAILBACK_SAFETY_OFF)
+
/* SyncRepWaitMode */
-#define SYNC_REP_NO_WAIT -1
-#define SYNC_REP_WAIT_WRITE 0
-#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_NO_WAIT -1
+#define SYNC_REP_WAIT_WRITE 0
+#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_WAIT_FAILBACK_SAFE_WRITE 2
+#define SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH 3
-#define NUM_SYNC_REP_WAIT_MODE 2
+#define NUM_SYNC_REP_WAIT_MODE 4
/* syncRepState */
-#define SYNC_REP_NOT_WAITING 0
-#define SYNC_REP_WAITING 1
-#define SYNC_REP_WAIT_COMPLETE 2
+#define SYNC_REP_NOT_WAITING 0
+#define SYNC_REP_WAITING 1
+#define SYNC_REP_WAITING_FOR_FAILBACK_SAFETY 2
+#define SYNC_REP_WAIT_COMPLETE 3
+
+typedef enum
+{
+ FAILBACK_SAFETY_OFF, /* no failback safety */
+ FAILBACK_SAFETY_REMOTE_WRITE, /* wait for remote write only */
+ FAILBACK_SAFETY_REMOTE_FLUSH /* wait for remote flush */
+} FailbackSafetyLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+/* user-settable parameters for failback safe replication */
+extern char *FailbackSafeStandbyName;
+extern int failback_safety;
+
/* called by user backend */
-extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
+ bool ForFailbackSafety, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
@@ -52,5 +70,6 @@ extern int SyncRepWakeQueue(bool all, int mode);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+extern void assign_failback_safety(int newval, void *extra);
#endif /* _SYNCREP_H */
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7eaa21b..0142c0f 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -62,6 +62,9 @@ typedef struct WalSnd
* SyncRepLock.
*/
int sync_standby_priority;
+
+ /* Track if we are serving a failback safe standby */
+ bool is_failback_safe;
} WalSnd;
extern WalSnd *MyWalSnd;
@@ -88,6 +91,13 @@ typedef struct
*/
bool sync_standbys_defined;
+ /*
+ * Is any failback safe standby defined? Waiting backends can't reload the
+ * config file safely, so checkpointer updates this value as needed.
+ * Protected by SyncRepLock.
+ */
+ bool failback_safe_standby_defined;
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
On Fri, Jun 14, 2013 at 10:11 AM, Samrat Revagade <revagade.samrat@gmail.com> wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation; here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage. So master contains some file
system level changes that standby does not have. At this point, the data
directory of master is ahead of standby's data directory.
Subsequently, the standby will be promoted as new master. Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need to
take the fresh backup from the new master. This can happen in both the
synchronous as well as asynchronous replication.
Fresh backup is also needed in case of clean switch-over because in the
current HEAD, the master does not wait for the standby to receive all the
WAL up to the shutdown checkpoint record before shutting down the
connection. Fujii Masao has already submitted a patch to handle clean
switch-over case, but the problem is still remaining for failback case.
The process of taking fresh backup is very time consuming when databases
are of very big sizes, say several TB's, and when the servers are connected
over a relatively slower link. This would break the service level
agreement of disaster recovery system. So there is need to improve the
process of disaster recovery in PostgreSQL. One way to achieve this is to
maintain consistency between master and standby which helps to avoid need
of fresh backup.
So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the
corresponding WAL record is replicated to the standby.
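The rule proposed above can be sketched as a simple LSN comparison (a minimal illustration in C, not PostgreSQL code; the function and its arguments are invented names for this sketch — the actual patch extends SyncRepWaitForLSN()):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical sketch: a file-system-level change (e.g. persisting a
 * hint-bit update) is failback safe only once the standby has flushed
 * WAL at least up to the change's LSN. */
static int change_is_failback_safe(XLogRecPtr change_lsn,
                                   XLogRecPtr standby_flush_lsn)
{
    /* No need to block: a hint-bit update can simply be skipped now
     * and retried later, once the standby has caught up. */
    return change_lsn <= standby_flush_lsn;
}
```

The non-blocking character matters: unlike a commit, a skipped hint-bit update costs nothing but a later recheck.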
An alternative proposal (which will probably just reveal my lack of
understanding about what is or isn't possible with WAL). Provide a way to
restart the master so that it rolls back the WAL changes that the slave
hasn't seen.
There are many suggestions and objections on pgsql-hackers about this problem.
The brief summary is as follows:
That will not happen if there is inconsistency between the two servers.
Please refer to the discussions on the link provided in the first post:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Regards,
Samrat Revagade
On 14.06.2013 12:11, Samrat Revagade wrote:
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation; here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage. So master contains some file
system level changes that standby does not have. At this point, the data
directory of master is ahead of standby's data directory.
Subsequently, the standby will be promoted as new master. Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need to
take the fresh backup from the new master. This can happen in both the
synchronous as well as asynchronous replication.
Did you see the thread on the little tool I wrote called pg_rewind?
/messages/by-id/519DF910.4020609@vmware.com
It solves that problem, for both clean and unexpected shutdown. It needs
some more work and a lot more testing, but requires no changes to the
backend. Robert Haas pointed out in that thread that it has a problem
with hint bits that are not WAL-logged, but it will still work if you
also enable the new checksums feature, which forces hint bit updates to
be WAL-logged. Perhaps we could add a GUC to enable hint bits to be
WAL-logged, regardless of checksums, to make pg_rewind work.
I think that's a more flexible approach to solve this problem. It
doesn't require an online feedback loop from the standby to master, for
starters.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Robert Haas pointed out in that thread that it has a problem with hint
bits that are not WAL-logged,
I liked that tool a lot until Robert pointed out the above problem. I
thought this is a show stopper because I can't really see any way to
circumvent it unless we enable checksums or explicitly WAL log hint bits.
but it will still work if you also enable the new checksums feature, which
forces hint bit updates to be WAL-logged.
Are we expecting a lot of people to run their clusters with checksums on?
Sorry, I haven't followed the checksum discussions and don't know how much
overhead it causes. But if the general expectation is that checksums will
be turned on most often, I agree pg_rewind is probably good enough.
Perhaps we could add a GUC to enable hint bits to be WAL-logged,
regardless of checksums, to make pg_rewind work.
Wouldn't that be too costly? I mean, in the worst case every hint bit on a
page may get updated separately. If each such update is WAL logged, we are
looking at a lot more unnecessary WAL traffic.
I think that's a more flexible approach to solve this problem. It doesn't
require an online feedback loop from the standby to master, for starters.
I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
with 9.3 and below because of the hint bits issue, otherwise it would have
been even more cool.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Fri, Jun 14, 2013 at 2:51 PM, Benedikt Grundmann <bgrundmann@janestreet.com> wrote:
An alternative proposal (which will probably just reveal my lack of
understanding about what is or isn't possible with WAL). Provide a way to
restart the master so that it rolls back the WAL changes that the slave
hasn't seen.
WAL records in PostgreSQL can only be used for physical redo. They cannot
be used for undo. So what you're suggesting is not possible, though I am
sure a few other databases do that.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 14.06.2013 14:06, Pavan Deolasee wrote:
On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Robert Haas pointed out in that thread that it has a problem with hint
bits that are not WAL-logged,
I liked that tool a lot until Robert pointed out the above problem. I
thought this is a show stopper because I can't really see any way to
circumvent it unless we enable checksums or explicitly WAL log hint bits.
but it will still work if you also enable the new checksums feature, which
forces hint bit updates to be WAL-logged.
Are we expecting a lot of people to run their clusters with checksums on?
Sorry, I haven't followed the checksum discussions and don't know how much
overhead it causes. But if the general expectation is that checksums will
be turned on most often, I agree pg_rewind is probably good enough.
Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.
Perhaps we could add a GUC to enable hint bits to be WAL-logged,
regardless of checksums, to make pg_rewind work.
Wouldn't that be too costly? I mean, in the worst case every hint bit on a
page may get updated separately. If each such update is WAL logged, we are
looking at a lot more unnecessary WAL traffic.
Yep, same as with checksums. I was not very enthusiastic about the
checksums patch because of that, but a lot of people are willing to pay
that price. Maybe we can figure out a way to reduce that cost in 9.4.
It'd benefit the checksums greatly.
For pg_rewind, we wouldn't actually need a full-page image for hint bit
updates, just a small record saying "hey, I touched this page". And
you'd only need to write that the first time a page is touched after a
checkpoint.
I think that's a more flexible approach to solve this problem. It doesn't
require an online feedback loop from the standby to master, for starters.
I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
with 9.3 and below because of the hint bits issue, otherwise it would have
been even more cool.
The proposed patch is clearly not 9.3 material either. If anything,
there's a much better chance that we could still sneak in a GUC to allow
hint bits to be WAL-logged without checksums in 9.3. All the code is
there, it'd just be a new GUC to control it separately from checksums.
- Heikki
On Fri, Jun 14, 2013 at 12:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
For pg_rewind, we wouldn't actually need a full-page image for hint bit
updates, just a small record saying "hey, I touched this page". And you'd
only need to write that the first time a page is touched after a checkpoint.
I would expect that to be about the same cost though. The latency for
the fsync on the wal record before being able to flush the buffer is
the biggest cost.
The proposed patch is clearly not 9.3 material either. If anything, there's
a much better chance that we could still sneak in a GUC to allow hint bits
to be WAL-logged without checksums in 9.3. All the code is there, it'd just
be a new GUC to control it separately from checksums.
On the other hand if you're going to wal log the hint bits why not
enable checksums?
Do we allow turning off checksums after a database is initdb'd? IIRC
we can't turn them on later, but I don't see why we couldn't turn them
off.
--
greg
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.
Refresh my memory as to why we need to WAL-log hints for checksumming?
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page. Given that we're
paying that cost, I don't see why we'd need to do any extra WAL-logging
(above and beyond the log-when-freeze cost that we have to pay already).
But I've not absorbed any caffeine yet today, so maybe I'm just missing
it.
regards, tom lane
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage. So master contains some file
system level changes that standby does not have. At this point, the data
directory of master is ahead of standby's data directory.
Subsequently, the standby will be promoted as new master. Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need
to take the fresh backup from the new master. This can happen in both the
synchronous as well as asynchronous replication.
Fresh backup is also needed in case of clean switch-over because in the
current HEAD, the master does not wait for the standby to receive all the
WAL
up to the shutdown checkpoint record before shutting down the connection.
Fujii Masao has already submitted a patch to handle clean switch-over case,
but the problem is still remaining for failback case.
The process of taking fresh backup is very time consuming when databases
are of very big sizes, say several TB's, and when the servers are connected
over a relatively slower link. This would break the service level
agreement of disaster recovery system. So there is need to improve the
process of
disaster recovery in PostgreSQL. One way to achieve this is to maintain
consistency between master and standby which helps to avoid need of fresh
backup.
So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the
corresponding WAL record is replicated to the standby.
How will you take care of extra WAL on the old master during recovery? If it
replays WAL which has not reached the new master, it can be a problem.
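Amit's concern can be pictured with a toy divergence check (an illustrative sketch with an invented WAL representation; locating the point where the two histories diverge is essentially the problem pg_rewind tackles):

```c
#include <assert.h>
#include <stdint.h>

/* Compare two WAL streams record by record and return the index of the
 * first record they disagree on; everything from there on the old
 * master is "extra" WAL that must not be replayed after rejoining. */
static int divergence_point(const uint64_t *a, int alen,
                            const uint64_t *b, int blen)
{
    int i = 0;
    while (i < alen && i < blen && a[i] == b[i])
        i++;
    return i;
}

/* Old master flushed five records locally; the standby (now the new
 * master) only ever received the first three. */
static int extra_records_on_old_master(void)
{
    uint64_t old_wal[] = {1, 2, 3, 4, 5};
    uint64_t new_wal[] = {1, 2, 3};
    return 5 - divergence_point(old_wal, 5, new_wal, 3);
}
```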
With Regards,
Amit Kapila.
On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.
Refresh my memory as to why we need to WAL-log hints for checksumming?
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.
I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (the one aborting WAL
replay too early...). So I am very happy that you are looking at it.
Jeff Davis and I were talking at PGCon about whether the usage of
PGXACT->delayChkpt makes the whole thing sufficiently safe - we
couldn't find any real danger, but...
Given that we're
paying that cost, I don't see why we'd need to do any extra WAL-logging
(above and beyond the log-when-freeze cost that we have to pay already).
But I've not absorbed any caffeine yet today, so maybe I'm just missing
it.
The usual torn page spiel I think. If we crash while only one half of
the page made it to disk we would get spurious checksum failures from
there on.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 14.06.2013 16:08, Tom Lane wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> writes:
Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.
Refresh my memory as to why we need to WAL-log hints for checksumming?
Torn pages:
1. Backend sets a hint bit, dirtying the buffer.
2. Checksum is calculated, and buffer is written out to disk.
3. <crash>
If the page is torn, the checksum won't match. Without checksums, a torn
page is not a problem with hint bits, as a single bit can't be torn and
the page is otherwise intact. But with checksums, it causes a checksum
failure.
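The failure mode in steps 1-3 can be reproduced with a toy page and checksum (a simplified sketch: the 512-byte sector granularity and the additive checksum are illustrative assumptions, and PostgreSQL's real page checksum is a different algorithm):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192
#define SECTOR    512   /* disks are atomic per sector, not per 8K page */

/* Toy checksum over everything after the 2-byte checksum field itself;
 * stands in for PostgreSQL's real page checksum algorithm. */
static uint16_t page_checksum(const uint8_t *page)
{
    uint32_t sum = 0;
    for (int i = 2; i < PAGE_SIZE; i++)
        sum = (sum * 31 + page[i]) & 0xFFFF;
    return (uint16_t) sum;
}

static void set_checksum(uint8_t *page)
{
    uint16_t sum = page_checksum(page);
    memcpy(page, &sum, 2);
}

static int verify_checksum(const uint8_t *page)
{
    uint16_t stored;
    memcpy(&stored, page, 2);
    return stored == page_checksum(page);
}

/* Step 1: hint bit set in a later sector. Step 2: checksum computed and
 * write started. Step 3: crash after only the first sector (the one
 * holding the checksum) reaches disk. */
static int torn_hint_write_detected(void)
{
    uint8_t old_page[PAGE_SIZE] = {0};
    uint8_t new_page[PAGE_SIZE];
    uint8_t on_disk[PAGE_SIZE];

    set_checksum(old_page);
    memcpy(on_disk, old_page, PAGE_SIZE);   /* clean page already on disk */

    memcpy(new_page, old_page, PAGE_SIZE);
    new_page[4000] |= 0x01;                 /* hint bit in a later sector */
    set_checksum(new_page);

    memcpy(on_disk, new_page, SECTOR);      /* torn write, then crash */

    /* New checksum, old hint-bit sector: verification now fails, even
     * though without checksums the page would be perfectly usable. */
    return !verify_checksum(on_disk);
}
```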
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 14.06.2013 16:08, Tom Lane wrote:
Refresh my memory as to why we need to WAL-log hints for checksumming?
Torn pages:
So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?
regards, tom lane
On 14.06.2013 16:21, Tom Lane wrote:
Heikki Linnakangas<hlinnakangas@vmware.com> writes:
On 14.06.2013 16:08, Tom Lane wrote:
Refresh my memory as to why we need to WAL-log hints for checksumming?
Torn pages:
So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?
Correct. We're doing the latter, see XLogSaveBufferForHint().
- Heikki
On 2013-06-14 09:21:52 -0400, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 14.06.2013 16:08, Tom Lane wrote:
Refresh my memory as to why we need to WAL-log hints for checksumming?
Torn pages:
So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?
MarkBufferDirtyHint() logs an FPI (just not via a BKP block) via
XLogSaveBufferForHint() iff XLogCheckBuffer() says we need to, by
comparing GetRedoRecPtr() with the page's LSN.
Otherwise we don't do anything besides marking the buffer dirty.
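The check Andres describes reduces to an LSN comparison (a schematic sketch; assume the real XLogCheckBuffer() also factors in whether full-page writes are enabled at all):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* A hint-bit change needs a fresh full-page image only if the page has
 * not been WAL-logged since the current checkpoint's redo pointer; any
 * later WAL record touching the page already carried an FPI. */
static int hint_needs_fpi(XLogRecPtr page_lsn, XLogRecPtr redo_ptr)
{
    return page_lsn <= redo_ptr;
}
```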
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 14.06.2013 16:15, Andres Freund wrote:
On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.
I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (the one aborting WAL
replay too early...). So I am very happy that you are looking at it.
Hmm. In XLogSaveBufferForHint():
* Note that this only works for buffers that fit the standard page model,
* i.e. those for which buffer_std == true
The free-space map uses non-standard pages, and calls MarkBufferDirtyHint().
Isn't that completely broken for the FSM? If I'm reading it correctly,
what will happen is that replay will completely zero out all FSM pages
that have been touched. All the FSM data is between pd_lower and
pd_upper, which on standard pages is the "hole".
- Heikki
On 2013-06-14 09:21:52 -0400, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 14.06.2013 16:08, Tom Lane wrote:
Refresh my memory as to why we need to WAL-log hints for checksumming?
Torn pages:
So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?
From quickly looking at the code again I think the MarkBufferDirtyHint()
code makes at least one assumption that isn't correct in the face of
checksums.
It tests for the need to dirty the page with:
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
*before* taking a lock. A comment explains why that is safe:
* Since we make this test unlocked, there's a chance we
* might fail to notice that the flags have just been cleared, and failed
* to reset them, due to memory-ordering issues.
That's fine for the classical usecase without checksums but what about
the following scenario:
1) page is dirtied, FPI is logged
2) SetHintBits gets called on the same page, holding only a share lock
3) checkpointer/bgwriter/... writes out the page, clearing the dirty
flag
4) checkpoint finishes, updates redo ptr
5) SetHintBits actually modifies the hint bits
6) SetHintBits calls MarkBufferDirtyHint which doesn't notice that the
page isn't dirty anymore and thus doesn't check whether something
needs to get logged.
At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 14.06.2013 17:01, Andres Freund wrote:
At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?
The code also relies on that being safe during recovery:
* If we're in recovery we cannot dirty a page because of a hint.
* We can set the hint, just not dirty the page as a result so the
* hint is lost when we evict the page or shutdown.
*
* See src/backend/storage/page/README for longer discussion.
*/
if (RecoveryInProgress())
return;
I can't immediately see a problem with that.
- Heikki
On 2013-06-14 16:58:38 +0300, Heikki Linnakangas wrote:
On 14.06.2013 16:15, Andres Freund wrote:
On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.
I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (the one aborting WAL
replay too early...). So I am very happy that you are looking at it.
Hmm. In XLogSaveBufferForHint():
* Note that this only works for buffers that fit the standard page model,
* i.e. those for which buffer_std == true
The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). Isn't
that completely broken for the FSM? If I'm reading it correctly, what will
happen is that replay will completely zero out all FSM pages that have been
touched. All the FSM data is between pd_lower and pd_upper, which on
standard pages is the "hole".
Jeff Davis has a patch pending
(1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
were on track committing that, but rereading the thread it doesn't look
that way.
Jeff, care to update that patch?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?
WAL logging a full page image after a checkpoint wouldn't actually be
enough since subsequent hint bits will dirty the page and not wal log
anything creating a new torn page risk. FPI are only useful if all the
subsequent updates are wal logged.
--
greg
Greg Stark <stark@mit.edu> writes:
On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?
WAL logging a full page image after a checkpoint wouldn't actually be
enough since subsequent hint bits will dirty the page and not wal log
anything creating a new torn page risk. FPI are only useful if all the
subsequent updates are wal logged.
No, there's no new torn page risk, because any crash recovery would
replay starting from the checkpoint. You might lose the
subsequently-set hint bits, but that's okay.
regards, tom lane
On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
Jeff Davis has a patch pending
(1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
were on track committing that, but rereading the thread it doesn't look
that way.
Jeff, care to update that patch?
Rebased and attached. Changed so all callers use buffer_std=true except
those in freespace.c and fsmpage.c.
Simon, did you (or anyone else) have an objection to this patch? If not,
I'll go ahead and commit it tomorrow morning.
Regards,
Jeff Davis
Attachments:
buffer-std-20130614.patch (text/x-patch)
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***************
*** 287,293 **** hashgettuple(PG_FUNCTION_ARGS)
/*
* Since this can be redone later if needed, mark as a hint.
*/
! MarkBufferDirtyHint(buf);
}
/*
--- 287,293 ----
/*
* Since this can be redone later if needed, mark as a hint.
*/
! MarkBufferDirtyHint(buf, true);
}
/*
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 262,268 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
{
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
PageClearFull(page);
! MarkBufferDirtyHint(buffer);
}
}
--- 262,268 ----
{
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
PageClearFull(page);
! MarkBufferDirtyHint(buffer, true);
}
}
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 413,421 **** _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* crucial. Be sure to mark the proper buffer dirty.
*/
if (nbuf != InvalidBuffer)
! MarkBufferDirtyHint(nbuf);
else
! MarkBufferDirtyHint(buf);
}
}
}
--- 413,421 ----
* crucial. Be sure to mark the proper buffer dirty.
*/
if (nbuf != InvalidBuffer)
! MarkBufferDirtyHint(nbuf, true);
else
! MarkBufferDirtyHint(buf, true);
}
}
}
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 1052,1058 **** restart:
opaque->btpo_cycleid == vstate->cycleid)
{
opaque->btpo_cycleid = 0;
! MarkBufferDirtyHint(buf);
}
}
--- 1052,1058 ----
opaque->btpo_cycleid == vstate->cycleid)
{
opaque->btpo_cycleid = 0;
! MarkBufferDirtyHint(buf, true);
}
}
*** a/src/backend/access/nbtree/nbtutils.c
--- b/src/backend/access/nbtree/nbtutils.c
***************
*** 1789,1795 **** _bt_killitems(IndexScanDesc scan, bool haveLock)
if (killedsomething)
{
opaque->btpo_flags |= BTP_HAS_GARBAGE;
! MarkBufferDirtyHint(so->currPos.buf);
}
if (!haveLock)
--- 1789,1795 ----
if (killedsomething)
{
opaque->btpo_flags |= BTP_HAS_GARBAGE;
! MarkBufferDirtyHint(so->currPos.buf, true);
}
if (!haveLock)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 7681,7692 **** XLogRestorePoint(const char *rpName)
* records. In that case, multiple copies of the same block would be recorded
* in separate WAL records by different backends, though that is still OK from
* a correctness perspective.
- *
- * Note that this only works for buffers that fit the standard page model,
- * i.e. those for which buffer_std == true
*/
XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer)
{
XLogRecPtr recptr = InvalidXLogRecPtr;
XLogRecPtr lsn;
--- 7681,7689 ----
* records. In that case, multiple copies of the same block would be recorded
* in separate WAL records by different backends, though that is still OK from
* a correctness perspective.
*/
XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
{
XLogRecPtr recptr = InvalidXLogRecPtr;
XLogRecPtr lsn;
***************
*** 7708,7714 **** XLogSaveBufferForHint(Buffer buffer)
* and reset rdata for any actual WAL record insert.
*/
rdata[0].buffer = buffer;
! rdata[0].buffer_std = true;
/*
* Check buffer while not holding an exclusive lock.
--- 7705,7711 ----
* and reset rdata for any actual WAL record insert.
*/
rdata[0].buffer = buffer;
! rdata[0].buffer_std = buffer_std;
/*
* Check buffer while not holding an exclusive lock.
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***************
*** 1118,1124 **** read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
HeapTupleHeaderSetXmax(seqtuple->t_data, InvalidTransactionId);
seqtuple->t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
seqtuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
! MarkBufferDirtyHint(*buf);
}
seq = (Form_pg_sequence) GETSTRUCT(seqtuple);
--- 1118,1124 ----
HeapTupleHeaderSetXmax(seqtuple->t_data, InvalidTransactionId);
seqtuple->t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
seqtuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
! MarkBufferDirtyHint(*buf, true);
}
seq = (Form_pg_sequence) GETSTRUCT(seqtuple);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 2587,2593 **** IncrBufferRefCount(Buffer buffer)
* (due to a race condition), so it cannot be used for important changes.
*/
void
! MarkBufferDirtyHint(Buffer buffer)
{
volatile BufferDesc *bufHdr;
Page page = BufferGetPage(buffer);
--- 2587,2593 ----
* (due to a race condition), so it cannot be used for important changes.
*/
void
! MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
{
volatile BufferDesc *bufHdr;
Page page = BufferGetPage(buffer);
***************
*** 2671,2677 **** MarkBufferDirtyHint(Buffer buffer)
* rather than full transactionids.
*/
MyPgXact->delayChkpt = delayChkpt = true;
! lsn = XLogSaveBufferForHint(buffer);
}
LockBufHdr(bufHdr);
--- 2671,2677 ----
* rather than full transactionids.
*/
MyPgXact->delayChkpt = delayChkpt = true;
! lsn = XLogSaveBufferForHint(buffer, buffer_std);
}
LockBufHdr(bufHdr);
*** a/src/backend/storage/freespace/freespace.c
--- b/src/backend/storage/freespace/freespace.c
***************
*** 216,222 **** XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
PageInit(page, BLCKSZ, 0);
if (fsm_set_avail(page, slot, new_cat))
! MarkBufferDirtyHint(buf);
UnlockReleaseBuffer(buf);
}
--- 216,222 ----
PageInit(page, BLCKSZ, 0);
if (fsm_set_avail(page, slot, new_cat))
! MarkBufferDirtyHint(buf, false);
UnlockReleaseBuffer(buf);
}
***************
*** 286,292 **** FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
return; /* nothing to do; the FSM was already smaller */
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
! MarkBufferDirtyHint(buf);
UnlockReleaseBuffer(buf);
new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
--- 286,292 ----
return; /* nothing to do; the FSM was already smaller */
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
! MarkBufferDirtyHint(buf, false);
UnlockReleaseBuffer(buf);
new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
***************
*** 619,625 **** fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
page = BufferGetPage(buf);
if (fsm_set_avail(page, slot, newValue))
! MarkBufferDirtyHint(buf);
if (minValue != 0)
{
--- 619,625 ----
page = BufferGetPage(buf);
if (fsm_set_avail(page, slot, newValue))
! MarkBufferDirtyHint(buf, false);
if (minValue != 0)
{
***************
*** 770,776 **** fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_set_avail(BufferGetPage(buf), slot, child_avail);
! MarkBufferDirtyHint(buf);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
}
--- 770,776 ----
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_set_avail(BufferGetPage(buf), slot, child_avail);
! MarkBufferDirtyHint(buf, false);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
}
*** a/src/backend/storage/freespace/fsmpage.c
--- b/src/backend/storage/freespace/fsmpage.c
***************
*** 284,290 **** restart:
exclusive_lock_held = true;
}
fsm_rebuild_page(page);
! MarkBufferDirtyHint(buf);
goto restart;
}
}
--- 284,290 ----
exclusive_lock_held = true;
}
fsm_rebuild_page(page);
! MarkBufferDirtyHint(buf, false);
goto restart;
}
}
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 121,127 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
}
tuple->t_infomask |= infomask;
! MarkBufferDirtyHint(buffer);
}
/*
--- 121,127 ----
}
tuple->t_infomask |= infomask;
! MarkBufferDirtyHint(buffer, true);
}
/*
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 267,273 **** extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
extern int XLogFileOpen(XLogSegNo segno);
! extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
--- 267,273 ----
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
extern int XLogFileOpen(XLogSegNo segno);
! extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer, bool buffer_std);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 204,210 **** extern Size BufferShmemSize(void);
extern void BufferGetTag(Buffer buffer, RelFileNode *rnode,
ForkNumber *forknum, BlockNumber *blknum);
! extern void MarkBufferDirtyHint(Buffer buffer);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
--- 204,210 ----
extern void BufferGetTag(Buffer buffer, RelFileNode *rnode,
ForkNumber *forknum, BlockNumber *blknum);
! extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
On 2013-06-14 09:21:12 -0700, Jeff Davis wrote:
On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
Jeff Davis has a patch pending
(1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
were on track committing that, but rereading the thread it doesn't look
that way. Jeff, care to update that patch?
Rebased and attached. Changed so all callers use buffer_std=true except
those in freespace.c and fsmpage.c.
Simon, did you (or anyone else) have an objection to this patch? If not,
I'll go ahead and commit it tomorrow morning.
I'd like to see a comment around the memcpys in XLogSaveBufferForHint()
that mentions that they are safe in a non std buffer due to
XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit
change of the copy algorithm there.
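For readers following along, the hole-skipping copy in question can be sketched like this. This is a simplified stand-in, not the actual xlog.c code; the `HoleInfo` struct here is a hypothetical abbreviation of BkpBlock's hole fields, and in the real XLogCheckBuffer() the hole is derived from pd_lower/pd_upper for standard pages and set to zero for non-standard ones:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Hypothetical stand-in for BkpBlock's hole fields. For a standard page,
 * the hole is the unused gap between pd_lower and pd_upper; for a
 * non-standard page, XLogCheckBuffer() sets both fields to 0. */
typedef struct
{
    uint16_t hole_offset;   /* bytes of page data before the hole */
    uint16_t hole_length;   /* bytes in the hole (0 for non-std pages) */
} HoleInfo;

/* Copy a page image, skipping the hole.  With hole_length == 0 (the
 * buffer_std == false case) the two memcpys degenerate into one full-page
 * copy, which is why the same code path is safe for both kinds of page. */
static size_t
copy_page_skipping_hole(char *dst, const char *src, HoleInfo h)
{
    memcpy(dst, src, h.hole_offset);
    memcpy(dst + h.hole_offset,
           src + h.hole_offset + h.hole_length,
           BLCKSZ - (h.hole_offset + h.hole_length));
    return BLCKSZ - h.hole_length;  /* bytes actually written */
}
```

That degenerate-to-full-copy property is exactly the invariant the requested comment should state.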
Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI
or something like that. I find the former name confusing...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage. So master contains some file
system level changes that standby does not have. At this point, the data
directory of master is ahead of standby's data directory.
Subsequently, the standby will be promoted as new master. Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need
to take the fresh backup from the new master. This can happen in both the
synchronous as well as asynchronous replication.
Fresh backup is also needed in case of clean switch-over because in the
current HEAD, the master does not wait for the standby to receive all the
WAL up to the shutdown checkpoint record before shutting down the connection.
Fujii Masao has already submitted a patch to handle clean switch-over case,
but the problem is still remaining for failback case.
The process of taking fresh backup is very time consuming when databases
are of very big sizes, say several TB's, and when the servers are connected
over a relatively slower link. This would break the service level
agreement of disaster recovery system. So there is need to improve the
process of disaster recovery in PostgreSQL. One way to achieve this is to
maintain
consistency between master and standby which helps to avoid need of fresh
backup.
So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the
corresponding WAL record is replicated to the standby.
How will you take care of the extra WAL on the old master during recovery? If
it replays WAL that has not reached the new master, it can be a problem.
You mean that it is possible for the old master's data to be ahead of the new
master's data, so there would be inconsistent data between those servers at
fail-back, right?
If so, that inconsistency is not possible, because with the proposed GUC
option (i.e., failback_safe_standby_mode = remote_flush), while the old
master is working, no file system level changes are made before the
corresponding WAL is replicated.
--
Regards,
-------
Sawada Masahiko
On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
How will you take care of the extra WAL on the old master during recovery? If
it replays WAL that has not reached the new master, it can be a problem.
You mean that it is possible for the old master's data to be ahead of the new
master's data.
What I mean is that the WAL of the old master can be ahead of the new master.
I understood that the data files of the old master can't be ahead, but I think
the WAL can be.
So there is inconsistent data between those servers at fail-back, right?
If so, that inconsistency is not possible, because with the proposed GUC
option (i.e., failback_safe_standby_mode = remote_flush), while the old
master is working, no file system level changes are made before the
corresponding WAL is replicated.
Will the proposed patch also take care that the old master's WAL is not ahead
in some way?
If yes, I think I am missing some point.
With Regards,
Amit Kapila.
On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
How will you take care of the extra WAL on the old master during recovery? If
it replays WAL that has not reached the new master, it can be a problem.
You mean that it is possible for the old master's data to be ahead of the new
master's data.
What I mean is that the WAL of the old master can be ahead of the new master.
I understood that the data files of the old master can't be ahead, but I think
the WAL can be.
So there is inconsistent data between those servers at fail-back, right?
If so, that inconsistency is not possible, because with the proposed GUC
option (i.e., failback_safe_standby_mode = remote_flush), while the old
master is working, no file system level changes are made before the
corresponding WAL is replicated.
Will the proposed patch also take care that the old master's WAL is not ahead
in some way?
If yes, I think I am missing some point.
Yes, it will happen that the old master's WAL is ahead of the new master's
WAL, as you said.
But I think we can solve that by deleting all WAL files when the old master
starts as the new standby.
Thoughts?
Regards,
-------
Sawada Masahiko
On Fri, 2013-06-14 at 18:27 +0200, Andres Freund wrote:
I'd like to see a comment around the memcpys in XLogSaveBufferForHint()
that mentions that they are safe in a non std buffer due to
XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit
change of the copy algorithm there.
Done.
Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI
or something like that. I find the former name confusing...
Also done.
Patch attached. Also, since we branched, I think this should be
back-patched to 9.3 as well.
Regards,
Jeff Davis
Attachments:
buffer-std-20130615.patch (text/x-patch; charset=UTF-8)
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***************
*** 287,293 **** hashgettuple(PG_FUNCTION_ARGS)
/*
* Since this can be redone later if needed, mark as a hint.
*/
! MarkBufferDirtyHint(buf);
}
/*
--- 287,293 ----
/*
* Since this can be redone later if needed, mark as a hint.
*/
! MarkBufferDirtyHint(buf, true);
}
/*
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 262,268 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
{
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
PageClearFull(page);
! MarkBufferDirtyHint(buffer);
}
}
--- 262,268 ----
{
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
PageClearFull(page);
! MarkBufferDirtyHint(buffer, true);
}
}
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 413,421 **** _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
* crucial. Be sure to mark the proper buffer dirty.
*/
if (nbuf != InvalidBuffer)
! MarkBufferDirtyHint(nbuf);
else
! MarkBufferDirtyHint(buf);
}
}
}
--- 413,421 ----
* crucial. Be sure to mark the proper buffer dirty.
*/
if (nbuf != InvalidBuffer)
! MarkBufferDirtyHint(nbuf, true);
else
! MarkBufferDirtyHint(buf, true);
}
}
}
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 1052,1058 **** restart:
opaque->btpo_cycleid == vstate->cycleid)
{
opaque->btpo_cycleid = 0;
! MarkBufferDirtyHint(buf);
}
}
--- 1052,1058 ----
opaque->btpo_cycleid == vstate->cycleid)
{
opaque->btpo_cycleid = 0;
! MarkBufferDirtyHint(buf, true);
}
}
*** a/src/backend/access/nbtree/nbtutils.c
--- b/src/backend/access/nbtree/nbtutils.c
***************
*** 1789,1795 **** _bt_killitems(IndexScanDesc scan, bool haveLock)
if (killedsomething)
{
opaque->btpo_flags |= BTP_HAS_GARBAGE;
! MarkBufferDirtyHint(so->currPos.buf);
}
if (!haveLock)
--- 1789,1795 ----
if (killedsomething)
{
opaque->btpo_flags |= BTP_HAS_GARBAGE;
! MarkBufferDirtyHint(so->currPos.buf, true);
}
if (!haveLock)
*** a/src/backend/access/rmgrdesc/xlogdesc.c
--- b/src/backend/access/rmgrdesc/xlogdesc.c
***************
*** 82,92 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
appendStringInfo(buf, "restore point: %s", xlrec->rp_name);
}
! else if (info == XLOG_HINT)
{
BkpBlock *bkp = (BkpBlock *) rec;
! appendStringInfo(buf, "page hint: %s block %u",
relpathperm(bkp->node, bkp->fork),
bkp->block);
}
--- 82,92 ----
appendStringInfo(buf, "restore point: %s", xlrec->rp_name);
}
! else if (info == XLOG_FPI)
{
BkpBlock *bkp = (BkpBlock *) rec;
! appendStringInfo(buf, "full-page image: %s block %u",
relpathperm(bkp->node, bkp->fork),
bkp->block);
}
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 7681,7692 **** XLogRestorePoint(const char *rpName)
* records. In that case, multiple copies of the same block would be recorded
* in separate WAL records by different backends, though that is still OK from
* a correctness perspective.
- *
- * Note that this only works for buffers that fit the standard page model,
- * i.e. those for which buffer_std == true
*/
XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer)
{
XLogRecPtr recptr = InvalidXLogRecPtr;
XLogRecPtr lsn;
--- 7681,7689 ----
* records. In that case, multiple copies of the same block would be recorded
* in separate WAL records by different backends, though that is still OK from
* a correctness perspective.
*/
XLogRecPtr
! XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
{
XLogRecPtr recptr = InvalidXLogRecPtr;
XLogRecPtr lsn;
***************
*** 7708,7714 **** XLogSaveBufferForHint(Buffer buffer)
* and reset rdata for any actual WAL record insert.
*/
rdata[0].buffer = buffer;
! rdata[0].buffer_std = true;
/*
* Check buffer while not holding an exclusive lock.
--- 7705,7711 ----
* and reset rdata for any actual WAL record insert.
*/
rdata[0].buffer = buffer;
! rdata[0].buffer_std = buffer_std;
/*
* Check buffer while not holding an exclusive lock.
***************
*** 7722,7727 **** XLogSaveBufferForHint(Buffer buffer)
--- 7719,7727 ----
* Copy buffer so we don't have to worry about concurrent hint bit or
* lsn updates. We assume pd_lower/upper cannot be changed without an
* exclusive lock, so the contents bkp are not racy.
+ *
+ * With buffer_std set to false, XLogCheckBuffer() sets hole_length and
+ * hole_offset to 0; so the following code is safe for either case.
*/
memcpy(copied_buffer, origdata, bkpb.hole_offset);
memcpy(copied_buffer + bkpb.hole_offset,
***************
*** 7744,7750 **** XLogSaveBufferForHint(Buffer buffer)
rdata[1].buffer = InvalidBuffer;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}
return recptr;
--- 7744,7750 ----
rdata[1].buffer = InvalidBuffer;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
}
return recptr;
***************
*** 8109,8122 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
{
/* nothing to do here */
}
! else if (info == XLOG_HINT)
{
char *data;
BkpBlock bkpb;
/*
! * Hint bit records contain a backup block stored "inline" in the
! * normal data since the locking when writing hint records isn't
* sufficient to use the normal backup block mechanism, which assumes
* exclusive lock on the buffer supplied.
*
--- 8109,8122 ----
{
/* nothing to do here */
}
! else if (info == XLOG_FPI)
{
char *data;
BkpBlock bkpb;
/*
! * Full-page image (FPI) records contain a backup block stored "inline"
! * in the normal data since the locking when writing hint records isn't
* sufficient to use the normal backup block mechanism, which assumes
* exclusive lock on the buffer supplied.
*
*** a/src/backend/commands/sequence.c
--- b/src/backend/commands/sequence.c
***************
*** 1118,1124 **** read_seq_tuple(SeqTable elm, Relation rel, Buffer *buf, HeapTuple seqtuple)
HeapTupleHeaderSetXmax(seqtuple->t_data, InvalidTransactionId);
seqtuple->t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
seqtuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
! MarkBufferDirtyHint(*buf);
}
seq = (Form_pg_sequence) GETSTRUCT(seqtuple);
--- 1118,1124 ----
HeapTupleHeaderSetXmax(seqtuple->t_data, InvalidTransactionId);
seqtuple->t_data->t_infomask &= ~HEAP_XMAX_COMMITTED;
seqtuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
! MarkBufferDirtyHint(*buf, true);
}
seq = (Form_pg_sequence) GETSTRUCT(seqtuple);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 2587,2593 **** IncrBufferRefCount(Buffer buffer)
* (due to a race condition), so it cannot be used for important changes.
*/
void
! MarkBufferDirtyHint(Buffer buffer)
{
volatile BufferDesc *bufHdr;
Page page = BufferGetPage(buffer);
--- 2587,2593 ----
* (due to a race condition), so it cannot be used for important changes.
*/
void
! MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
{
volatile BufferDesc *bufHdr;
Page page = BufferGetPage(buffer);
***************
*** 2671,2677 **** MarkBufferDirtyHint(Buffer buffer)
* rather than full transactionids.
*/
MyPgXact->delayChkpt = delayChkpt = true;
! lsn = XLogSaveBufferForHint(buffer);
}
LockBufHdr(bufHdr);
--- 2671,2677 ----
* rather than full transactionids.
*/
MyPgXact->delayChkpt = delayChkpt = true;
! lsn = XLogSaveBufferForHint(buffer, buffer_std);
}
LockBufHdr(bufHdr);
*** a/src/backend/storage/freespace/freespace.c
--- b/src/backend/storage/freespace/freespace.c
***************
*** 216,222 **** XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
PageInit(page, BLCKSZ, 0);
if (fsm_set_avail(page, slot, new_cat))
! MarkBufferDirtyHint(buf);
UnlockReleaseBuffer(buf);
}
--- 216,222 ----
PageInit(page, BLCKSZ, 0);
if (fsm_set_avail(page, slot, new_cat))
! MarkBufferDirtyHint(buf, false);
UnlockReleaseBuffer(buf);
}
***************
*** 286,292 **** FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
return; /* nothing to do; the FSM was already smaller */
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
! MarkBufferDirtyHint(buf);
UnlockReleaseBuffer(buf);
new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
--- 286,292 ----
return; /* nothing to do; the FSM was already smaller */
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
! MarkBufferDirtyHint(buf, false);
UnlockReleaseBuffer(buf);
new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
***************
*** 619,625 **** fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
page = BufferGetPage(buf);
if (fsm_set_avail(page, slot, newValue))
! MarkBufferDirtyHint(buf);
if (minValue != 0)
{
--- 619,625 ----
page = BufferGetPage(buf);
if (fsm_set_avail(page, slot, newValue))
! MarkBufferDirtyHint(buf, false);
if (minValue != 0)
{
***************
*** 770,776 **** fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_set_avail(BufferGetPage(buf), slot, child_avail);
! MarkBufferDirtyHint(buf);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
}
--- 770,776 ----
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
fsm_set_avail(BufferGetPage(buf), slot, child_avail);
! MarkBufferDirtyHint(buf, false);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}
}
*** a/src/backend/storage/freespace/fsmpage.c
--- b/src/backend/storage/freespace/fsmpage.c
***************
*** 284,290 **** restart:
exclusive_lock_held = true;
}
fsm_rebuild_page(page);
! MarkBufferDirtyHint(buf);
goto restart;
}
}
--- 284,290 ----
exclusive_lock_held = true;
}
fsm_rebuild_page(page);
! MarkBufferDirtyHint(buf, false);
goto restart;
}
}
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 121,127 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
}
tuple->t_infomask |= infomask;
! MarkBufferDirtyHint(buffer);
}
/*
--- 121,127 ----
}
tuple->t_infomask |= infomask;
! MarkBufferDirtyHint(buffer, true);
}
/*
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 267,273 **** extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
extern int XLogFileOpen(XLogSegNo segno);
! extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
--- 267,273 ----
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
extern int XLogFileOpen(XLogSegNo segno);
! extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer, bool buffer_std);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr record);
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 67,73 **** typedef struct CheckPoint
#define XLOG_RESTORE_POINT 0x70
#define XLOG_FPW_CHANGE 0x80
#define XLOG_END_OF_RECOVERY 0x90
! #define XLOG_HINT 0xA0
/*
--- 67,73 ----
#define XLOG_RESTORE_POINT 0x70
#define XLOG_FPW_CHANGE 0x80
#define XLOG_END_OF_RECOVERY 0x90
! #define XLOG_FPI 0xA0
/*
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 204,210 **** extern Size BufferShmemSize(void);
extern void BufferGetTag(Buffer buffer, RelFileNode *rnode,
ForkNumber *forknum, BlockNumber *blknum);
! extern void MarkBufferDirtyHint(Buffer buffer);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
--- 204,210 ----
extern void BufferGetTag(Buffer buffer, RelFileNode *rnode,
ForkNumber *forknum, BlockNumber *blknum);
! extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
On 2013-06-15 11:36:54 -0700, Jeff Davis wrote:
On Fri, 2013-06-14 at 18:27 +0200, Andres Freund wrote:
I'd like to see a comment around the memcpys in XLogSaveBufferForHint()
that mentions that they are safe in a non std buffer due to
XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit
change of the copy algorithm there.
Done.
Also done.
Thanks! Looks good to me.
Patch attached. Also, since we branched, I think this should be
back-patched to 9.3 as well.
Absolutely.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
How will you take care of the extra WAL on the old master during recovery? If
it replays WAL that has not reached the new master, it can be a problem.
You mean that it is possible for the old master's data to be ahead of the new
master's data.
What I mean is that the WAL of the old master can be ahead of the new master.
I understood that the data files of the old master can't be ahead, but I think
the WAL can be.
So there is inconsistent data between those servers at fail-back, right?
If so, that inconsistency is not possible, because with the proposed GUC
option (i.e., failback_safe_standby_mode = remote_flush), while the old
master is working, no file system level changes are made before the
corresponding WAL is replicated.
Will the proposed patch also take care that the old master's WAL is not ahead
in some way?
If yes, I think I am missing some point.
Yes, it will happen that the old master's WAL is ahead of the new master's
WAL, as you said.
But I think we can solve that by deleting all WAL files when the old master
starts as the new standby.
I think ideally it should reset the WAL location to the point where the new
master forked off.
In such a scenario it would be difficult for a user who wants to get a dump of
some data on the old master that hasn't gone to the new master. I am not sure
whether real users have such a need, but if they do, this solution will have
some drawbacks.
With Regards,
Amit Kapila.
On 14 June 2013 17:21, Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
Jeff Davis has a patch pending
(1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
were on track committing that, but rereading the thread it doesn't look
that way. Jeff, care to update that patch?
Rebased and attached. Changed so all callers use buffer_std=true except
those in freespace.c and fsmpage.c.
Simon, did you (or anyone else) have an objection to this patch? If not,
I'll go ahead and commit it tomorrow morning.
I didn't have a specific objection to the patch, I just wanted to
minimise change relating to this so we didn't introduce further bugs.
I've no objection to you committing that.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 14 June 2013 10:11, Samrat Revagade <revagade.samrat@gmail.com> wrote:
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the
corresponding WAL record is replicated to the standby.
1. The main objection, raised by Tom and others, is that we should not add
this feature and should go with the traditional way of taking a fresh backup
using rsync, because he was concerned about the additional complexity of the
patch and the performance overhead during normal operations.
2. Tom and others were also worried about the inconsistencies in the crashed
master and suggested that it's better to start with a fresh backup. Fujii
Masao and others correctly countered that, suggesting that we trust WAL
recovery to clear all such inconsistencies and there is no reason why we
can't do the same here.
So the patch is showing 1-2% performance overhead.
Let's have a look at this...
The objections you summarise that Tom has made are ones that I agree
with. I also don't think that Fujii "correctly countered" those
objections.
My perspective is that if the master crashed, assuming that you know
everything about that and suddenly jumping back on seem like a recipe
for disaster. Attempting that is currently blocked by the technical
obstacles you've identified, but that doesn't mean they are the only
ones - we don't yet understand what all the problems lurking might be.
Personally, I won't be following you onto that minefield anytime soon.
So I strongly object to calling this patch anything to do with
"failback safe". You simply don't have enough data to make such a bold
claim. (Which is why we call it synchronous replication and not "zero
data loss", for example).
But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).
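To make the suggestion concrete, such a setting might look as follows in postgresql.conf. This is purely a hypothetical sketch of the proposal; synchronous_transfer is not an actual PostgreSQL GUC:

```ini
# Hypothetical sketch -- synchronous_transfer does not exist in PostgreSQL.
# 'commit' would behave as today: wait for the standby only at transaction
# commit.  'all' would make every WAL transfer synchronous, so no
# file-system-level change happens before the corresponding WAL is
# replicated to the standby.
synchronous_standby_names = 'standby1'
synchronous_transfer = all        # all | commit (default: commit)
```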
The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.
I'm worried to see that adding this feature and yet turning it off
causes a measurable drop in performance. I don't think we want that
at all. That clearly needs more work and thought.
I also think your performance results are somewhat bogus. Fast
transaction workloads were already mostly commit waits - measurements
of what happens to large loads, index builds etc would likely reveal
something quite different.
I'm tempted by the thought that we should put the WaitForLSN inside
XLogFlush, rather than scatter additional calls everywhere and then
have us inevitably miss one.
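The appeal of centralizing the wait can be shown with a toy model (Python, purely illustrative; `xlog_flush` and `wait_for_lsn` are hypothetical stand-ins for the backend's XLogFlush and the proposed WaitForLSN, not actual PostgreSQL code):

```python
# Toy model of putting the standby-wait inside the flush routine itself,
# so every caller of xlog_flush() gets the synchronous-transfer guarantee
# without scattering separate wait calls at each file-system-change site.

class ToyMaster:
    def __init__(self):
        self.local_flush_lsn = 0      # WAL durably flushed on the master
        self.standby_flush_lsn = 0    # WAL confirmed flushed on the standby

    def replicate_up_to(self, lsn):
        # Stand-in for the walsender/walreceiver feedback loop.
        self.standby_flush_lsn = max(self.standby_flush_lsn, lsn)

    def wait_for_lsn(self, lsn):
        # Stand-in for the proposed WaitForLSN: block until the standby
        # has confirmed WAL up to `lsn`.  Here we "replicate"
        # synchronously instead of actually sleeping.
        self.replicate_up_to(lsn)

    def xlog_flush(self, lsn, synchronous_transfer_all=True):
        # Flush WAL locally, then (optionally) wait for the standby --
        # the single choke point suggested above, instead of extra calls
        # before every file-system change.
        self.local_flush_lsn = max(self.local_flush_lsn, lsn)
        if synchronous_transfer_all:
            self.wait_for_lsn(lsn)

m = ToyMaster()
m.xlog_flush(100)
# Any file-system change made after this point is covered: the standby
# has confirmed the WAL up to LSN 100.
```

Because the wait lives inside the flush routine, a newly added caller cannot forget it, which is exactly the "inevitably miss one" concern.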
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
So I strongly object to calling this patch anything to do with
"failback safe". You simply don't have enough data to make such a bold
claim. (Which is why we call it synchronous replication and not "zero
data loss", for example).
But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).
I agree with you about that.
Nowadays the need for a fresh backup in crash recovery seems to be a major
problem. We might need to change the name of the patch if there are other
problems with crash recovery too.
The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.
A different set of parameters is needed to differentiate between the
fail-safe standby and the synchronous standby; the fail-safe standby and the
standby in synchronous replication can be two different servers.
I'm worried to see that adding this feature and yet turning it off
causes a measureable drop in performance. I don't think we want that
at all. That clearly needs more work and thought.
I also think your performance results are somewhat bogus. Fast
transaction workloads were already mostly commit waits - measurements
of what happens to large loads, index builds etc would likely reveal
something quite different.
I will test the other scenarios and post the results.
--
Regards,
Samrat Revgade
On 16 June 2013 17:25, Samrat Revagade <revagade.samrat@gmail.com> wrote:
On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
So I strongly object to calling this patch anything to do with
"failback safe". You simply don't have enough data to make such a bold
claim. (Which is why we call it synchronous replication and not "zero
data loss", for example).But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).
I agree with you about that.
Nowadays the need for a fresh backup in crash recovery seems to be a major
problem. We might need to change the name of the patch if there are other
problems with crash recovery too.
(Sorry don't understand)
The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.
A different set of parameters is needed to differentiate between the
fail-safe standby and the synchronous standby; the fail-safe standby and the
standby in synchronous replication can be two different servers.
Why would they be different? What possible reason would you have for
that config? There is no *need* for those parameters, the proposal
could work perfectly well without them.
Let's make this patch fulfill the stated objectives, not add in
optional extras, especially ones that don't appear well thought
through. If you wish to enhance the design for the specification of
multi-node sync rep, make that a separate patch, later.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jun 16, 2013 at 11:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 16 June 2013 17:25, Samrat Revagade <revagade.samrat@gmail.com> wrote:
On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com>
wrote:
So I strongly object to calling this patch anything to do with
"failback safe". You simply don't have enough data to make such a bold
claim. (Which is why we call it synchronous replication and not "zero
data loss", for example).
But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).
I agree with you about that.
Nowadays the need for a fresh backup in crash recovery seems to be a major
problem. We might need to change the name of the patch if there are other
problems with crash recovery too.
(Sorry don't understand)
Sorry for the confusion. I will change the name of the patch.
The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.
A different set of parameters is needed to differentiate between the
fail-safe standby and the synchronous standby; the fail-safe standby and the
standby in synchronous replication can be two different servers.
Why would they be different? What possible reason would you have for
that config? There is no *need* for those parameters, the proposal
could work perfectly well without them.
Let's make this patch fulfill the stated objectives, not add in
optional extras, especially ones that don't appear well thought
through. If you wish to enhance the design for the specification of
multi-node sync rep, make that a separate patch, later.
I agree with you. I will remove the extra parameters if they are not
required in the next version of the patch.
--
Regards,
Samrat Revgade
On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
My perspective is that if the master crashed, assuming that you know
everything about that and suddenly jumping back on seem like a recipe
for disaster. Attempting that is currently blocked by the technical
obstacles you've identified, but that doesn't mean they are the only
ones - we don't yet understand what all the problems lurking might be.
Personally, I won't be following you onto that minefield anytime soon.
Would it be fair to say that a user will be willing to trust her crashed
master in all scenarios where she would have done so in a single-instance
setup? IOW, without the replication setup, AFAIU users have traditionally
trusted WAL recovery to recover from failed instances. This would
include some common failures such as power outages and hardware failures,
but may not include others such as on-disk corruption.
So I strongly object to calling this patch anything to do with
"failback safe". You simply don't have enough data to make such a bold
claim. (Which is why we call it synchronous replication and not "zero
data loss", for example).
I agree. We should probably find a better name for this. Any suggestions ?
But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).
It's an interesting idea, but I think there is some difference here. For
example, the proposed feature allows a backend to wait at points other than
commit. Since commits are more foreground in nature and this feature
does not require us to wait during common foreground activities, we want a
configuration where the master can wait for synchronous transfers at points
other than commits. Maybe we can solve that by having more granular control
in the said parameter?
The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.
How would we then distinguish between a synchronous standby and the new
kind of standby? I am told one of the very popular setups for DR is to have
one local sync standby and one async (maybe cascaded off the local sync).
Since this new feature is more useful for DR, because taking a fresh backup
over a slower link is even more challenging, IMHO we should support such
setups.
I'm worried to see that adding this feature and yet turning it off
causes a measureable drop in performance. I don't think we want that
at all. That clearly needs more work and thought.
I agree. We need to repeat those tests. I don't trust that turning the
feature off is causing a 1-2% drop. In one of the tests, I see turning the
feature on showing a better number than when it's turned off. That's
clearly noise, or it needs a concrete argument to be convincing.
I also think your performance results are somewhat bogus. Fast
transaction workloads were already mostly commit waits -
But not in case of async standby, right ?
measurements
of what happens to large loads, index builds etc would likely reveal
something quite different.
I agree. I also feel we need tests where FlushBuffer gets called more
often by normal backends, to see how much the added wait in that code path
hurts performance. Another important thing to test would be how it works
over slower/high-latency links.
I'm tempted by the thought that we should put the WaitForLSN inside
XLogFlush, rather than scatter additional calls everywhere and then
have us inevitably miss one.
That indeed seems cleaner.
Thanks,
Pavan
On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
I agree. We should probably find a better name for this. Any suggestions ?
err, I already made one...
But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).
Since commits are more foreground in nature and this feature
does not require us to wait during common foreground activities, we want a
configuration where master can wait for synchronous transfers at other than
commits. May we can solve that by having more granular control to the said
parameter ?
The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.
How would we then distinguish between synchronous and the new kind of
standby ?
That's not the point. The point is "Why would we have a new kind of
standby?" and therefore why do we need new parameters?
I am told, one of the very popular setups for DR is to have one
local sync standby and one async (may be cascaded by the local sync). Since
this new feature is more useful for DR because taking a fresh backup on a
slower link is even more challenging, IMHO we should support such setups.
...which still doesn't make sense to me. Let's look at that in detail.
Take 3 servers, A, B, C with A and B being linked by sync rep, and C
being safety standby at a distance.
Either A or B is master, except in disaster. So if A is master, then B
would be the failover target. If A fails, then you want to fail over to
B. Once B is the master, you want to fail back to A as the master. C
needs to follow the new master, whichever it is.
Suppose you set up sync rep between A and B and this new mode between A and
C. When B becomes the master, you need to fail back from B to A, but
you can't, because the new mode applied only between A and C, so you
have to fail back from C to A. So having the new mode not match
sync rep means you are forcing people to fail back using the slow link
in the common case.
You might observe that having the two modes match causes problems if A
and B fail, so you are forced to go to C as master and then eventually
failback to A or B across a slow link. That case is less common and
could be solved by extending sync transfer to more/multi nodes.
It definitely doesn't make sense to have sync rep on anything other
than a subset of sync transfer. So while it may be sensible in the
future to make sync transfer a superset of sync rep nodes, it makes
sense to make them the same config for now.
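The topology argument above can be modeled as a toy sketch (Python, illustrative only; the link layout and mode assignments are assumptions taken from the A/B/C scenario described in this message):

```python
# Three servers: A and B on a fast local link, C remote on a slow link.
# After A fails and B is promoted, A can only cheaply rejoin a server
# that maintained the failback-safe (sync transfer) guarantee with it.

FAST = {("A", "B"), ("B", "A")}   # the local pair

def failback_link(sync_transfer_peers):
    """Given the set of servers that had the failback-safe mode with A,
    return the link used to bring A back after B is promoted."""
    if "B" in sync_transfer_peers:
        return ("B", "A")          # fast local link: the common case
    return ("C", "A")              # forced onto the slow remote link

# New mode configured only between A and C (the setup criticized here):
assert failback_link({"C"}) == ("C", "A")   # failback over the slow link
# New mode matching sync rep (between A and B):
assert failback_link({"B"}) in FAST         # failback over the fast link
```

The sketch just makes concrete why having the new mode not match sync rep pushes the common failback case onto the slow link.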
Phew
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation; here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
How will you take care of extra WAL on the old master during recovery? If it
plays WAL which has not reached the new master, it can be a problem.
You mean that it is possible that the old master's data is ahead of the new
master's data?
What I mean to say is that the WAL of the old master can be ahead of the new
master. I understood that the data files of the old master can't be ahead,
but I think the WAL can be ahead.
So there is inconsistent data between those servers when failing back, right?
If so, there is no possible inconsistency, because if you use the GUC option
as proposed (i.e., failback_safe_standby_mode = remote_flush), no file
system level changes are done on the master before the WAL is replicated.
Will the proposed patch take care that the old master's WAL is also not
ahead in some way? If yes, I think I am missing some point.
Yes, it will happen that the old master's WAL is ahead of the new master's
WAL, as you said. But I think that we can solve that by deleting all WAL
files when the old master starts as a new standby.
I think ideally it should reset the WAL location to the point where the new
master has forked off.
In such a scenario it would be difficult for a user who wants to get a dump
of some data on the old master which hasn't gone to the new master. I am not
sure if such a need is there for real users, but if it is, then providing
this solution will have some drawbacks.
I think that we can dump the data before deleting all the WAL files. The
WAL file deletion is done when the old master starts as a new standby.
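The "reset to the point where the new master forked off" idea can be sketched as a toy model (Python, illustrative only; segment boundaries are plain integers standing in for LSNs, and real WAL segments would have to be matched against the fork LSN from the new master's timeline history):

```python
# Given the LSN at which the new master forked its timeline, decide which
# of the old master's WAL segments can be kept and which must be discarded
# before the old master rejoins as a standby.

def split_segments(segment_start_lsns, fork_lsn):
    """Return (keep, discard): segments that begin at or before the fork
    point are kept (the one containing the fork point is still partially
    valid); segments that begin after it contain only divergent WAL."""
    keep = [s for s in segment_start_lsns if s <= fork_lsn]
    discard = [s for s in segment_start_lsns if s > fork_lsn]
    return keep, discard

# Segments starting at LSNs 0, 16, 32, 48, 64; the new master forked at 40.
keep, discard = split_segments([0, 16, 32, 48, 64], fork_lsn=40)
# keep == [0, 16, 32], discard == [48, 64]
```

Deleting *all* WAL, as suggested above, is the blunt version of this; resetting to the fork point would preserve the WAL the two servers still share.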
Regards,
-------
Sawada Masahiko
On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation; here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
How will you take care of extra WAL on the old master during recovery? If it
plays WAL which has not reached the new master, it can be a problem.
You mean that it is possible that the old master's data is ahead of the new
master's data?
What I mean to say is that the WAL of the old master can be ahead of the new
master. I understood that the data files of the old master can't be ahead,
but I think the WAL can be ahead.
So there is inconsistent data between those servers when failing back, right?
If so, there is no possible inconsistency, because if you use the GUC option
as proposed (i.e., failback_safe_standby_mode = remote_flush), no file
system level changes are done on the master before the WAL is replicated.
Will the proposed patch take care that the old master's WAL is also not
ahead in some way? If yes, I think I am missing some point.
Yes, it will happen that the old master's WAL is ahead of the new master's
WAL, as you said. But I think that we can solve that by deleting all WAL
files when the old master starts as a new standby.
I think ideally it should reset the WAL location to the point where the new
master has forked off.
In such a scenario it would be difficult for a user who wants to get a dump
of some data on the old master which hasn't gone to the new master. I am not
sure if such a need is there for real users, but if it is, then providing
this solution will have some drawbacks.
I think that we can dump the data before deleting all the WAL files. The
WAL file deletion is done when the old master starts as a new standby.
Can we dump the data without starting the server?
With Regards,
Amit Kapila.
On Tuesday, June 18, 2013, Amit Kapila wrote:
On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation; here is the link for that:
/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
Let me again summarize the problem we are trying to address.
How will you take care of extra WAL on the old master during recovery? If it
plays WAL which has not reached the new master, it can be a problem.
You mean that it is possible that the old master's data is ahead of the new
master's data?
What I mean to say is that the WAL of the old master can be ahead of the new
master. I understood that the data files of the old master can't be ahead,
but I think the WAL can be ahead.
So there is inconsistent data between those servers when failing back, right?
If so, there is no possible inconsistency, because if you use the GUC option
as proposed (i.e., failback_safe_standby_mode = remote_flush), no file
system level changes are done on the master before the WAL is replicated.
Will the proposed patch take care that the old master's WAL is also not
ahead in some way? If yes, I think I am missing some point.
Yes, it will happen that the old master's WAL is ahead of the new master's
WAL, as you said. But I think that we can solve that by deleting all WAL
files when the old master starts as a new standby.
I think ideally it should reset the WAL location to the point where the new
master has forked off.
In such a scenario it would be difficult for a user who wants to get a dump
of some data on the old master which hasn't gone to the new master. I am not
sure if such a need is there for real users, but if it is, then providing
this solution will have some drawbacks.
I think that we can dump the data before deleting all the WAL files. The
WAL file deletion is done when the old master starts as a new standby.
Can we dump the data without starting the server?
Sorry, I made a mistake. We can't do it.
This proposed patch needs to be able to handle such scenarios in the future.
Regards,
---
Sawada Masahiko
On Wednesday, June 19, 2013 10:45 PM Sawada Masahiko wrote:
On Tuesday, June 18, 2013, Amit Kapila wrote:
On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com>
wrote:
On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila<amit.kapila@huawei.com> wrote:
On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila<amit.kapila@huawei.com> wrote:
On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
Hello,
I think that we can dump the data before deleting all the WAL files. The
WAL file deletion is done when the old master starts as a new standby.
Can we dump the data without starting the server?
Sorry, I made a mistake. We can't do it.
This proposed patch needs to be able to handle such scenarios in the future.
I am not sure the proposed patch can handle it so easily, but I think if
others also feel it is important, then a method should be provided to the
user for extracting his last committed data.
With Regards,
Amit Kapila.
On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
I agree. We should probably find a better name for this. Any suggestions ?
err, I already made one...
But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).
Since commits are more foreground in nature and this feature
does not require us to wait during common foreground activities, we want a
configuration where the master can wait for synchronous transfers at points
other than commits. Maybe we can solve that by having more granular control
in the said parameter?
The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.
How would we then distinguish between synchronous and the new kind of
standby?
That's not the point. The point is "Why would we have a new kind of
standby?" and therefore why do we need new parameters?
I am told one of the very popular setups for DR is to have one
local sync standby and one async (maybe cascaded off the local sync). Since
this new feature is more useful for DR, because taking a fresh backup over a
slower link is even more challenging, IMHO we should support such setups.
...which still doesn't make sense to me. Let's look at that in detail.
Take 3 servers, A, B, C with A and B being linked by sync rep, and C
being safety standby at a distance.
Either A or B is master, except in disaster. So if A is master, then B
would be the failover target. If A fails, then you want to fail over to
B. Once B is the master, you want to fail back to A as the master. C
needs to follow the new master, whichever it is.
Suppose you set up sync rep between A and B and this new mode between A and
C. When B becomes the master, you need to fail back from B to A, but
you can't, because the new mode applied only between A and C, so you
have to fail back from C to A. So having the new mode not match
sync rep means you are forcing people to fail back using the slow link
in the common case.
You might observe that having the two modes match causes problems if A
and B fail, so you are forced to go to C as master and then eventually
fail back to A or B across a slow link. That case is less common and
could be solved by extending sync transfer to more/multi nodes.
It definitely doesn't make sense to have sync rep on anything other
than a subset of sync transfer. So while it may be sensible in the
future to make sync transfer a superset of sync rep nodes, it makes
sense to make them the same config for now.
When two servers are linked by synchronous replication, those servers are in
the same location in many cases (e.g., the same server room),
so taking a full backup and sending it to the old master is not an issue.
This proposal works for the situation where those servers are put in
remote locations and the main site is powered down due to, for example, a
power failure or a natural disaster.
As you said, we can control file (e.g., CLOG, pg_control, etc.)
replication by adding a synchronous_transfer option.
But if we add only this parameter, we can handle only the following 2 cases:
1. synchronous standby that is also the failback-safe standby
2. asynchronous standby that is also the failback-safe standby
In the above cases, adding a new parameter might be meaningless. But I think
that we should handle not only cases 1 and 2 but also the following cases
3 and 4 for DR:
3. synchronous standby with a different asynchronous failback-safe standby
4. asynchronous standby with a different asynchronous failback-safe standby
To handle cases 3 and 4, we should set the parameter for each
standby, so we need to add a new parameter.
If we can structure replication in such situations, replication would
be more useful for users on slow links.
One idea for improving the parameters is to extend an ini file to set
parameters for each standby. For example:
--------------------
[Server]
standby_name = 'slave1'
synchronous_transfer = commit
wal_sender_timeout = 30
[Server]
standby_name = 'slave2'
synchronous_transfer = all
wal_sender_timeout = 50
-------------------
There have been discussions about such an ini file in the past. With it, we
could set each parameter for each standby.
Please give me feedback.
Regards,
-------
Sawada Masahiko
Hi,
One idea for improving the parameters is to extend an ini file to set
parameters for each standby. For example:
--------------------
[Server]
standby_name = 'slave1'
synchronous_transfer = commit
wal_sender_timeout = 30
[Server]
standby_name = 'slave2'
synchronous_transfer = all
wal_sender_timeout = 50
-------------------
Just asking to clarify:
Is 'slave2' a failback standby?
What does 'synchronous_transfer = all' mean? Does that mean wait
during both commit and checkpoint?
--
Amit Langote
On Mon, Jun 24, 2013 at 7:17 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--------------------
[Server]
standby_name = 'slave1'
synchronous_transfer = commit
wal_sender_timeout = 30
[Server]
standby_name = 'slave2'
synchronous_transfer = all
wal_sender_timeout = 50
-------------------
What different values/modes you are thinking for synchronous_transfer ?
IMHO only "commit" and "all" may not be enough. As I suggested upthread, we
may need an additional mode, say "data", which will ensure synchronous WAL
transfer before making any file system changes. We need this separate mode
because the failback safe (or whatever we call it) standby need not wait on
the commits and it's important to avoid that wait since it comes in a
direct path of client transactions.
If we are doing it, I wonder if an additional mode "none" also makes sense
so that users can also control asynchronous standbys via the same mechanism.
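To make the proposed mode set concrete, here is an illustrative sketch of which events the master would wait on under each mode. The mode names follow the suggestion above; the function itself is invented for illustration and is not PostgreSQL code.

```python
def master_waits_for(mode, event):
    """Would the master wait for the standby at this event under this mode?

    event is 'commit' (transaction commit) or 'data_flush' (a file system
    level change, e.g. flushing CLOG or pg_control pages).
    """
    waits = {
        'commit': {'commit'},             # current synchronous replication
        'data':   {'data_flush'},         # failback-safe only; commits don't wait
        'all':    {'commit', 'data_flush'},
        'none':   set(),                  # fully asynchronous standby
    }
    return event in waits[mode]
```

The property argued for above shows up in the 'data' row: the failback-safe standby is waited on for file system changes but never in the commit path of client transactions.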
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
Hi,
So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the
corresponding WAL record is replicated to the standby.
How will you take care of extra WAL on the old master during recovery? If it
replays WAL which has not reached the new master, it can be a problem.
I am trying to understand how there would be extra WAL on old master
that it would replay and cause inconsistency. Consider how I am
picturing it and correct me if I am wrong.
1) Master crashes. So a failback standby becomes new master forking the WAL.
2) Old master is restarted as a standby (now with this patch, without
a new base backup).
3) It would try to replay all the WAL it has available and later
connect to the new master also following the timeline switch (the
switch might happen using archived WAL and timeline history file OR
the new switch-over-streaming-replication-connection as of 9.3,
right?)
* in (3), when the new standby/old master is replaying WAL, from where
is it picking the WAL? Does it first replay all the WAL in pg_xlog
before archive? Should we make it check for a timeline history file in
archive before it starts replaying any WAL?
* And, would the new master, before forking the WAL, replay all the
WAL that is necessary to come to state (of data directory) that the
old master was just before it crashed?
Am I missing something here?
--
Amit Langote
On Tue, Jun 25, 2013 at 12:19 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
On Mon, Jun 24, 2013 at 7:17 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
--------------------
[Server]
standby_name = 'slave1'
synchronous_transfer = commit
wal_sender_timeout = 30
[Server]
standby_name = 'slave2'
synchronous_transfer = all
wal_sender_timeout = 50
-------------------
What different values/modes are you thinking of for synchronous_transfer? IMHO
only "commit" and "all" may not be enough. As I suggested upthread, we may
need an additional mode, say "data", which will ensure synchronous WAL
transfer before making any file system changes. We need this separate mode
because the failback safe (or whatever we call it) standby need not wait on
the commits and it's important to avoid that wait since it comes in a direct
path of client transactions.
If we are doing it, I wonder if an additional mode "none" also makes sense
so that users can also control asynchronous standbys via the same mechanism.
I mixed up the parameter names synchronous_transfer and
failback_safe_standby_mode.
I mean that we control file system changes using
failback_safe_standby_mode. If failback_safe_standby_mode is set to
'remote_flush', the master server waits for all data pages (e.g., CLOG,
pg_control) to be flushed on the standby server.
Right?
For example:
------
[server]
standby_name = 'slave1'
failback_safe_standby_mode = remote_flush
wal_sender_timeout = 50
------
In this case, we should also set synchronous_commit and
synchronous_level for each standby server. That is, do we need to set the
following 3 parameters to support cases 3 and 4 as I said?
- synchronous_commit = on/off/local/remote_write
- failback_safe_standby_mode = off/remote_write/remote_flush
- synchronous_level = sync/async (this parameter specifies whether the
standby server is connected in sync or async mode)
Please give me your feedback.
Regards,
-------
Sawada Masahiko
On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote:
Hi,
So our proposal on this problem is that we must ensure that the master should
not make any file system level changes without confirming that the
corresponding WAL record is replicated to the standby.
How will you take care of extra WAL on old master during recovery? If it
plays the WAL which has not reached new-master, it can be a problem.
I am trying to understand how there would be extra WAL on old master
that it would replay and cause inconsistency. Consider how I am
picturing it and correct me if I am wrong.
1) Master crashes. So a failback standby becomes new master forking the WAL.
2) Old master is restarted as a standby (now with this patch, without
a new base backup).
3) It would try to replay all the WAL it has available and later
connect to the new master also following the timeline switch (the
switch might happen using archived WAL and timeline history file OR
the new switch-over-streaming-replication-connection as of 9.3,
right?)
* in (3), when the new standby/old master is replaying WAL, from where
is it picking the WAL?
Yes, this is the point which can lead to inconsistency. The new standby/old master
will replay WAL after the last successful checkpoint, for which it gets info from
the control file. It picks up the WAL from the location where it was logged when it was active (pg_xlog).
Does it first replay all the WAL in pg_xlog
before archive? Should we make it check for a timeline history file in
archive before it starts replaying any WAL?
I have really not thought about what the best solution for this problem is.
* And, would the new master, before forking the WAL, replay all the
WAL that is necessary to come to state (of data directory) that the
old master was just before it crashed?
I don't think the new master has any correlation with the old master's data directory.
Rather, it will replay the WAL it has received/flushed before it starts acting as master.
With Regards,
Amit Kapila.
On Mon, Jun 24, 2013 at 10:47 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
1. synchronous standby and make same as failback safe standby
2. asynchronous standby and make same as failback safe standby
In the above case, adding a new parameter might be meaningless. But I think
that we should handle not only cases 1 and 2 but also the following cases
3 and 4 for DR.
To support cases 1 and 2, I'm thinking of the following 2 alternative ideas.
------------------------
We add synchronous_transfer (commit / data_flush / all). This GUC will
only affect the standbys mentioned in the list of
synchronous_standby_names.
1. If synchronous_transfer is set to commit, current synchronous
replication behavior is achieved
2. If synchronous_transfer is set to data_flush, the standbys named in
synchronous_standby_names will act as ASYNC failback safe standbys
3. If synchronous_transfer is set to all, the standbys named in
synchronous_standby_names will act as SYNC failback safe standbys
In this approach, case 2 is confusing because we are actually setting up an
ASYNC standby by using the GUCs meant for sync standby setup.
-------------------------
We extend synchronous_commit so that it also accepts a value like 'all'.
(This approach doesn't provide a 'synchronous_transfer' parameter.)
The 'all' value means that the master waits not only for replicated WAL but
also for replicated data pages (e.g., CLOG, pg_control), and the master
changes its behavior depending on whether the standby is connected as sync or async.
1. If synchronous_commit is set to 'all' and synchronous_standby_name
is set to the standby name, the standbys named in
synchronous_standby_names will act as SYNC failback safe standbys.
2. If synchronous_commit is set to 'all' and synchronous_standby_name
is NOT set to the standby name, the standbys which are connecting to the
master will act as ASYNC failback safe standbys.
One problem with not naming an ASYNC standby explicitly is that the
master has no clue which standby to wait on.
If it chooses to wait on all async standbys for failback-safety, that
can be quite detrimental, especially because async standbys can become
easily unreachable if they are on a slow link or at a remote location.
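The first idea above can be summarized as a small decision table. The sketch below is only illustrative (the function and role labels are invented, not patch code):

```python
def standby_role(synchronous_transfer, named_in_sync_standby_names):
    """Role a standby would play under the first idea above."""
    if not named_in_sync_standby_names:
        return 'async'                        # the GUC only affects named standbys
    return {
        'commit':     'sync',                 # current sync replication behavior
        'data_flush': 'async failback safe',
        'all':        'sync failback safe',
    }[synchronous_transfer]
```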
please give me feedback.
Regards,
-------
Sawada Masahiko
On Mon, Jun 17, 2013 at 7:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I am told, one of the very popular setups for DR is to have one
local sync standby and one async (may be cascaded by the local sync). Since
this new feature is more useful for DR because taking a fresh backup on a
slower link is even more challenging, IMHO we should support such setups.
...which still doesn't make sense to me. Let's look at that in detail.
Take 3 servers, A, B, C with A and B being linked by sync rep, and C
being safety standby at a distance.
Either A or B is master, except in disaster. So if A is master, then B
would be the failover target. If A fails, then you want to failover to
B. Once B is the target, you want to failback to A as the master. C
needs to follow the new master, whichever it is.
If you set up sync rep between A and B and this new mode between A and
C, then when B becomes the master you need to failback from B to A, but
you can't because the new mode applied between A and C only, so you
have to failback from C to A. So having the new mode not match with
sync rep means you are forcing people to failback using the slow link
in the common case.
It's true that in this scenario that doesn't really make sense, but I
still think they are separate properties. You could certainly want
synchronous replication without this new property, if you like the
data-loss guarantees that sync rep provides but don't care about
failback. You could also want this new property without synchronous
replication, if you don't need the data-loss guarantees that sync rep
provides but you do care about fast failback. I admit it seems
unlikely that you would use both features but not target them at the
same machines, although maybe: perhaps you have a sync standby and an
async standby and want this new property with respect to both of them.
In my admittedly limited experience, the use case for a lot of this
technology is in the cloud. The general strategy seems to be: at the
first sign of trouble, kill the offending instance and fail over.
This can result in failing over pretty frequently, and needing it to
be fast. There may be no real hardware problem; indeed, the failover
may be precipitated by network conditions or overload of the physical
host backing the virtual machine or any number of other nonphysical
problems. I can see this being useful in that environment, even for
async standbys. People can apparently tolerate a brief interruption
while their primary gets killed off and connections are re-established
with the new master, but they need the failover to be fast. The
problem with the status quo is that, even if the first failover is
fast, the second one isn't, because it has to wait behind rebuilding
the original master.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote:
[...]
* And, would the new master, before forking the WAL, replay all the
WAL that is necessary to come to state (of data directory) that the
old master was just before it crashed?
I don't think the new master has any correlation with the old master's data
directory. Rather, it will replay the WAL it has received/flushed before it
starts acting as master.
When the old master fails over, the WAL that is ahead of the new master
might be broken data, so when a user wants to take a dump from the old
master, the dump may possibly fail.
It is just an idea: we extend the parameters used in recovery.conf with
something like 'follow_master_force'. This parameter accepts 'on' and 'off',
and is effective only when standby_mode is set to on.
If both 'follow_master_force' and 'standby_mode' are set to 'on':
1. When the standby server starts recovery, it skips applying the WAL in
pg_xlog and requests WAL from its latest checkpoint LSN from the master
server.
2. The master server receives the standby's latest checkpoint LSN and
compares it with the LSN of its own latest checkpoint. If those LSNs match,
the master will send WAL from the latest checkpoint LSN; if not, the master
will inform the standby that it failed.
3. The standby will fork its WAL, and continuously apply the WAL sent from
the master.
In this approach, a user who wants to take a dump from the old master will
set follow_master_force and standby_mode to 'off', and take the dump of the
old master after it has started. OTOH, a user who wants to force replication
to start will set both parameters to 'on'.
please give me feedback.
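The three steps above amount to a small handshake; here is a toy model of it. This is illustrative only: LSNs are modeled as plain integers, and all names are invented rather than taken from the patch.

```python
def standby_startup(follow_master_force, standby_mode,
                    standby_ckpt_lsn, master_ckpt_lsn):
    """Toy model of the proposed startup handshake.

    Returns ('stream', lsn) when the master agrees to send WAL from the
    common checkpoint, ('replay_local', None) when the feature is off
    (today's behavior: replay local pg_xlog first), or ('fail', None)
    when the two latest-checkpoint LSNs diverge.
    """
    if not (follow_master_force and standby_mode):
        return ('replay_local', None)
    # Step 1: skip local pg_xlog, send our latest checkpoint LSN to the master.
    # Step 2: the master compares it with its own latest checkpoint LSN.
    if standby_ckpt_lsn == master_ckpt_lsn:
        # Step 3: the standby forks its WAL and applies what the master sends.
        return ('stream', master_ckpt_lsn)
    return ('fail', None)
```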
Regards,
-------
Sawada Masahiko
On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote:
[...]
it is just idea, we extend parameter which is used in recovery.conf
like 'follow_master_force'. this parameter accepts 'on' and 'off', is
effective only when standby_mode is set to on.
If both parameters 'follow_master_force' and 'standby_mode' are set to 'on':
1. when standby server starts and starts to recovery, standby server
skip to apply WAL which is in pg_xlog, and request WAL from latest
checkpoint LSN to master server.
2. master server receives LSN which is standby server latest
checkpoint, and compare between LSN of standby and LSN of master
latest checkpoint. if those LSN match, master will send WAL from
latest checkpoint LSN. if not, master will inform standby that failed.
3. standby will fork WAL, and apply WAL which is sent from master
continuity.
Please consider if this solution has the same problem as mentioned by Robert Hass in below mail:
/messages/by-id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com
in this approach, user who want to dump from old master will set 'off'
to follow_master_force and standby_mode, and gets the dump of old
master after master started. OTOH, user who want to starts replication
force will set 'on' to both parameter.
I think before going into the solution of this problem, it should be
confirmed by others whether such a problem needs to be resolved as part of
this patch.
I have seen that Simon Riggs is a reviewer of this patch and he hasn't
mentioned his views about this problem, so I think it's not worth inventing
a solution.
Rather, I think if all other things are resolved for this patch, then maybe
in the end we can check with the committer whether he thinks this problem
needs to be solved as a separate patch.
With Regards,
Amit Kapila.
On Tue, Jul 2, 2013 at 2:45 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote:
[...]
Please consider if this solution has the same problem as mentioned by Robert Hass in below mail:
/messages/by-id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com
I think before going into solution of this problem, it should be confirmed by others whether such a problem
needs to be resolved as part of this patch. I have seen that Simon Riggs is a reviewer of this Patch and he hasn't
mentioned his views about this problem. So I think it's not worth inventing a solution.
Rather I think if all other things are resolved for this patch, then may be in end we can check with Committer,
if he thinks that this problem needs to be solved as a separate patch.
Thank you for the feedback.
Yes, we can consider those problems separately, and we need to judge
whether it is worth inventing a solution.
I think that solving the root of this problem is complex. It might need a
big change to the replication architecture.
So I'm thinking that I'd like to deal with it somehow during recovery. If we
deal with it at recovery time, I think the performance impact can be ignored.
Regards,
-------
Sawada Masahiko
On Tuesday, July 02, 2013 11:16 AM Amit Kapila wrote:
On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote:
[...]
Please consider if this solution has the same problem as mentioned by
Robert Hass in below mail:
Sorry typo error, it's Robert Haas mail:
http://www.postgresql.org/message-id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com
[...]
On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
[...]
It definitely doesn't make sense to have sync rep on anything other
than a subset of sync transfer. So while it may be sensible in the
future to make sync transfer a superset of sync rep nodes, it makes
sense to make them the same config for now.
I have updated the patch.
We support the following 2 cases:
1. SYNC standby that also acts as a failback safe standby
2. ASYNC standby that also acts as a failback safe standby
Changes:
1. Changed the parameter names.
We gave up the 'failback_safe_standby_names' parameter from the first patch,
and changed the parameter name from 'failback_safe_mode' to
'synchronous_transfer'.
This parameter accepts 'all', 'data_flush' and 'commit'.
- 'commit'
'commit' means that the master waits for the corresponding WAL to be flushed
to disk on the standby server on commits,
but the master doesn't wait for replicated data pages.
- 'data_flush'
'data_flush' means that the master waits for replicated data pages
(e.g., CLOG, pg_control) before flushing them to disk on the master server.
But if the user sets this parameter to 'data_flush', the
'synchronous_commit' value is ignored even if the user sets
'synchronous_commit'.
- 'all'
'all' means that the master waits for both replicated WAL and data pages.
2. Put SyncRepWaitForLSN() into the XLogFlush() function.
We have put the SyncRepWaitForLSN() call into the XLogFlush() function,
and changed the arguments of XLogFlush().
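The ordering this change enforces can be modeled in a few lines. The sketch below is a toy (names like ToyMaster and xlog_flush are invented; it mirrors only the control flow, not the real C code): before a data page whose changes are covered by WAL up to page_lsn is flushed, the WAL is flushed locally and, when synchronous_transfer requires it, also confirmed flushed on the standby.

```python
class ToyMaster:
    """Toy model of XLogFlush() with SyncRepWaitForLSN() folded in."""

    def __init__(self, synchronous_transfer):
        self.synchronous_transfer = synchronous_transfer
        self.local_flush_lsn = 0    # how far local WAL is flushed
        self.standby_flush_lsn = 0  # how far the standby has flushed WAL
        self.log = []               # ordered record of what happened

    def xlog_flush(self, lsn, for_data_flush=False):
        self.local_flush_lsn = max(self.local_flush_lsn, lsn)
        self.log.append(('local_wal_flush', lsn))
        if for_data_flush and self.synchronous_transfer in ('data_flush', 'all'):
            # Stand-in for SyncRepWaitForLSN(): block until the standby
            # reports WAL flushed up to lsn (instantaneous in this toy).
            self.standby_flush_lsn = max(self.standby_flush_lsn, lsn)
            self.log.append(('wait_standby_flush', lsn))

    def flush_data_page(self, page_lsn):
        # WAL-before-data, and (when configured) standby-WAL-before-data.
        self.xlog_flush(page_lsn, for_data_flush=True)
        self.log.append(('data_page_flush', page_lsn))
```

In the 'data_flush' and 'all' modes the wait happens before the data page reaches disk, which is exactly the failback-safety guarantee; in 'commit' mode the data page flush proceeds without waiting for the standby.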
These are the setup cases and the parameters they need:
- SYNC standby that is also a failback-safe standby (case 1)
synchronous_transfer = all
synchronous_commit = remote_write/on
synchronous_standby_names = <ServerName>
- ASYNC standby that is also a failback-safe standby (case 2)
synchronous_transfer = data_flush
(the synchronous_commit value is ignored)
- default SYNC replication
synchronous_transfer = commit
synchronous_commit = on
synchronous_standby_names = <ServerName>
- default ASYNC replication
synchronous_transfer = commit
ToDo
1. Currently this patch supports only a single synchronous transfer
mode, so we can't set a different mode for each server.
We need to improve the patch to support the following cases:
- a SYNC standby plus a separate ASYNC failback-safe standby
- an ASYNC standby plus a separate ASYNC failback-safe standby
2. We have not measured performance yet; we need to measure it.

Please give me your feedback.
Regards,
-------
Sawada Masahiko
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Jul 7, 2013 at 4:19 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
I agree. We should probably find a better name for this. Any suggestions?

err, I already made one...

But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate, e.g.
synchronous_transfer = all | commit (default).

Since commits are more foreground in nature and this feature
does not require us to wait during common foreground activities, we want a
configuration where the master can wait for synchronous transfers at other than
commits. Maybe we can solve that by having more granular control over the said
parameter?

The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.

How would we then distinguish between synchronous and the new kind of
standby?

That's not the point. The point is "Why would we have a new kind of
standby?" and therefore why do we need new parameters?

I am told, one of the very popular setups for DR is to have one
local sync standby and one async (may be cascaded by the local sync). Since
this new feature is more useful for DR because taking a fresh backup on a
slower link is even more challenging, IMHO we should support such setups.

...which still doesn't make sense to me. Let's look at that in detail.
Take 3 servers, A, B, C, with A and B being linked by sync rep, and C
being the safety standby at a distance.

Either A or B is master, except in disaster. So if A is master, then B
would be the failover target. If A fails, then you want to failover to
B. Once B is the target, you want to failback to A as the master. C
needs to follow the new master, whichever it is.
I'm sorry, I forgot to attach the patch.
Please see the attached file.
Regards,
-------
Sawada Masahiko
Attachments:
failback_safe_standby_v2.patch (application/octet-stream)
*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 722,728 **** WriteTruncateXlogRec(int pageno)
rdata.buffer = InvalidBuffer;
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
! XLogFlush(recptr);
}
/*
--- 722,728 ----
rdata.buffer = InvalidBuffer;
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
! XLogFlush(recptr, true, false);
}
/*
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
***************
*** 696,704 **** SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
* XLogFlush were to fail, we must PANIC. This isn't much of a
* restriction because XLogFlush is just about all critical
* section anyway, but let's make sure.
*/
START_CRIT_SECTION();
! XLogFlush(max_lsn);
END_CRIT_SECTION();
}
}
--- 696,706 ----
* XLogFlush were to fail, we must PANIC. This isn't much of a
* restriction because XLogFlush is just about all critical
* section anyway, but let's make sure.
+ * Also wait for the synchronous standby to receive WAL up to
+ * max_lsn.
*/
START_CRIT_SECTION();
! XLogFlush(max_lsn, true, true);
END_CRIT_SECTION();
}
}
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1049,1055 **** EndPrepare(GlobalTransaction gxact)
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
! XLogFlush(gxact->prepare_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
--- 1049,1062 ----
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
!
! /*
! * Wait for synchronous replication, if required.
! *
! * Note that at this stage we have marked the prepare, but still show as
! * running in the procarray (twice!) and continue to hold locks.
! */
! XLogFlush(gxact->prepare_lsn, false, true);
/* If we crash now, we have prepared: WAL replay will fix things */
***************
*** 1090,1103 **** EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked the prepare, but still show as
- * running in the procarray (twice!) and continue to hold locks.
- */
- SyncRepWaitForLSN(gxact->prepare_lsn);
-
records.tail = records.head = NULL;
}
--- 1097,1102 ----
***************
*** 2046,2053 **** RecordTransactionCommitPrepared(TransactionId xid,
* a contradiction)
*/
! /* Flush XLOG to disk */
! XLogFlush(recptr);
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
--- 2045,2058 ----
* a contradiction)
*/
! /*
! * Flush XLOG to disk,
! * Wait for synchronous replication, if required.
! *
! * Note that at this stage we have marked clog, but still show as running
! * in the procarray and continue to hold locks.
! */
! XLogFlush(recptr, false, true);
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
***************
*** 2056,2069 **** RecordTransactionCommitPrepared(TransactionId xid,
MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
-
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked clog, but still show as running
- * in the procarray and continue to hold locks.
- */
- SyncRepWaitForLSN(recptr);
}
/*
--- 2061,2066 ----
***************
*** 2126,2133 **** RecordTransactionAbortPrepared(TransactionId xid,
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT_PREPARED, rdata);
! /* Always flush, since we're about to remove the 2PC state file */
! XLogFlush(recptr);
/*
* Mark the transaction aborted in clog. This is not absolutely necessary
--- 2123,2136 ----
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT_PREPARED, rdata);
! /*
! * Always flush, since we're about to remove the 2PC state file.
! * Wait for synchronous replication, if required.
! *
! * Note that at this stage we have marked clog, but still show as running
! * in the procarray and continue to hold locks.
! */
! XLogFlush(recptr, false, true);
/*
* Mark the transaction aborted in clog. This is not absolutely necessary
***************
*** 2136,2147 **** RecordTransactionAbortPrepared(TransactionId xid,
TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
-
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked clog, but still show as running
- * in the procarray and continue to hold locks.
- */
- SyncRepWaitForLSN(recptr);
}
--- 2139,2142 ----
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 965,970 **** RecordTransactionCommit(void)
--- 965,971 ----
SharedInvalidationMessage *invalMessages = NULL;
bool RelcacheInitFileInval = false;
bool wrote_xlog;
+ bool ret = false;
/* Get data needed for commit record */
nrels = smgrGetPendingDeletes(true, &rels);
***************
*** 1143,1150 **** RecordTransactionCommit(void)
if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
forceSyncCommit || nrels > 0)
{
! XLogFlush(XactLastRecEnd);
!
/*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
--- 1144,1151 ----
if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
forceSyncCommit || nrels > 0)
{
! XLogFlush(XactLastRecEnd, false, true);
! ret = true;
/*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
***************
*** 1195,1201 **** RecordTransactionCommit(void)
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
--- 1196,1202 ----
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd, false, !ret);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
***************
*** 4663,4669 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* after deletion, but that would leave a small window where the
* WAL-first rule would be violated.
*/
! XLogFlush(lsn);
for (i = 0; i < nrels; i++)
{
--- 4664,4670 ----
* after deletion, but that would leave a small window where the
* WAL-first rule would be violated.
*/
! XLogFlush(lsn, true, true);
for (i = 0; i < nrels; i++)
{
***************
*** 4690,4696 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
! XLogFlush(lsn);
}
--- 4691,4697 ----
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
! XLogFlush(lsn, true, false);
}
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1898,1904 **** UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
* already held, and we try to avoid acquiring it if possible.
*/
void
! XLogFlush(XLogRecPtr record)
{
XLogRecPtr WriteRqstPtr;
XLogwrtRqst WriteRqst;
--- 1898,1904 ----
* already held, and we try to avoid acquiring it if possible.
*/
void
! XLogFlush(XLogRecPtr record, bool ForDataFlush, bool Wait)
{
XLogRecPtr WriteRqstPtr;
XLogwrtRqst WriteRqst;
***************
*** 2033,2038 **** XLogFlush(XLogRecPtr record)
--- 2033,2040 ----
/* wake up walsenders now that we've released heavily contended locks */
WalSndWakeupProcessRequests();
+ SyncRepWaitForLSN(WriteRqstPtr, ForDataFlush, Wait);
+
/*
* If we still haven't flushed to the request point then we have a
* problem; most likely, the requested flush point is past end of XLOG.
***************
*** 7092,7098 **** CreateCheckPoint(int flags)
XLOG_CHECKPOINT_ONLINE,
&rdata);
! XLogFlush(recptr);
/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will be
--- 7094,7105 ----
XLOG_CHECKPOINT_ONLINE,
&rdata);
! /*
! * At this point, ensure that the synchronous standby has received the
! * checkpoint WAL. Otherwise failure after the control file update will
! * cause the master to start from a location not known to the standby
! */
! XLogFlush(recptr, true, !shutdown);
/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will be
***************
*** 7249,7255 **** CreateEndOfRecoveryRecord(void)
recptr = XLogInsert(RM_XLOG_ID, XLOG_END_OF_RECOVERY, &rdata);
! XLogFlush(recptr);
/*
* Update the control file so that crash recovery can follow the timeline
--- 7256,7262 ----
recptr = XLogInsert(RM_XLOG_ID, XLOG_END_OF_RECOVERY, &rdata);
! XLogFlush(recptr, true, true);
/*
* Update the control file so that crash recovery can follow the timeline
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 285,293 **** RelationTruncate(Relation rel, BlockNumber nblocks)
* or visibility map. If we crashed during that window, we'd be left
* with a truncated heap, but the FSM or visibility map would still
* contain entries for the non-existent heap pages.
*/
if (fsm || vm)
! XLogFlush(lsn);
}
/* Do the real work */
--- 285,297 ----
* or visibility map. If we crashed during that window, we'd be left
* with a truncated heap, but the FSM or visibility map would still
* contain entries for the non-existent heap pages.
+ *
+ * Also ensure that the WAL is received by the synchronous standby.
+ * Otherwise, we may have a situation where the heap is truncated, but
+ * the action is never replayed on the standby
*/
if (fsm || vm)
! XLogFlush(lsn, true, true);
}
/* Do the real work */
***************
*** 519,525 **** smgr_redo(XLogRecPtr lsn, XLogRecord *record)
* after truncation, but that would leave a small window where the
* WAL-first rule could be violated.
*/
! XLogFlush(lsn);
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
--- 523,529 ----
* after truncation, but that would leave a small window where the
* WAL-first rule could be violated.
*/
! XLogFlush(lsn, true, false);
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 65,70 **** char *SyncRepStandbyNames;
--- 65,72 ----
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+ static int SyncTransferMode = SYNC_REP_NO_WAIT;
+ int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
***************
*** 85,109 **** static bool SyncRepQueueIsOrderedByLSN(int mode);
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING before adding ourselves
! * to the wait queue. During SyncRepWakeQueue() a WALSender changes
! * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
! * This backend then resets its state to SYNC_REP_NOT_WAITING.
*/
! void
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
{
char *new_status = NULL;
const char *old_status;
! int mode = SyncRepWaitMode;
/*
* Fast exit if user has not requested sync replication, or there are no
* sync replication standby names defined. Note that those standbys don't
* need to be connected.
*/
! if (!SyncRepRequested() || !SyncStandbysDefined())
! return;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
--- 87,147 ----
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
! * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
! * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
! * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
! *
! * ForDataFlush - if TRUE, we wait in the data page flush mode;
! * otherwise wait in the sync standby mode
! *
! * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
! * the standby has already progressed up to the given XactCommitLSN
! *
! * Return TRUE if either the sync standby is not
! * configured/turned off OR the standby has made enough progress
*/
! bool
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
! int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
! bool ret;
/*
* Fast exit if user has not requested sync replication, or there are no
* sync replication standby names defined. Note that those standbys don't
* need to be connected.
*/
! if ((!SyncRepRequested() || !SyncStandbysDefined()) &&
! !SyncTransRequested() && !ForDataFlush)
! return true;
!
! /*
! * If the caller has specified ForDataFlush, but synchronous transfer
! * is not specified or its turned off, exit.
! *
! * We would like to allow the failback safe mechanism for cascaded
! * standbys as well. But we can't really wait for the standby to catch
! * up until we reach a consistent state, since the standbys won't even
! * be able to connect until we reach that state (XXX Confirm)
! */
! if ((!SyncTransRequested()) && ForDataFlush)
! return true;
!
! /*
! * If the caller has not specified ForDataFlush, but synchronous commit
! * is skipped by values of synchronous_transfer, exit.
! */
! if (IsSyncRepSkipped() && !ForDataFlush)
! return true;
!
! /*
! * Exit if we are told not to block on the standby.
! */
! if (!Wait)
! return false;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
***************
*** 119,129 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
*/
! if (!WalSndCtl->sync_standbys_defined ||
! XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return;
}
/*
--- 157,166 ----
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
*/
! if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) || XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return true;
}
/*
***************
*** 150,155 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 187,194 ----
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
***************
*** 186,192 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 225,234 ----
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
***************
*** 263,268 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 305,312 ----
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
***************
*** 370,375 **** SyncRepReleaseWaiters(void)
--- 414,420 ----
volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numdataflush = 0;
int priority = 0;
int i;
***************
*** 379,389 **** SyncRepReleaseWaiters(void)
* up, still running base backup or the current flush position is still
* invalid, then leave quickly also.
*/
! if (MyWalSnd->sync_standby_priority == 0 ||
! MyWalSnd->state < WALSNDSTATE_STREAMING ||
XLogRecPtrIsInvalid(MyWalSnd->flush))
return;
-
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
--- 424,433 ----
* up, still running base backup or the current flush position is still
* invalid, then leave quickly also.
*/
!
! if (MyWalSnd->state < WALSNDSTATE_STREAMING ||
XLogRecPtrIsInvalid(MyWalSnd->flush))
return;
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
***************
*** 399,405 **** SyncRepReleaseWaiters(void)
if (walsnd->pid != 0 &&
walsnd->state == WALSNDSTATE_STREAMING &&
- walsnd->sync_standby_priority > 0 &&
(priority == 0 ||
priority > walsnd->sync_standby_priority) &&
!XLogRecPtrIsInvalid(walsnd->flush))
--- 443,448 ----
***************
*** 428,449 **** SyncRepReleaseWaiters(void)
* Set the lsn first so that when we wake backends they will release up to
* this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
{
walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
--- 471,497 ----
* Set the lsn first so that when we wake backends they will release up to
* this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] <= MyWalSnd->write)
{
walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] <= MyWalSnd->flush)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
+ if (walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] <= MyWalSnd->flush)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] = MyWalSnd->flush;
+ numdataflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_DATA_FLUSH);
+ }
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
***************
*** 709,711 **** assign_synchronous_commit(int newval, void *extra)
--- 757,774 ----
break;
}
}
+
+ void
+ assign_synchronous_transfer(int newval, void *extra)
+ {
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_DATA_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+ }
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,47 ----
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+ #include "replication/syncrep.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
***************
*** 1975,1981 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
! XLogFlush(recptr);
/*
* Now it's safe to write buffer to disk. Note that no one else should
--- 1976,1982 ----
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
! XLogFlush(recptr, true, true);
/*
* Now it's safe to write buffer to disk. Note that no one else should
*** a/src/backend/utils/cache/relmapper.c
--- b/src/backend/utils/cache/relmapper.c
***************
*** 721,727 **** write_relmap_file(bool shared, RelMapFile *newmap,
lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE, rdata);
/* As always, WAL must hit the disk before the data update does */
! XLogFlush(lsn);
}
errno = 0;
--- 721,732 ----
lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE, rdata);
/* As always, WAL must hit the disk before the data update does */
! XLogFlush(lsn, true, false);
!
! /*
! * XXX Should we also wait for the failback safe standby to receive the
! * WAL ?
! */
}
errno = 0;
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 381,386 **** static const struct config_enum_entry synchronous_commit_options[] = {
--- 381,398 ----
};
/*
+ * Although only "all", "data_flush", and "commit" are documented, we
+ * also accept "0" as a synonym for "commit".
+ */
+ static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {NULL, 0, false}
+ };
+
+ /*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
***************
*** 3266,3271 **** static struct config_enum ConfigureNamesEnum[] =
--- 3278,3293 ----
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 220,225 ****
--- 220,227 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 62,67 ****
--- 62,68 ----
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+ #include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
***************
*** 118,123 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
--- 119,133 ----
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous_transfer is configured to data_flush or all, we should
+ * also check if the commit WAL record has made to the standby before
+ * allowing hint bit updates. We should not wait for the standby to receive
+ * the WAL since its OK to delay hint bit updates
+ */
+ if (!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
}
tuple->t_infomask |= infomask;
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,267 **** typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
! extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
! extern void XLogFlush(XLogRecPtr RecPtr, bool ForDataFlush, bool Wait);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 19,41 ****
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
/* SyncRepWaitMode */
! #define SYNC_REP_NO_WAIT -1
! #define SYNC_REP_WAIT_WRITE 0
! #define SYNC_REP_WAIT_FLUSH 1
! #define NUM_SYNC_REP_WAIT_MODE 2
/* syncRepState */
! #define SYNC_REP_NOT_WAITING 0
! #define SYNC_REP_WAITING 1
! #define SYNC_REP_WAIT_COMPLETE 2
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
/* called by user backend */
! extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
--- 19,60 ----
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+ #define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+ #define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
! #define SYNC_REP_NO_WAIT -1
! #define SYNC_REP_WAIT_WRITE 0
! #define SYNC_REP_WAIT_FLUSH 1
! #define SYNC_REP_WAIT_DATA_FLUSH 2
! #define NUM_SYNC_REP_WAIT_MODE 3
/* syncRepState */
! #define SYNC_REP_NOT_WAITING 0
! #define SYNC_REP_WAITING 1
! #define SYNC_REP_WAIT_COMPLETE 2
!
! typedef enum
! {
! SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for data page flush */
! SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only,
! * no wait for WAL at commit */
! SYNCHRONOUS_TRANSFER_ALL /* wait for both WAL and data page flush */
! } SynchronousTransferLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ /* user-settable parameters for failback safe replication */
+ extern int synchronous_transfer;
+
/* called by user backend */
! extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
! bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
***************
*** 52,56 **** extern int SyncRepWakeQueue(bool all, int mode);
--- 71,76 ----
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+ extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
On Sun, Jul 7, 2013 at 4:27 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Sun, Jul 7, 2013 at 4:19 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
I agree. We should probably find a better name for this. Any suggestions ?
err, I already made one...
But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).

Since commits are more foreground in nature and this feature
does not require us to wait during common foreground activities, we want a
configuration where master can wait for synchronous transfers at other than
commits. Maybe we can solve that by having more granular control to the said
parameter?

The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.

How would we then distinguish between synchronous and the new kind of
standby?

That's not the point. The point is "Why would we have a new kind of
standby?" and therefore why do we need new parameters?

I am told, one of the very popular setups for DR is to have one
local sync standby and one async (may be cascaded by the local sync). Since
this new feature is more useful for DR because taking a fresh backup on a
slower link is even more challenging, IMHO we should support such setups.

...which still doesn't make sense to me. Let's look at that in detail.
Take 3 servers, A, B, C with A and B being linked by sync rep, and C
being a safety standby at a distance.

Either A or B is master, except in disaster. So if A is master, then B
would be the failover target. If A fails, then you want to failover to
B. Once B is the target, you want to failback to A as the master. C
needs to follow the new master, whichever it is.

If you set up sync rep between A and B and this new mode between A and
C, then when B becomes the master, you need to fail back from B to A, but
you can't because the new mode applies between A and C only, so you
have to fail back from C to A. So having the new mode not match with
sync rep means you are forcing people to fail back using the slow link
in the common case.

You might observe that having the two modes match causes problems if A
and B fail, so you are forced to go to C as master and then eventually
failback to A or B across a slow link. That case is less common and
could be solved by extending sync transfer to more/multi nodes.

It definitely doesn't make sense to have sync rep on anything other
than a subset of sync transfer. So while it may be sensible in the
future to make sync transfer a superset of sync rep nodes, it makes
sense to make them the same config for now.

I have updated the patch.
we support following 2 cases.
1. SYNC server and also make same failback safe standby server
2. ASYNC server and also make same failback safe standby server

1. changed name of parameter
gave up the 'failback_safe_standby_names' parameter from the first patch,
and changed the name of the parameter from 'failback_safe_mode' to
'synchronous_transfer'.
this parameter accepts 'all', 'data_flush' and 'commit'.

- 'commit'
'commit' means that master waits for the corresponding WAL to be flushed
to disk of the standby server on commits,
but master doesn't wait for replicated data pages.

- 'data_flush'
'data_flush' means that master waits for replicated data pages
(e.g., CLOG, pg_control) before flushing them to disk of the master server.
but if user sets this parameter to 'data_flush',
the 'synchronous_commit' value is ignored even if user set
'synchronous_commit'.

- 'all'
'all' means that master waits for replicated WAL and data pages.

2. put SyncRepWaitForLSN() function into XLogFlush() function
we have put the SyncRepWaitForLSN() function into the XLogFlush() function,
and changed the arguments of XLogFlush().

these are the setup cases and the parameters that need to be set.
- SYNC server and also make same failback safe standby server (case 1)
synchronous_transfer = all
synchronous_commit = remote_write/on
synchronous_standby_names = <ServerName>

- ASYNC server and also make same failback safe standby server (case 2)
synchronous_transfer = data_flush
(synchronous_commit value is ignored)

- default SYNC replication
synchronous_transfer = commit
synchronous_commit = on
synchronous_standby_names = <ServerName>

- default ASYNC replication
synchronous_transfer = commit

ToDo
1. currently this patch supports synchronous transfer only, so we can't set
a different synchronous transfer mode for each server.
we need to improve the patch to support the following cases.
- SYNC standby and make separate ASYNC failback safe standby
- ASYNC standby and make separate ASYNC failback safe standby
2. we have not measured performance yet. we need to measure performance.

please give me your feedback.
Regards,
-------
Sawada Masahiko

I'm sorry. I forgot to attach the patch.
Please see the attached file.

Regards,
-------
Sawada Masahiko
I found a bug which occurred during vacuum, and have fixed it.
yesterday (8th July) the "Improve scalability of WAL insertions" patch was
committed to HEAD, so the v2 patch does not apply to HEAD now.
I have also fixed it to be applicable to HEAD.
please find the attached patch.
Regards,
-------
Sawada Masahiko
Attachments:
failback_safe_standby_v3.patch
*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 722,728 **** WriteTruncateXlogRec(int pageno)
rdata.buffer = InvalidBuffer;
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
! XLogFlush(recptr);
}
/*
--- 722,728 ----
rdata.buffer = InvalidBuffer;
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
! XLogFlush(recptr, true, false);
}
/*
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
***************
*** 696,704 **** SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
* XLogFlush were to fail, we must PANIC. This isn't much of a
* restriction because XLogFlush is just about all critical
* section anyway, but let's make sure.
*/
START_CRIT_SECTION();
! XLogFlush(max_lsn);
END_CRIT_SECTION();
}
}
--- 696,706 ----
* XLogFlush were to fail, we must PANIC. This isn't much of a
* restriction because XLogFlush is just about all critical
* section anyway, but let's make sure.
+ * Also wait for the synchronous standby to receive WAL upto
+ * max_lsn.
*/
START_CRIT_SECTION();
! XLogFlush(max_lsn, true, true);
END_CRIT_SECTION();
}
}
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1049,1055 **** EndPrepare(GlobalTransaction gxact)
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
! XLogFlush(gxact->prepare_lsn);
/* If we crash now, we have prepared: WAL replay will fix things */
--- 1049,1062 ----
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
!
! /*
! * Wait for synchronous replication, if required.
! *
! * Note that at this stage we have marked the prepare, but still show as
! * running in the procarray (twice!) and continue to hold locks.
! */
! XLogFlush(gxact->prepare_lsn, false, true);
/* If we crash now, we have prepared: WAL replay will fix things */
***************
*** 1090,1103 **** EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked the prepare, but still show as
- * running in the procarray (twice!) and continue to hold locks.
- */
- SyncRepWaitForLSN(gxact->prepare_lsn);
-
records.tail = records.head = NULL;
}
--- 1097,1102 ----
***************
*** 2046,2053 **** RecordTransactionCommitPrepared(TransactionId xid,
* a contradiction)
*/
! /* Flush XLOG to disk */
! XLogFlush(recptr);
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
--- 2045,2058 ----
* a contradiction)
*/
! /*
! * Flush XLOG to disk,
! * Wait for synchronous replication, if required.
! *
! * Note that at this stage we have marked clog, but still show as running
! * in the procarray and continue to hold locks.
! */
! XLogFlush(recptr, false, true);
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
***************
*** 2056,2069 **** RecordTransactionCommitPrepared(TransactionId xid,
MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
-
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked clog, but still show as running
- * in the procarray and continue to hold locks.
- */
- SyncRepWaitForLSN(recptr);
}
/*
--- 2061,2066 ----
***************
*** 2126,2133 **** RecordTransactionAbortPrepared(TransactionId xid,
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT_PREPARED, rdata);
! /* Always flush, since we're about to remove the 2PC state file */
! XLogFlush(recptr);
/*
* Mark the transaction aborted in clog. This is not absolutely necessary
--- 2123,2136 ----
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT_PREPARED, rdata);
! /*
! * Always flush, since we're about to remove the 2PC state file.
! * Wait for synchronous replication, if required.
! *
! * Note that at this stage we have marked clog, but still show as running
! * in the procarray and continue to hold locks.
! */
! XLogFlush(recptr, false, true);
/*
* Mark the transaction aborted in clog. This is not absolutely necessary
***************
*** 2136,2147 **** RecordTransactionAbortPrepared(TransactionId xid,
TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
-
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked clog, but still show as running
- * in the procarray and continue to hold locks.
- */
- SyncRepWaitForLSN(recptr);
}
--- 2139,2142 ----
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 965,970 **** RecordTransactionCommit(void)
--- 965,971 ----
SharedInvalidationMessage *invalMessages = NULL;
bool RelcacheInitFileInval = false;
bool wrote_xlog;
+ bool ret = false;
/* Get data needed for commit record */
nrels = smgrGetPendingDeletes(true, &rels);
***************
*** 1143,1150 **** RecordTransactionCommit(void)
if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
forceSyncCommit || nrels > 0)
{
! XLogFlush(XactLastRecEnd);
!
/*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
--- 1144,1151 ----
if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
forceSyncCommit || nrels > 0)
{
! XLogFlush(XactLastRecEnd, false, true);
! ret = true;
/*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
***************
*** 1195,1201 **** RecordTransactionCommit(void)
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
--- 1196,1202 ----
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd, false, !ret);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
***************
*** 4663,4669 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* after deletion, but that would leave a small window where the
* WAL-first rule would be violated.
*/
! XLogFlush(lsn);
for (i = 0; i < nrels; i++)
{
--- 4664,4670 ----
* after deletion, but that would leave a small window where the
* WAL-first rule would be violated.
*/
! XLogFlush(lsn, true, true);
for (i = 0; i < nrels; i++)
{
***************
*** 4690,4696 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
! XLogFlush(lsn);
}
--- 4691,4697 ----
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
! XLogFlush(lsn, true, false);
}
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1224,1230 **** begin:;
if (isLogSwitch)
{
TRACE_POSTGRESQL_XLOG_SWITCH();
! XLogFlush(EndPos);
/*
* Even though we reserved the rest of the segment for us, which is
* reflected in EndPos, we return a pointer to just the end of the
--- 1224,1230 ----
if (isLogSwitch)
{
TRACE_POSTGRESQL_XLOG_SWITCH();
! XLogFlush(EndPos, true, true);
/*
* Even though we reserved the rest of the segment for us, which is
* reflected in EndPos, we return a pointer to just the end of the
***************
*** 2996,3002 **** UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
* already held, and we try to avoid acquiring it if possible.
*/
void
! XLogFlush(XLogRecPtr record)
{
XLogRecPtr WriteRqstPtr;
XLogwrtRqst WriteRqst;
--- 2996,3002 ----
* already held, and we try to avoid acquiring it if possible.
*/
void
! XLogFlush(XLogRecPtr record, bool ForDataFlush, bool Wait)
{
XLogRecPtr WriteRqstPtr;
XLogwrtRqst WriteRqst;
***************
*** 3134,3139 **** XLogFlush(XLogRecPtr record)
--- 3134,3141 ----
/* wake up walsenders now that we've released heavily contended locks */
WalSndWakeupProcessRequests();
+ SyncRepWaitForLSN(WriteRqstPtr, ForDataFlush, Wait);
+
/*
* If we still haven't flushed to the request point then we have a
* problem; most likely, the requested flush point is past end of XLOG.
***************
*** 8230,8236 **** CreateCheckPoint(int flags)
XLOG_CHECKPOINT_ONLINE,
&rdata);
! XLogFlush(recptr);
/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will be
--- 8232,8243 ----
XLOG_CHECKPOINT_ONLINE,
&rdata);
! /*
! * At this point, ensure that the synchronous standby has received the
! * checkpoint WAL. Otherwise failure after the control file update will
! * cause the master to start from a location not known to the standby
! */
! XLogFlush(recptr, true, !shutdown);
/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will be
***************
*** 8387,8393 **** CreateEndOfRecoveryRecord(void)
recptr = XLogInsert(RM_XLOG_ID, XLOG_END_OF_RECOVERY, &rdata);
! XLogFlush(recptr);
/*
* Update the control file so that crash recovery can follow the timeline
--- 8394,8400 ----
recptr = XLogInsert(RM_XLOG_ID, XLOG_END_OF_RECOVERY, &rdata);
! XLogFlush(recptr, true, true);
/*
* Update the control file so that crash recovery can follow the timeline
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 285,293 **** RelationTruncate(Relation rel, BlockNumber nblocks)
* or visibility map. If we crashed during that window, we'd be left
* with a truncated heap, but the FSM or visibility map would still
* contain entries for the non-existent heap pages.
*/
if (fsm || vm)
! XLogFlush(lsn);
}
/* Do the real work */
--- 285,297 ----
* or visibility map. If we crashed during that window, we'd be left
* with a truncated heap, but the FSM or visibility map would still
* contain entries for the non-existent heap pages.
+ *
+ * Also ensure that the WAL is received by the synchronous standby.
+ * Otherwise, we may have a situation where the heap is truncated, but
+ * the action never replayed on the standby
*/
if (fsm || vm)
! XLogFlush(lsn, true, true);
}
/* Do the real work */
***************
*** 519,525 **** smgr_redo(XLogRecPtr lsn, XLogRecord *record)
* after truncation, but that would leave a small window where the
* WAL-first rule could be violated.
*/
! XLogFlush(lsn);
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
--- 523,529 ----
* after truncation, but that would leave a small window where the
* WAL-first rule could be violated.
*/
! XLogFlush(lsn, true, false);
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 65,70 **** char *SyncRepStandbyNames;
--- 65,72 ----
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+ static int SyncTransferMode = SYNC_REP_NO_WAIT;
+ int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
***************
*** 85,109 **** static bool SyncRepQueueIsOrderedByLSN(int mode);
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING before adding ourselves
! * to the wait queue. During SyncRepWakeQueue() a WALSender changes
! * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
! * This backend then resets its state to SYNC_REP_NOT_WAITING.
*/
! void
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
{
char *new_status = NULL;
const char *old_status;
! int mode = SyncRepWaitMode;
/*
* Fast exit if user has not requested sync replication, or there are no
* sync replication standby names defined. Note that those standbys don't
* need to be connected.
*/
! if (!SyncRepRequested() || !SyncStandbysDefined())
! return;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
--- 87,141 ----
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
! * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
! * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
! * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
! *
! * ForDataFlush - if TRUE, we wait for the flushing data page.
! * Otherwise wait for the sync standby
! *
! * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
! * the standby has already made progressed upto the given XactCommitLSN
! *
! * Return TRUE if either the sync standby is not
! * configured/turned off OR the standby has made enough progress
*/
! bool
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
! int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
! bool ret;
/*
* Fast exit if user has not requested sync replication, or there are no
* sync replication standby names defined. Note that those standbys don't
* need to be connected.
*/
! if ((!SyncRepRequested() || !SyncStandbysDefined()) &&
! !SyncTransRequested() && !ForDataFlush)
! return true;
!
! /*
! * If the caller has specified ForDataFlush, but synchronous transfer
! * is not specified or its turned off, exit.
! *
! * We would like to allow the failback safe mechanism even for cascaded
! * standbys as well. But we can't really wait for the standby to catch
! * up until we reach a consistent state since the standbys won't be
! * even able to connect without us reaching in that state (XXX Confirm)
! */
! if ((!SyncTransRequested()) && ForDataFlush)
! return true;
!
! /*
! * If the caller has not specified ForDataFlush, but synchronous commit
! * is skipped by values of synchronous_transfer, exit.
! */
! if (IsSyncRepSkipped() && !ForDataFlush)
! return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
***************
*** 118,129 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* Also check that the standby hasn't already replied. Unlikely race
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
*/
! if (!WalSndCtl->sync_standbys_defined ||
! XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return;
}
/*
--- 150,163 ----
* Also check that the standby hasn't already replied. Unlikely race
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
+ * And if we are told not to block on the standby, exit
*/
! if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
! XactCommitLSN <= WalSndCtl->lsn[mode] ||
! !Wait)
{
LWLockRelease(SyncRepLock);
! return true;
}
/*
***************
*** 150,155 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 184,191 ----
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
***************
*** 186,192 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 222,231 ----
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
***************
*** 263,268 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 302,309 ----
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
***************
*** 370,375 **** SyncRepReleaseWaiters(void)
--- 411,417 ----
volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numdataflush = 0;
int priority = 0;
int i;
***************
*** 379,389 **** SyncRepReleaseWaiters(void)
* up, still running base backup or the current flush position is still
* invalid, then leave quickly also.
*/
! if (MyWalSnd->sync_standby_priority == 0 ||
! MyWalSnd->state < WALSNDSTATE_STREAMING ||
XLogRecPtrIsInvalid(MyWalSnd->flush))
return;
-
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
--- 421,430 ----
* up, still running base backup or the current flush position is still
* invalid, then leave quickly also.
*/
!
! if (MyWalSnd->state < WALSNDSTATE_STREAMING ||
XLogRecPtrIsInvalid(MyWalSnd->flush))
return;
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
***************
*** 399,405 **** SyncRepReleaseWaiters(void)
if (walsnd->pid != 0 &&
walsnd->state == WALSNDSTATE_STREAMING &&
- walsnd->sync_standby_priority > 0 &&
(priority == 0 ||
priority > walsnd->sync_standby_priority) &&
!XLogRecPtrIsInvalid(walsnd->flush))
--- 440,445 ----
***************
*** 428,449 **** SyncRepReleaseWaiters(void)
* Set the lsn first so that when we wake backends they will release up to
* this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] < MyWalSnd->write)
{
walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] < MyWalSnd->flush)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
--- 468,494 ----
* Set the lsn first so that when we wake backends they will release up to
* this location.
*/
! if (walsndctl->lsn[SYNC_REP_WAIT_WRITE] <= MyWalSnd->write)
{
walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
}
! if (walsndctl->lsn[SYNC_REP_WAIT_FLUSH] <= MyWalSnd->flush)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
+ if (walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] <= MyWalSnd->flush)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] = MyWalSnd->flush;
+ numdataflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_DATA_FLUSH);
+ }
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
***************
*** 709,711 **** assign_synchronous_commit(int newval, void *extra)
--- 754,771 ----
break;
}
}
+
+ void
+ assign_synchronous_transfer(int newval, void *extra)
+ {
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_DATA_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+ }
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,47 ----
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+ #include "replication/syncrep.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
***************
*** 1975,1981 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
! XLogFlush(recptr);
/*
* Now it's safe to write buffer to disk. Note that no one else should
--- 1976,1982 ----
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
! XLogFlush(recptr, true, true);
/*
* Now it's safe to write buffer to disk. Note that no one else should
*** a/src/backend/utils/cache/relmapper.c
--- b/src/backend/utils/cache/relmapper.c
***************
*** 721,727 **** write_relmap_file(bool shared, RelMapFile *newmap,
lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE, rdata);
/* As always, WAL must hit the disk before the data update does */
! XLogFlush(lsn);
}
errno = 0;
--- 721,732 ----
lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE, rdata);
/* As always, WAL must hit the disk before the data update does */
! XLogFlush(lsn, true, false);
!
! /*
! * XXX Should we also wait for the failback safe standby to receive the
! * WAL ?
! */
}
errno = 0;
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 381,386 **** static const struct config_enum_entry synchronous_commit_options[] = {
--- 381,398 ----
};
/*
+ * Although only "all", "data_flush", and "commit" are documented, we
+ * accept all the likely variants of "off".
+ */
+ static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {NULL, 0, false}
+ };
+
+ /*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
***************
*** 3277,3282 **** static struct config_enum ConfigureNamesEnum[] =
--- 3289,3304 ----
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 220,225 ****
--- 220,227 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 62,67 ****
--- 62,68 ----
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+ #include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
***************
*** 118,123 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
--- 119,133 ----
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous_transfer is configured to data_flush or all, we should
+ * also check if the commit WAL record has made to the standby before
+ * allowing hint bit updates. We should not wait for the standby to receive
+ * the WAL since its OK to delay hint bit updates
+ */
+ if (!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
}
tuple->t_infomask |= infomask;
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 262,268 **** typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
! extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
--- 262,268 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
! extern void XLogFlush(XLogRecPtr RecPtr, bool ForDataFush, bool Wait);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 19,41 ****
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
/* SyncRepWaitMode */
! #define SYNC_REP_NO_WAIT -1
! #define SYNC_REP_WAIT_WRITE 0
! #define SYNC_REP_WAIT_FLUSH 1
! #define NUM_SYNC_REP_WAIT_MODE 2
/* syncRepState */
! #define SYNC_REP_NOT_WAITING 0
! #define SYNC_REP_WAITING 1
! #define SYNC_REP_WAIT_COMPLETE 2
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
/* called by user backend */
! extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
--- 19,60 ----
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+ #define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+ #define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
! #define SYNC_REP_NO_WAIT -1
! #define SYNC_REP_WAIT_WRITE 0
! #define SYNC_REP_WAIT_FLUSH 1
! #define SYNC_REP_WAIT_DATA_FLUSH 2
! #define NUM_SYNC_REP_WAIT_MODE 3
/* syncRepState */
! #define SYNC_REP_NOT_WAITING 0
! #define SYNC_REP_WAITING 1
! #define SYNC_REP_WAIT_COMPLETE 2
!
! typedef enum
! {
! SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for flush data page */
! SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only
! * no wait for WAL */
! SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush */
! } SynchronousTransferLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ /* user-settable parameters for failback safe replication */
+ extern int synchronous_transfer;
+
/* called by user backend */
! extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
! bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
***************
*** 52,56 **** extern int SyncRepWakeQueue(bool all, int mode);
--- 71,76 ----
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+ extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
On Tue, Jul 9, 2013 at 11:45 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Sun, Jul 7, 2013 at 4:27 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
I found a bug which occurred during vacuum, and have fixed it.
Yesterday (8th July) the "Improve scalability of WAL insertions" patch was
committed to HEAD, so the v2 patch no longer applies.
I have also rebased the patch against HEAD; please find it attached.
Regards,
-------
Sawada Masahiko
I have fixed the issue where the master server did not wait for the WAL to
be flushed to disk on the standby when it executes FlushBuffer().
Please find the attached v4 patch.
Regards,
-------
Sawada Masahiko
Attachments:
failback_safe_standby_v4.patch (application/octet-stream)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index cb95aa3..05ac0fa 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -722,7 +722,7 @@ WriteTruncateXlogRec(int pageno)
rdata.buffer = InvalidBuffer;
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
- XLogFlush(recptr);
+ XLogFlush(recptr, true, true);
}
/*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5a8f654..452af68 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -696,9 +696,11 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
* XLogFlush were to fail, we must PANIC. This isn't much of a
* restriction because XLogFlush is just about all critical
* section anyway, but let's make sure.
+	 * Also wait for the synchronous standby to receive WAL up to
+	 * max_lsn.
*/
START_CRIT_SECTION();
- XLogFlush(max_lsn);
+ XLogFlush(max_lsn, true, true);
END_CRIT_SECTION();
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e975f8d..0c235c9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1049,7 +1049,14 @@ EndPrepare(GlobalTransaction gxact)
gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
records.head);
- XLogFlush(gxact->prepare_lsn);
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked the prepare, but still show as
+ * running in the procarray (twice!) and continue to hold locks.
+ */
+ XLogFlush(gxact->prepare_lsn, false, true);
/* If we crash now, we have prepared: WAL replay will fix things */
@@ -1090,14 +1097,6 @@ EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked the prepare, but still show as
- * running in the procarray (twice!) and continue to hold locks.
- */
- SyncRepWaitForLSN(gxact->prepare_lsn);
-
records.tail = records.head = NULL;
}
@@ -2046,8 +2045,14 @@ RecordTransactionCommitPrepared(TransactionId xid,
* a contradiction)
*/
- /* Flush XLOG to disk */
- XLogFlush(recptr);
+ /*
+	 * Flush XLOG to disk and wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as running
+ * in the procarray and continue to hold locks.
+ */
+ XLogFlush(recptr, false, true);
/* Mark the transaction committed in pg_clog */
TransactionIdCommitTree(xid, nchildren, children);
@@ -2056,14 +2061,6 @@ RecordTransactionCommitPrepared(TransactionId xid,
MyPgXact->delayChkpt = false;
END_CRIT_SECTION();
-
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked clog, but still show as running
- * in the procarray and continue to hold locks.
- */
- SyncRepWaitForLSN(recptr);
}
/*
@@ -2126,8 +2123,14 @@ RecordTransactionAbortPrepared(TransactionId xid,
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT_PREPARED, rdata);
- /* Always flush, since we're about to remove the 2PC state file */
- XLogFlush(recptr);
+ /*
+ * Always flush, since we're about to remove the 2PC state file.
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as running
+ * in the procarray and continue to hold locks.
+ */
+ XLogFlush(recptr, false, true);
/*
* Mark the transaction aborted in clog. This is not absolutely necessary
@@ -2136,12 +2139,4 @@ RecordTransactionAbortPrepared(TransactionId xid,
TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
-
- /*
- * Wait for synchronous replication, if required.
- *
- * Note that at this stage we have marked clog, but still show as running
- * in the procarray and continue to hold locks.
- */
- SyncRepWaitForLSN(recptr);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 31e868d..9331742 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -965,6 +965,7 @@ RecordTransactionCommit(void)
SharedInvalidationMessage *invalMessages = NULL;
bool RelcacheInitFileInval = false;
bool wrote_xlog;
+ bool ret = false;
/* Get data needed for commit record */
nrels = smgrGetPendingDeletes(true, &rels);
@@ -1143,8 +1144,8 @@ RecordTransactionCommit(void)
if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
forceSyncCommit || nrels > 0)
{
- XLogFlush(XactLastRecEnd);
-
+ XLogFlush(XactLastRecEnd, false, true);
+ ret = true;
/*
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
@@ -1195,7 +1196,7 @@ RecordTransactionCommit(void)
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
- SyncRepWaitForLSN(XactLastRecEnd);
+ SyncRepWaitForLSN(XactLastRecEnd, false, !ret);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
@@ -4663,7 +4664,7 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* after deletion, but that would leave a small window where the
* WAL-first rule would be violated.
*/
- XLogFlush(lsn);
+ XLogFlush(lsn, true, false);
for (i = 0; i < nrels; i++)
{
@@ -4690,7 +4691,7 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
- XLogFlush(lsn);
+ XLogFlush(lsn, true, false);
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9e3a7a..5f6b9ba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1224,7 +1224,7 @@ begin:;
if (isLogSwitch)
{
TRACE_POSTGRESQL_XLOG_SWITCH();
- XLogFlush(EndPos);
+ XLogFlush(EndPos, true, true);
/*
* Even though we reserved the rest of the segment for us, which is
* reflected in EndPos, we return a pointer to just the end of the
@@ -2996,7 +2996,7 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
* already held, and we try to avoid acquiring it if possible.
*/
void
-XLogFlush(XLogRecPtr record)
+XLogFlush(XLogRecPtr record, bool ForDataFlush, bool Wait)
{
XLogRecPtr WriteRqstPtr;
XLogwrtRqst WriteRqst;
@@ -3134,6 +3134,8 @@ XLogFlush(XLogRecPtr record)
/* wake up walsenders now that we've released heavily contended locks */
WalSndWakeupProcessRequests();
+ SyncRepWaitForLSN(WriteRqstPtr, ForDataFlush, Wait);
+
/*
* If we still haven't flushed to the request point then we have a
* problem; most likely, the requested flush point is past end of XLOG.
@@ -8230,7 +8232,12 @@ CreateCheckPoint(int flags)
XLOG_CHECKPOINT_ONLINE,
&rdata);
- XLogFlush(recptr);
+ /*
+ * At this point, ensure that the synchronous standby has received the
+ * checkpoint WAL. Otherwise failure after the control file update will
+ * cause the master to start from a location not known to the standby
+ */
+ XLogFlush(recptr, true, !shutdown);
/*
* We mustn't write any new WAL after a shutdown checkpoint, or it will be
@@ -8387,7 +8394,7 @@ CreateEndOfRecoveryRecord(void)
recptr = XLogInsert(RM_XLOG_ID, XLOG_END_OF_RECOVERY, &rdata);
- XLogFlush(recptr);
+ XLogFlush(recptr, true, true);
/*
* Update the control file so that crash recovery can follow the timeline
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 971a149..a77471b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -285,9 +285,13 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
* or visibility map. If we crashed during that window, we'd be left
* with a truncated heap, but the FSM or visibility map would still
* contain entries for the non-existent heap pages.
+ *
+ * Also ensure that the WAL is received by the synchronous standby.
+ * Otherwise, we may have a situation where the heap is truncated, but
+	 * the action is never replayed on the standby.
*/
if (fsm || vm)
- XLogFlush(lsn);
+ XLogFlush(lsn, true, true);
}
/* Do the real work */
@@ -519,7 +523,7 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
* after truncation, but that would leave a small window where the
* WAL-first rule could be violated.
*/
- XLogFlush(lsn);
+ XLogFlush(lsn, true, false);
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 5424281..6a58350 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -65,6 +65,8 @@ char *SyncRepStandbyNames;
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+static int SyncTransferMode = SYNC_REP_NO_WAIT;
+int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
@@ -85,25 +87,55 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
* Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
- * change that state to SYNC_REP_WAITING before adding ourselves
- * to the wait queue. During SyncRepWakeQueue() a WALSender changes
- * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
- * This backend then resets its state to SYNC_REP_NOT_WAITING.
+ * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
+ * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
+ * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
+ * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
+ *
+ * ForDataFlush - if TRUE, we are waiting on behalf of a data page flush;
+ * otherwise we are waiting for synchronous commit replication
+ *
+ * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
+ * the standby has already made progress up to the given XactCommitLSN
+ *
+ * Return TRUE if either the sync standby is not configured (or is turned
+ * off) OR the standby has made enough progress
*/
-void
-SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+bool
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
- int mode = SyncRepWaitMode;
+ int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
+ bool ret;
/*
* Fast exit if user has not requested sync replication, or there are no
* sync replication standby names defined. Note that those standbys don't
* need to be connected.
*/
- if (!SyncRepRequested() || !SyncStandbysDefined())
- return;
+ if ((!SyncRepRequested() || !SyncStandbysDefined()) &&
+ !SyncTransRequested() && !ForDataFlush)
+ return true;
+
+ /*
+	 * If the caller has specified ForDataFlush, but synchronous transfer
+	 * is not requested or is turned off, exit.
+ *
+ * We would like to allow the failback safe mechanism even for cascaded
+ * standbys as well. But we can't really wait for the standby to catch
+	 * up until we reach a consistent state, since the standbys won't even
+	 * be able to connect before we reach that state (XXX Confirm)
+ */
+ if ((!SyncTransRequested()) && ForDataFlush)
+ return true;
+
+ /*
+ * If the caller has not specified ForDataFlush, but synchronous commit
+ * is skipped by values of synchronous_transfer, exit.
+ */
+ if (IsSyncRepSkipped() && !ForDataFlush)
+ return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
@@ -118,12 +150,14 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* Also check that the standby hasn't already replied. Unlikely race
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
+ * And if we are told not to block on the standby, exit
*/
- if (!WalSndCtl->sync_standbys_defined ||
- XactCommitLSN <= WalSndCtl->lsn[mode])
+ if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
+ XactCommitLSN <= WalSndCtl->lsn[mode] ||
+ !Wait)
{
LWLockRelease(SyncRepLock);
- return;
+ return true;
}
/*
@@ -150,6 +184,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
@@ -186,7 +222,10 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
@@ -263,6 +302,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
@@ -370,6 +411,7 @@ SyncRepReleaseWaiters(void)
volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numdataflush = 0;
int priority = 0;
int i;
@@ -379,11 +421,10 @@ SyncRepReleaseWaiters(void)
* up, still running base backup or the current flush position is still
* invalid, then leave quickly also.
*/
- if (MyWalSnd->sync_standby_priority == 0 ||
- MyWalSnd->state < WALSNDSTATE_STREAMING ||
+
+ if (MyWalSnd->state < WALSNDSTATE_STREAMING ||
XLogRecPtrIsInvalid(MyWalSnd->flush))
return;
-
/*
* We're a potential sync standby. Release waiters if we are the highest
* priority standby. If there are multiple standbys with same priorities
@@ -399,7 +440,6 @@ SyncRepReleaseWaiters(void)
if (walsnd->pid != 0 &&
walsnd->state == WALSNDSTATE_STREAMING &&
- walsnd->sync_standby_priority > 0 &&
(priority == 0 ||
priority > walsnd->sync_standby_priority) &&
!XLogRecPtrIsInvalid(walsnd->flush))
@@ -438,12 +478,17 @@ SyncRepReleaseWaiters(void)
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
+ if (walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] < MyWalSnd->flush)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] = MyWalSnd->flush;
+ numdataflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_DATA_FLUSH);
+ }
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
- numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+ numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
@@ -709,3 +754,18 @@ assign_synchronous_commit(int newval, void *extra)
break;
}
}
+
+void
+assign_synchronous_transfer(int newval, void *extra)
+{
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_DATA_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8079226..fc603d7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -41,6 +41,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -1975,7 +1976,7 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
- XLogFlush(recptr);
+ XLogFlush(recptr, true, false);
/*
* Now it's safe to write buffer to disk. Note that no one else should
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 2c7d9f3..13eb2df 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -721,7 +721,12 @@ write_relmap_file(bool shared, RelMapFile *newmap,
lsn = XLogInsert(RM_RELMAP_ID, XLOG_RELMAP_UPDATE, rdata);
/* As always, WAL must hit the disk before the data update does */
- XLogFlush(lsn);
+ XLogFlush(lsn, true, true);
+
+ /*
+ * XXX Should we also wait for the failback safe standby to receive the
+ * WAL ?
+ */
}
errno = 0;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5aefd1b..69568be 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -381,6 +381,18 @@ static const struct config_enum_entry synchronous_commit_options[] = {
};
/*
+ * Although only "all", "data_flush", and "commit" are documented, we
+ * accept all the likely variants of "off".
+ */
+static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {NULL, 0, false}
+};
+
+/*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
@@ -3277,6 +3289,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..d6603c2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -220,6 +220,8 @@
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+#synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 55563ea..c081ee0 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -62,6 +62,7 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
@@ -118,6 +119,15 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous_transfer is configured to data_flush or all, we should
+	 * also check whether the commit WAL record has made it to the standby
+	 * before allowing hint bit updates. We should not wait for the standby
+	 * to receive the WAL, since it's OK to delay hint bit updates
+ */
+ if (!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
}
tuple->t_infomask |= infomask;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 002862c..68072d2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -262,7 +262,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
-extern void XLogFlush(XLogRecPtr RecPtr);
+extern void XLogFlush(XLogRecPtr RecPtr, bool ForDataFush, bool Wait);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ac23ea6..4540625 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -19,23 +19,42 @@
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+#define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+#define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
-#define SYNC_REP_NO_WAIT -1
-#define SYNC_REP_WAIT_WRITE 0
-#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_NO_WAIT -1
+#define SYNC_REP_WAIT_WRITE 0
+#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_WAIT_DATA_FLUSH 2
-#define NUM_SYNC_REP_WAIT_MODE 2
+#define NUM_SYNC_REP_WAIT_MODE 3
/* syncRepState */
-#define SYNC_REP_NOT_WAITING 0
-#define SYNC_REP_WAITING 1
-#define SYNC_REP_WAIT_COMPLETE 2
+#define SYNC_REP_NOT_WAITING 0
+#define SYNC_REP_WAITING 1
+#define SYNC_REP_WAIT_COMPLETE 2
+
+typedef enum
+{
+	SYNCHRONOUS_TRANSFER_COMMIT,		/* wait at transaction commit only */
+	SYNCHRONOUS_TRANSFER_DATA_FLUSH,	/* wait before data page flush only,
+						 * not at commit */
+	SYNCHRONOUS_TRANSFER_ALL		/* wait at commit and before data page flush */
+} SynchronousTransferLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+/* user-settable parameters for failback safe replication */
+extern int synchronous_transfer;
+
/* called by user backend */
-extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
+ bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
@@ -52,5 +71,6 @@ extern int SyncRepWakeQueue(bool all, int mode);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
ToDo
1. Currently this patch supports only a single synchronous_transfer
setting, so we can't set a different synchronous transfer mode for each
server. We need to improve the patch to support the following cases:
- a SYNC standby plus a separate ASYNC failback-safe standby
- an ASYNC standby plus a separate ASYNC failback-safe standby
2. We have not measured performance yet; we need to measure performance.
Here are the test results, showing the performance overhead of the
patch (failback_safe_standby_v4.patch).
Tests were carried out in two different scenarios:
1. Tests with fast transaction workloads.
2. Tests with large loads.
Test Type-1: Tests with pgbench (*tests with fast transaction workloads*)
Notes:
1. These tests measure the performance overhead caused by the patch for
fast transaction workloads.
2. Tests are performed with the pgbench benchmark; the performance
measurement factor is the TPS value.
3. Values represent the TPS for 4 runs, and the last value is the average
of all the runs.
Settings for tests:
transaction type: TPC-B (sort of)
scaling factor: 300
query mode: simple
number of clients: 150
number of threads: 1
duration: 1800 s
Analysis of results:
1) Synchronous replication: (753.06, 748.81, 748.38, 747.21, Avg: 747.2)
2) Synchronous replication + failsafe standby (commit): (729.13, 724.33,
713.59, 710.79, Avg: 719.46)
3) Synchronous replication + failsafe standby (all): (692.08, 688.08,
711.23, 711.62, Avg: 700.75)
4) Asynchronous replication: (1008.42, 993.39, 986.80, 1028.46,
Avg: 1004.26)
5) Asynchronous replication + failsafe standby (commit): (974.49,
978.60, 969.11, 957.18, Avg: 969.84)
6) Asynchronous replication + failsafe standby (data_flush):
(1011.79, 992.05, 1030.20, 940.50, Avg: 993.63)
In the above test results the performance numbers are very close to each
other and show some variation due to noise, so the following is an
approximate conclusion about the overhead of the patch:
1. Streaming replication + synchronous_transfer (all, data_flush):
a) On average, synchronous replication combined with
synchronous_transfer = all causes 6.21% performance overhead.
b) On average, asynchronous streaming replication combined with
synchronous_transfer = data_flush causes 1.05% performance overhead.
2. Streaming replication + synchronous_transfer (commit):
a) On average, synchronous replication combined with
synchronous_transfer = commit causes 3.71% performance overhead.
b) On average, asynchronous streaming replication combined with
synchronous_transfer = commit causes 3.42% performance overhead.
Test Type-2: Tests with pgbench -i (*tests with large loads*)
Notes:
1. These tests measure the performance overhead caused by the patch for
large loads and index builds.
2. Tests are performed with pgbench -i (initialization of test data,
i.e. the time taken to create the pgbench tables, insert tuples, and
build primary keys).
3. The performance measurement factor is the wall clock time of
pgbench -i (measured with the time command).
4. Values represent the wall clock time for 4 runs, and the last value
is the average of all the runs.
pgbench settings:
Scale factor: 300 (database size: 4.3873 GB)
Test results:
1) Synchronous replication: (126.98, 133.83, 127.77, 129.70,
Avg: 129.57) seconds
2) Synchronous replication + synchronous_transfer (commit): (132.87,
125.85, 133.91, 134.61, Avg: 131.81) seconds
3) Synchronous replication + synchronous_transfer (all): (133.59,
132.82, 134.20, 135.22, Avg: 133.95) seconds
4) Asynchronous replication: (126.75, 136.95, 130.42, 127.77,
Avg: 130.47) seconds
5) Asynchronous replication + synchronous_transfer (commit): (128.13,
133.06, 127.62, 130.70, Avg: 129.87) seconds
6) Asynchronous replication + synchronous_transfer (data_flush):
(134.55, 139.90, 144.47, 143.85, Avg: 140.69) seconds
In the above test results the performance numbers are very close to each
other and show some variation due to noise, so the following is an
approximate conclusion about the overhead of the patch:
1. Streaming replication + synchronous_transfer (all, data_flush):
a) On average, synchronous replication combined with
synchronous_transfer = all causes 3.38% performance overhead.
b) On average, asynchronous streaming replication combined with
synchronous_transfer = data_flush causes 7.83% performance overhead.
2. Streaming replication + synchronous_transfer (commit):
a) On average, synchronous replication combined with
synchronous_transfer = commit causes 1.72% performance overhead.
b) On average, asynchronous streaming replication combined with
synchronous_transfer = commit causes -0.45% performance overhead.
The test results for both cases (large loads and fast transactions) show
variation because of noise, but we can observe that the patch causes
approximately 3-4% performance overhead.
Regards,
Samrat Revgade
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, 2013-07-11 at 23:42 +0900, Sawada Masahiko wrote:
please find the attached patch.
Please fix these compiler warnings:
xlog.c:3117:2: warning: implicit declaration of function ‘SyncRepWaitForLSN’ [-Wimplicit-function-declaration]
syncrep.c:414:6: warning: variable ‘numdataflush’ set but not used [-Wunused-but-set-variable]
On Sat, Aug 24, 2013 at 11:38 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
Thank you for the information!
We are improving the patch for CommitFest 2 now.
We will fix the above compiler warnings as soon as possible and submit the
updated patch.
--
Regards,
-------
Sawada Masahiko
The attached *synchronous_transfer_v5.patch* implements the review comments
from CommitFest 1 and reduces the performance overhead of synchronous_transfer.
*synchronous_transfer_documentation_v1.patch* adds the failback-safe standby
mechanism to the PostgreSQL documentation.
Sawada-san worked very hard to get this done; most of the work is his. I
really appreciate his efforts :)
**A brief description of the suggestions from CommitFest 1 is as follows:
1) "Failback-safe standby" is not an appropriate name [done: changed it to
synchronous_transfer].
2) Remove the extra set of postgresql.conf parameters [done: there is now
only one additional postgresql.conf parameter, *synchronous_transfer*,
which controls the synchronous nature of WAL transfer].
3) Performance overhead measurement [done: with fast transaction workloads
and with large loads/index builds].
4) Put SyncRepWaitForLSN inside XLogFlush [not the correct way:
SyncRepWaitForLSN would then run inside a critical section, and since it
does network I/O and may sleep, any error inside a critical section leads
to a server PANIC and restart].
5) Old master's WAL ahead of new master's WAL [we overcome this by deleting
all WAL files of the old master; details can be found here:
https://wiki.postgresql.org/wiki/Synchronous_Transfer]
**Changes to postgresql.conf to configure a failback-safe standby:
1) Synchronous failback-safe standby
synchronous_standby_names = <server name>
synchronous_transfer = all
2) Asynchronous failback-safe standby
synchronous_standby_names = <server name>
synchronous_transfer = data_flush
3) Pure synchronous standby
synchronous_standby_names = <server name>
synchronous_transfer = commit
4) Pure asynchronous standby
synchronous_transfer = commit
**Restriction:
If multiple standby servers connect to the master, then the standby with
synchronous replication becomes the failback-safe standby.
For example: if there are two standby servers connected to the master (one
SYNC, the other ASYNC) and synchronous_transfer is set to 'all', then the
SYNC standby becomes the failback-safe standby and the master server will
wait only for the SYNC standby server.
**Performance overhead of the synchronous_transfer patch:
Tests are performed with the pgbench benchmark with the following
configuration options:
Transaction type: TPC-B (sort of)
Scaling factor: 300
Query mode: simple
Number of clients: 150
Number of threads: 1
Duration: 1800 s
Real-world scenarios are mostly based on fast transaction workloads, for
which synchronous_transfer has negligible overhead.
**1. Test for fast transaction workloads [measured w.r.t. default
replication in PostgreSQL, pgbench benchmark, TPS value]:
a. Average performance overhead caused by a synchronous standby: 0.0102%.
b. Average performance overhead caused by a synchronous failback-safe
standby: 0.2943%.
c. Average performance overhead caused by an asynchronous
standby: 0.04321%.
d. Average performance overhead caused by an asynchronous failback-safe
standby: 0.5141%.
**2. Test for large loads and index builds [measured w.r.t. default
replication in PostgreSQL, pgbench benchmark (-i option), time in seconds]:
a. Average performance overhead caused by a synchronous standby: 3.51%.
b. Average performance overhead caused by a synchronous failback-safe
standby: 14.88%.
c. Average performance overhead caused by an asynchronous
standby: 0.4887%.
d. Average performance overhead caused by an asynchronous failback-safe
standby: 10.19%.
**TO-DO:
More discussion is needed regarding the usefulness, need, and priority of
the following; any feedback is appreciated:
1. Support for multiple failback-safe standbys.
2. The current design of the patch will wait forever for the failback-safe
standby, like streaming replication.
3. Support for a cascaded failback-safe standby.
---
Regards,
Samrat Revagade
Attachments:
synchronous_transfer_documentation_v1.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 23ebc11..0b8d614 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1749,6 +1749,50 @@ include 'filename'
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">
+ <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_transfer</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the synchronous nature of WAL transfer and
+ maintains file system level consistency between master server and
+ standby server. Specifies whether master server will wait for file
+ system level change (for example : modifying data page) before
+ corresponding WAL records are replicated to the standby server.
+ </para>
+ <para>
+ Valid values are <literal>commit</>, <literal>data_flush</> and
+ <literal>all</>. The default value is <literal>commit</>, meaning
+ that the master only waits for transaction commits; this is equivalent
+ to turning off <literal>synchronous_transfer</> parameter and standby
+ server will behave as a <quote>synchronous standby </> in
+ Streaming Replication. When <literal>data_flush</>, master will
+ wait only for data page modifications but not for transaction
+ commits, hence the standby server will act as <quote>asynchronous
+ failback safe standby</>. When <literal> all</>, master will wait
+ for data page modifications as well as for transaction commits and
+ resultant standby server will act as <quote>synchronous failback safe
+ standby</>, to configure synchronous failback safe standby
+ <xref linkend="guc-synchronous-standby-names"> should be set.
+ </para>
+ <para>
+ Setting this parameter to <literal> commit</> will configure pure
+ Streaming Replication, on the other hand setting to <literal>
+ data_flush </> will make WAL transfer synchronous except transaction
+ commits. All WAL transfer can be made synchronous by setting this
+ parameter to <literal>all</> value.
+ </para>
+ <para>
+ Setting <literal>synchronous_transfer</> to <literal>data_flush</> or
+ <literal>all</> makes WAL transfer synchronous, but this wait is mostly
+ on background activities such as bgwriter. Hence this will not create
+ much performance overhead.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-sync-method" xreflabel="wal_sync_method">
<term><varname>wal_sync_method</varname> (<type>enum</type>)</term>
<indexterm>
@@ -2258,14 +2302,25 @@ include 'filename'
</indexterm>
<listitem>
<para>
- Specifies a comma-separated list of standby names that can support
- <firstterm>synchronous replication</>, as described in
- <xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this list
- that is both currently connected and streaming data in real-time
+ Specifies a comma-separated list of standby names. If this parameter
+ is set then standby will behave as synchronous standby in replication,
+ as described in <xref linkend="synchronous-replication"> or synchronous
+ failback safe standby, as described in <xref linkend="failback-safe">.
+ At any time there will be at most one active standby; when standby is
+ synchronous standby in replication, transactions waiting for commit
+ will be allowed to proceed after this standby server confirms receipt
+ of their data. But when standby is synchronous failback safe standby
+ data page modifications as well as transaction commits will be allowed
+ to proceed only after this standby server confirms receipt of their data.
+ If this parameter is set to empty value and
+ <xref linkend="guc-synchronous-transfer"> is set to <literal>data_flush</>
+ then the standby is called an asynchronous failback safe standby and only
+ data page modifications will wait before corresponding WAL record is
+ replicated to standby.
+ </para>
+ <para>
+ Synchronous standby in replication will be the first standby named in
+ this list that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c8f6fa8..b2b42be 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1140,6 +1140,70 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+
+ <sect2 id="failback-safe">
+ <title>Setting up failback safe standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Setting up failback safe standby</primary>
+ </indexterm>
+
+ <para>
+ PostgreSQL Streaming Replication offers durability, but if master
+ crashes and particular WAL record is unable to reach to standby
+ server, then that WAL record is present on master server but not
+ on standby server. In such a case master is ahead of standby server
+ in term of WAL records and Data in database. This will lead to
+ file-system level inconsistency between master and standby server.
+ </para>
+
+ <para>
+ Due to this inconsistency fresh backup of new master onto new standby
+ is needed to re-prepare HA cluster. Taking fresh backup can be very
+ time consuming process when database is of large size. In such a case
+ disaster recovery can take very long time if Streaming Replication is
+ used to setup the high availability cluster. The reason for this is,
+ Synchronous Replication makes WAL transfer synchronous at the time of
+ transaction commit. This will ensure the durability until the HA cluster
+ is up, but there are certain cases such as heap page update which will
+ cause inconsistency at the time of disaster if the standby fails to
+ receive WAL record corresponding to this heap page update.
+ </para>
+
+ <para>
+ If HA cluster is configured with failback safe standby then master will
+ wait for data page modifications before corresponding WAL record is replicated
+ to standby. Failback safe standby has a control over all WAL transfer
+ and will not make any file system level change until it gets confirmation
+ from standby server. Hence avoids the need of fresh backup by maintaining
+ consistency.
+ </para>
+
+ <sect3 id="Failback-safe-config">
+ <title>Basic Configuration</title>
+ <para>
+ Failback safe standby can be asynchronous or synchronous in nature.
+ This will depend upon whether master will wait for transaction commit
+ or not. By default failback safe mechanism is turned off.
+ </para>
+
+ <para>
+ The first step to configure HA with failback safe standby is to setup
+ streaming replication. Configuring synchronous failback safe standby
+ requires setting up <xref linkend="guc-synchronous-transfer"> to
+ <literal>all</> and <xref linkend="guc-synchronous-standby-names">
+ must be set to a non-empty value. This configuration will cause each
+ commit and data page modification to wait for confirmation that standby
+ has written corresponding WAL record to durable storage. Configuring
+ asynchronous failback safe standby requires only setting up
+ <xref linkend="guc-synchronous-transfer"> to <literal> data_flush</>.
+ This configuration will cause only data page modifications to wait
+ for confirmation that standby has written corresponding WAL record
+ to durable storage.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1201,12 +1265,28 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</para>
<para>
- So, switching from primary to standby server can be fast but requires
- some time to re-prepare the failover cluster. Regular switching from
- primary to standby is useful, since it allows regular downtime on
- each system for maintenance. This also serves as a test of the
- failover mechanism to ensure that it will really work when you need it.
- Written administration procedures are advised.
+ At the time of failover there is a possibility of file-system level
+ inconsistency between old primary and old standby server hence
+ fresh backup from the new master onto the old master is needed for configuring
+ former primary server as a new standby server. Without taking fresh
+ backup even if the new standby starts, streaming replication does not
+ start successfully. The activity of taking backup can be fast for small
+ database but for large database requires more time to re-prepare the
+ failover cluster and could break the service line agreement of crash
+ recovery. This situation can arise when HA cluster is configured through
+ Streaming Replication. The need of fresh backup and problem of long
+ recovery time can be solved by using if HA cluster is configured with
+ failback safe standby see <xref linkend="failback-safe">.
+ Failback safe standby makes WAL transfer synchronous at required places
+ and maintains the file-system level consistency between master and standby
+ server and the former primary can be easily configured as new standby server.
+ </para>
+
+ <para>
+ Regular switching from primary to standby is useful, since it allows
+ regular downtime on each system for maintenance. This also serves as
+ a test of the failover mechanism to ensure that it will really work
+ when you need it. Written administration procedures are advised.
</para>
<para>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 7868fe4..ec3bb53 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1557,6 +1557,14 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
corruption) in case of a crash of the <emphasis>database</> alone.
</para>
</listitem>
+
+ <listitem>
+ <para>
+ Set <xref linkend="guc-synchronous-transfer"> to commit; there is no
+ need to guard against database inconsistency between master and standby,
+ and it is feasible to take fresh backup at failback time.
+ </para>
+ </listitem>
</itemizedlist>
</para>
</sect1>
synchronous_transfer_v5.patch (application/octet-stream)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index cb95aa3..d216b2e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -37,6 +37,8 @@
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -708,8 +710,10 @@ WriteZeroPageXlogRec(int pageno)
/*
* Write a TRUNCATE xlog record
*
- * We must flush the xlog record to disk before returning --- see notes
- * in TruncateCLOG().
+ * Before returning we must flush the xlog record to disk
+ * and if synchronous transfer is requested wait for failback
+ * safe standby to receive WAL up to recptr.
+ * --- see notes in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
@@ -723,6 +727,12 @@ WriteTruncateXlogRec(int pageno)
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /*
+ * Wait for failback safe standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
}
/*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5e53593..edaee83 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,6 +54,8 @@
#include "access/slru.h"
#include "access/transam.h"
#include "access/xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
@@ -744,6 +746,12 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /* If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to max_lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(max_lsn, true, true);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e975f8d..38a9e9c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1091,12 +1091,12 @@ EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
- SyncRepWaitForLSN(gxact->prepare_lsn);
+ SyncRepWaitForLSN(gxact->prepare_lsn, false, true);
records.tail = records.head = NULL;
}
@@ -2058,12 +2058,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
/*
@@ -2138,10 +2138,10 @@ RecordTransactionAbortPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0591f3f..25210df 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1189,13 +1189,13 @@ RecordTransactionCommit(void)
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
- SyncRepWaitForLSN(XactLastRecEnd);
+ SyncRepWaitForLSN(XactLastRecEnd, false, true);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
@@ -4690,8 +4690,17 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
+ {
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn,
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
+ }
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dc47c47..e8e118c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -39,8 +39,10 @@
#include "pgstat.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/barrier.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -8278,6 +8280,18 @@ CreateCheckPoint(int flags)
END_CRIT_SECTION();
/*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to checkpoint WAL record. Otherwise if failure occurs
+ * before standby receives CHECKPOINT WAL record causes an inconsistency
+ * between control files of master and standby. Because of this master will
+ * start from a location which is not known to the standby at the time of fail-over.
+ *
+ * There is no need to wait for shutdown CHECKPOINT.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, !shutdown);
+
+ /*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 971a149..050a6ba 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -25,6 +25,8 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -288,6 +290,14 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn. Otherwise, we may have a situation where
+ * the heap is truncated, but the action never replayed on the standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/* Do the real work */
@@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
*/
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 8cf1346..f5cc21c 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -66,6 +66,8 @@ char *SyncRepStandbyNames;
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+static int SyncTransferMode = SYNC_REP_NO_WAIT;
+int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
@@ -83,20 +85,30 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
- * Wait for synchronous replication, if requested by user.
+ * Wait for synchronous/failback safe standby, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
- * change that state to SYNC_REP_WAITING before adding ourselves
- * to the wait queue. During SyncRepWakeQueue() a WALSender changes
- * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
- * This backend then resets its state to SYNC_REP_NOT_WAITING.
+ * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
+ * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
+ * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
+ * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
+ *
+ * ForDataFlush - if TRUE, we wait before flushing data page.
+ * Otherwise wait for the sync standby
+ *
+ * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
+ * the standby has already progressed up to the given XactCommitLSN
+ *
+ * Return TRUE if either the synchronous standby/failback safe standby is not
+ * configured/turned off OR the standby has made enough progress
*/
-void
-SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+bool
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
- int mode = SyncRepWaitMode;
+ int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
+ bool ret;
/*
* Fast exit if user has not requested sync replication, or there are no
@@ -104,7 +116,26 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* need to be connected.
*/
if (!SyncRepRequested() || !SyncStandbysDefined())
- return;
+ return true;
+
+ /*
+ * If the caller has specified ForDataFlush, but synchronous transfer
+ * is not specified or its turned off, exit.
+ *
+ * We would like to allow the failback safe mechanism even for cascaded
+ * standbys as well. But we can't really wait for the standby to catch
+ * up until we reach a consistent state since the standbys won't be
+ * even able to connect without us reaching in that state (XXX Confirm)
+ */
+ if ((!SyncTransRequested()) && ForDataFlush)
+ return true;
+
+ /*
+ * If the caller has not specified ForDataFlush, but synchronous commit
+ * is skipped by values of synchronous_transfer, exit.
+ */
+ if (IsSyncRepSkipped() && !ForDataFlush)
+ return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
@@ -120,11 +151,20 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so it's likely to
* be a low cost check.
*/
- if (!WalSndCtl->sync_standbys_defined ||
+ if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
- return;
+ return true;
+ }
+
+ /*
+ * Exit if we are told not to block on the standby.
+ */
+ if (!Wait)
+ {
+ LWLockRelease(SyncRepLock);
+ return false;
}
/*
@@ -151,6 +191,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
@@ -187,7 +229,10 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
@@ -264,6 +309,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
@@ -371,6 +418,7 @@ SyncRepReleaseWaiters(void)
volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numdataflush = 0;
int priority = 0;
int i;
@@ -438,13 +486,20 @@ SyncRepReleaseWaiters(void)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] < MyWalSnd->flush)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] = MyWalSnd->flush;
+ numdataflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_DATA_FLUSH);
+
}
LWLockRelease(SyncRepLock);
elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
- numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+ numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
@@ -710,3 +765,18 @@ assign_synchronous_commit(int newval, void *extra)
break;
}
}
+
+void
+assign_synchronous_transfer(int newval, void *extra)
+{
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_DATA_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index afd559d..492e039 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1539,6 +1539,10 @@ XLogSend(bool *caughtup)
*caughtup = true;
+ elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= sentPtr(%X/%X) AND sendTImeLine",
+ (uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
+ (uint32) (sentPtr >> 32), (uint32) sentPtr);
+
elog(DEBUG1, "walsender reached end of timeline at %X/%X (sent up to %X/%X)",
(uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
(uint32) (sentPtr >> 32), (uint32) sentPtr);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f848391..7a2e285 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -41,6 +41,8 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -1975,8 +1977,14 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
+ {
XLogFlush(recptr);
-
+ /* If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to recptr.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
+ }
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 18f0342..e92b607 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -48,6 +48,8 @@
#include "catalog/pg_tablespace.h"
#include "catalog/storage.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/lwlock.h"
#include "utils/inval.h"
@@ -711,6 +713,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
int fd;
RelMapFile *realmap;
char mapfilename[MAXPGPATH];
+ XLogRecPtr lsn=InvalidXLogRecPtr;
/*
* Fill in the overhead fields and update CRC.
@@ -753,7 +756,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
{
xl_relmap_update xlrec;
XLogRecData rdata[2];
- XLogRecPtr lsn;
/* now errors are fatal ... */
START_CRIT_SECTION();
@@ -775,6 +777,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* As always, WAL must hit the disk before the data update does */
XLogFlush(lsn);
+
}
errno = 0;
@@ -849,6 +852,13 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to recptr.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7d297bc..7e226a5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -381,6 +381,18 @@ static const struct config_enum_entry synchronous_commit_options[] = {
};
/*
+ * Although only "all", "data_flush", and "commit" are documented, we
+ * accept all the likely variants of "off".
+ */
+static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {NULL, 0, false}
+};
+
+/*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
@@ -3288,6 +3300,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..d6603c2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -220,6 +220,8 @@
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+#synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index ed66c49..6cf3f26 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -60,6 +60,8 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
@@ -115,6 +117,18 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous transfer is requested, we check if the commit WAL record
+ * has made to the standby before allowing hint bit updates. We should not
+ * wait for the standby to receive the WAL since its OK to delay hint bit
+ * updates.
+ */
+ if (SyncTransRequested())
+ {
+ if(!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
+ }
}
tuple->t_infomask |= infomask;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ac23ea6..4540625 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -19,23 +19,42 @@
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+#define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+#define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
-#define SYNC_REP_NO_WAIT -1
-#define SYNC_REP_WAIT_WRITE 0
-#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_NO_WAIT -1
+#define SYNC_REP_WAIT_WRITE 0
+#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_WAIT_DATA_FLUSH 2
-#define NUM_SYNC_REP_WAIT_MODE 2
+#define NUM_SYNC_REP_WAIT_MODE 3
/* syncRepState */
-#define SYNC_REP_NOT_WAITING 0
-#define SYNC_REP_WAITING 1
-#define SYNC_REP_WAIT_COMPLETE 2
+#define SYNC_REP_NOT_WAITING 0
+#define SYNC_REP_WAITING 1
+#define SYNC_REP_WAIT_COMPLETE 2
+
+typedef enum
+{
+ SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for flush data page */
+ SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only
+ * no wait for WAL */
+ SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush */
+} SynchronousTransferLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+/* user-settable parameters for failback safe replication */
+extern int synchronous_transfer;
+
/* called by user backend */
-extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
+ bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
@@ -52,5 +71,6 @@ extern int SyncRepWakeQueue(bool all, int mode);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
On 9/12/13 3:00 AM, Samrat Revagade wrote:
We are improving the patch for Commit Fest 2 now.
We will fix the above compiler warnings as soon as possible and submit
the patch. Attached *synchronous_transfer_v5.patch* implements review comments from
commit fest 1 and reduces the performance overhead of synchronous_transfer.
There is still this compiler warning:
syncrep.c: In function ‘SyncRepReleaseWaiters’:
syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
[-Wunused-but-set-variable]
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 13, 2013 at 1:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
On 9/12/13 3:00 AM, Samrat Revagade wrote:
We are improving the patch for Commit Fest 2 now.
We will fix above compiler warnings as soon as possible and submit
the patch.
Attached *synchronous_transfer_v5.patch* implements review comments from
commit fest-1 and reduces the performance overhead of synchronous_transfer.
There is still this compiler warning:
syncrep.c: In function ‘SyncRepReleaseWaiters’:
syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
[-Wunused-but-set-variable]
Sorry, I forgot to fix it.
I have attached the patch which I modified.
Regards,
-------
Sawada Masahiko
Attachments:
synchronous_transfer_v6.patch (application/octet-stream)
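As a reading aid for the patch below, here is a hedged standalone C model of the two new fast-exit checks at the top of `SyncRepWaitForLSN()`. `sync_trans_requested` paraphrases the patch's `SyncTransRequested()` macro (assuming `max_wal_senders > 0`), and `fast_exit` is a hypothetical wrapper, not server code.

```c
/* Hypothetical model of when SyncRepWaitForLSN() in the patch returns
 * immediately, based on the ForDataFlush argument and the current
 * synchronous_transfer level.  Not the actual server code. */
#include <assert.h>
#include <stdbool.h>

enum { XFER_COMMIT, XFER_DATA_FLUSH, XFER_ALL };	/* synchronous_transfer */

/* Paraphrases SyncTransRequested(): data-page waits are armed only for
 * levels above "commit" (assuming max_wal_senders > 0). */
static bool
sync_trans_requested(int level)
{
	return level > XFER_COMMIT;
}

/* True if the call exits without queueing: data-page callers exit when
 * the feature is off; commit callers exit under plain "data_flush". */
static bool
fast_exit(int level, bool for_data_flush)
{
	if (!sync_trans_requested(level) && for_data_flush)
		return true;
	if (level == XFER_DATA_FLUSH && !for_data_flush)
		return true;			/* the IsSyncRepSkipped() case */
	return false;
}
```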
*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 37,42 ****
--- 37,44 ----
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
***************
*** 708,715 **** WriteZeroPageXlogRec(int pageno)
/*
* Write a TRUNCATE xlog record
*
! * We must flush the xlog record to disk before returning --- see notes
! * in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
--- 710,719 ----
/*
* Write a TRUNCATE xlog record
*
! * Before returning we must flush the xlog record to disk
! * and if synchronous transfer is requested wait for failback
! * safe standby to receive WAL up to recptr.
! * --- see notes in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
***************
*** 723,728 **** WriteTruncateXlogRec(int pageno)
--- 727,738 ----
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /*
+ * Wait for failback safe standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
}
/*
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
***************
*** 54,59 ****
--- 54,61 ----
#include "access/slru.h"
#include "access/transam.h"
#include "access/xlog.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
***************
*** 744,749 **** SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
--- 746,757 ----
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for the failback safe
+ * standby to receive WAL up to max_lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(max_lsn, true, true);
}
}
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1091,1102 **** EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
! SyncRepWaitForLSN(gxact->prepare_lsn);
records.tail = records.head = NULL;
}
--- 1091,1102 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
! SyncRepWaitForLSN(gxact->prepare_lsn, false, true);
records.tail = records.head = NULL;
}
***************
*** 2058,2069 **** RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr);
}
/*
--- 2058,2069 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr, false, true);
}
/*
***************
*** 2138,2147 **** RecordTransactionAbortPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr);
}
--- 2138,2147 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr, false, true);
}
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1189,1201 **** RecordTransactionCommit(void)
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
--- 1189,1201 ----
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd, false, true);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
***************
*** 4690,4697 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
--- 4690,4706 ----
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
+ {
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
+ }
}
/*
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 39,46 ****
--- 39,48 ----
#include "pgstat.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+ #include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/barrier.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
***************
*** 8286,8291 **** CreateCheckPoint(int flags)
--- 8288,8305 ----
END_CRIT_SECTION();
/*
+ * If synchronous transfer is requested, wait for the failback safe standby
+ * to receive WAL up to the checkpoint record. Otherwise, a failure before
+ * the standby receives the CHECKPOINT WAL record causes an inconsistency
+ * between the control files of master and standby, because the master will
+ * restart from a location that is not known to the standby at fail-over.
+ *
+ * There is no need to wait for shutdown CHECKPOINT.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, !shutdown);
+
+ /*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 25,30 ****
--- 25,32 ----
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
***************
*** 288,293 **** RelationTruncate(Relation rel, BlockNumber nblocks)
--- 290,303 ----
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn. Otherwise, we may have a situation where
+ * the heap is truncated, but the action never replayed on the standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/* Do the real work */
***************
*** 521,526 **** smgr_redo(XLogRecPtr lsn, XLogRecord *record)
--- 531,543 ----
*/
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 65,70 **** char *SyncRepStandbyNames;
--- 65,72 ----
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+ static int SyncTransferMode = SYNC_REP_NO_WAIT;
+ int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
***************
*** 82,101 **** static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
! * Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING before adding ourselves
! * to the wait queue. During SyncRepWakeQueue() a WALSender changes
! * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
! * This backend then resets its state to SYNC_REP_NOT_WAITING.
*/
! void
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
{
char *new_status = NULL;
const char *old_status;
! int mode = SyncRepWaitMode;
/*
* Fast exit if user has not requested sync replication, or there are no
--- 84,113 ----
*/
/*
! * Wait for synchronous/failback safe standby, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING before adding ourselves to the
! * wait queue, using the wait mode selected by ForDataFlush. During
! * SyncRepWakeQueue() a
! * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
! * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
! *
! * ForDataFlush - if TRUE, we wait before flushing data page.
! * Otherwise wait for the sync standby
! *
! * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
! * the standby has already progressed up to the given XactCommitLSN
! *
! * Return TRUE if either the synchronous standby/failback safe standby is not
! * configured/turned off OR the standby has made enough progress
*/
! bool
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
! int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
! bool ret;
/*
* Fast exit if user has not requested sync replication, or there are no
***************
*** 103,109 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* need to be connected.
*/
if (!SyncRepRequested() || !SyncStandbysDefined())
! return;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
--- 115,140 ----
* need to be connected.
*/
if (!SyncRepRequested() || !SyncStandbysDefined())
! return true;
!
! /*
! * If the caller has specified ForDataFlush, but synchronous transfer
! * is not specified or it is turned off, exit.
! *
! * We would like to allow the failback safe mechanism even for cascaded
! * standbys as well. But we can't really wait for the standby to catch
! * up until we reach a consistent state since the standbys won't be
! * even able to connect until we reach that state (XXX Confirm)
! */
! if ((!SyncTransRequested()) && ForDataFlush)
! return true;
!
! /*
! * If the caller has not specified ForDataFlush, but synchronous commit
! * is skipped by values of synchronous_transfer, exit.
! */
! if (IsSyncRepSkipped() && !ForDataFlush)
! return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
***************
*** 119,129 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
*/
! if (!WalSndCtl->sync_standbys_defined ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return;
}
/*
--- 150,169 ----
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
*/
! if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return true;
! }
!
! /*
! * Exit if we are told not to block on the standby.
! */
! if (!Wait)
! {
! LWLockRelease(SyncRepLock);
! return false;
}
/*
***************
*** 150,155 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 190,197 ----
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
***************
*** 186,192 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 228,237 ----
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
***************
*** 263,268 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 308,315 ----
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
***************
*** 370,375 **** SyncRepReleaseWaiters(void)
--- 417,423 ----
volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numdataflush = 0;
int priority = 0;
int i;
***************
*** 437,449 **** SyncRepReleaseWaiters(void)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
! elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
--- 485,505 ----
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] < MyWalSnd->flush)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] = MyWalSnd->flush;
+ numdataflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_DATA_FLUSH);
+
}
LWLockRelease(SyncRepLock);
! elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to data flush %X/%X",
! numwrite , (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush , (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush,
! numdataflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
***************
*** 709,711 **** assign_synchronous_commit(int newval, void *extra)
--- 765,782 ----
break;
}
}
+
+ void
+ assign_synchronous_transfer(int newval, void *extra)
+ {
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_DATA_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+ }
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 1539,1544 **** XLogSend(bool *caughtup)
--- 1539,1548 ----
*caughtup = true;
+ elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= sentPtr(%X/%X) AND sendTImeLine",
+ (uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
+ (uint32) (sentPtr >> 32), (uint32) sentPtr);
+
elog(DEBUG1, "walsender reached end of timeline at %X/%X (sent up to %X/%X)",
(uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
(uint32) (sentPtr >> 32), (uint32) sentPtr);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,48 ----
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
***************
*** 1975,1982 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
XLogFlush(recptr);
!
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
--- 1977,1990 ----
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
+ {
XLogFlush(recptr);
! /*
! * If synchronous transfer is requested, wait for the failback safe
! * standby to receive WAL up to recptr.
! */
! if (SyncTransRequested())
! SyncRepWaitForLSN(recptr, true, true);
! }
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
*** a/src/backend/utils/cache/relmapper.c
--- b/src/backend/utils/cache/relmapper.c
***************
*** 48,53 ****
--- 48,55 ----
#include "catalog/pg_tablespace.h"
#include "catalog/storage.h"
#include "miscadmin.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/lwlock.h"
#include "utils/inval.h"
***************
*** 711,716 **** write_relmap_file(bool shared, RelMapFile *newmap,
--- 713,719 ----
int fd;
RelMapFile *realmap;
char mapfilename[MAXPGPATH];
+ XLogRecPtr lsn = InvalidXLogRecPtr;
/*
* Fill in the overhead fields and update CRC.
***************
*** 753,759 **** write_relmap_file(bool shared, RelMapFile *newmap,
{
xl_relmap_update xlrec;
XLogRecData rdata[2];
- XLogRecPtr lsn;
/* now errors are fatal ... */
START_CRIT_SECTION();
--- 756,761 ----
***************
*** 775,780 **** write_relmap_file(bool shared, RelMapFile *newmap,
--- 777,783 ----
/* As always, WAL must hit the disk before the data update does */
XLogFlush(lsn);
+
}
errno = 0;
***************
*** 849,854 **** write_relmap_file(bool shared, RelMapFile *newmap,
--- 852,864 ----
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 381,386 **** static const struct config_enum_entry synchronous_commit_options[] = {
--- 381,398 ----
};
/*
+ * Although only "all", "data_flush", and "commit" are documented, we
+ * also accept "0" as a hidden synonym for "commit".
+ */
+ static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {NULL, 0, false}
+ };
+
+ /*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
***************
*** 3288,3293 **** static struct config_enum ConfigureNamesEnum[] =
--- 3300,3315 ----
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 220,225 ****
--- 220,227 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 60,65 ****
--- 60,67 ----
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+ #include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
***************
*** 115,120 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
--- 117,134 ----
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous transfer is requested, we check if the commit WAL record
+ * has made it to the standby before allowing hint bit updates. We should
+ * not wait for the standby to receive the WAL, since it is OK to delay
+ * hint bit updates.
+ */
+ if (SyncTransRequested())
+ {
+ if(!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
+ }
}
tuple->t_infomask |= infomask;
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 19,41 ****
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
/* SyncRepWaitMode */
! #define SYNC_REP_NO_WAIT -1
! #define SYNC_REP_WAIT_WRITE 0
! #define SYNC_REP_WAIT_FLUSH 1
! #define NUM_SYNC_REP_WAIT_MODE 2
/* syncRepState */
! #define SYNC_REP_NOT_WAITING 0
! #define SYNC_REP_WAITING 1
! #define SYNC_REP_WAIT_COMPLETE 2
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
/* called by user backend */
! extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
--- 19,60 ----
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+ #define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+ #define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
! #define SYNC_REP_NO_WAIT -1
! #define SYNC_REP_WAIT_WRITE 0
! #define SYNC_REP_WAIT_FLUSH 1
! #define SYNC_REP_WAIT_DATA_FLUSH 2
! #define NUM_SYNC_REP_WAIT_MODE 3
/* syncRepState */
! #define SYNC_REP_NOT_WAITING 0
! #define SYNC_REP_WAITING 1
! #define SYNC_REP_WAIT_COMPLETE 2
!
! typedef enum
! {
! SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for data page flush */
! SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only,
! * not at transaction commit */
! SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush and commit */
! } SynchronousTransferLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ /* user-settable parameters for failback safe replication */
+ extern int synchronous_transfer;
+
/* called by user backend */
! extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
! bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
***************
*** 52,56 **** extern int SyncRepWakeQueue(bool all, int mode);
--- 71,76 ----
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+ extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
syncrep.c: In function ‘SyncRepReleaseWaiters’:
syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
[-Wunused-but-set-variable]
Sorry, I forgot to fix it.
I have attached the patch which I modified.
The attached patch combines the documentation and source-code patches.
--
Regards,
Samrat Revgade
Attachments:
synchronous_transfer_v7.patch (application/octet-stream)
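The documentation changes in the v7 patch describe three settings; as a quick cross-check, this small C sketch (my own summary of the config.sgml text, not part of the patch, with hypothetical function names) encodes which operations wait under each documented value of `synchronous_transfer`, assuming `synchronous_standby_names` is set for the commit case:

```c
/* Illustrative summary of the documented behavior matrix for
 * synchronous_transfer.  The semantics come from the config.sgml text
 * in the patch; the function names here are made up. */
#include <assert.h>
#include <stdbool.h>

enum { XFER_COMMIT, XFER_DATA_FLUSH, XFER_ALL };

/* Does a transaction commit wait for the standby?  Yes for "commit"
 * (plain synchronous replication) and "all", no for "data_flush". */
static bool
commit_waits(int level)
{
	return level != XFER_DATA_FLUSH;
}

/* Does a data page modification wait for the standby?  Yes for
 * "data_flush" and "all", no for the default "commit". */
static bool
data_page_waits(int level)
{
	return level != XFER_COMMIT;
}
```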
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 370aa09..bc8891c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1749,6 +1749,50 @@ include 'filename'
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">
+ <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_transfer</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the synchronous nature of WAL transfer and
+ maintains file system level consistency between the master and standby
+ servers. It specifies whether the master waits, before making a file
+ system level change (for example, modifying a data page), until the
+ corresponding WAL records have been replicated to the standby server.
+ </para>
+ <para>
+ Valid values are <literal>commit</>, <literal>data_flush</> and
+ <literal>all</>. The default value is <literal>commit</>, meaning
+ that the master waits only for transaction commits; this is
+ equivalent to turning <literal>synchronous_transfer</> off, and the
+ standby server behaves as a <quote>synchronous standby</> in
+ Streaming Replication. With <literal>data_flush</>, the master waits
+ only for data page modifications but not for transaction commits,
+ so the standby server acts as an <quote>asynchronous failback safe
+ standby</>. With <literal>all</>, the master waits for data page
+ modifications as well as for transaction commits, and the standby
+ server acts as a <quote>synchronous failback safe standby</>. To
+ configure a synchronous failback safe standby,
+ <xref linkend="guc-synchronous-standby-names"> must also be set.
+ </para>
+ <para>
+ Setting this parameter to <literal>commit</> configures plain
+ Streaming Replication, while setting it to <literal>data_flush</>
+ makes WAL transfer synchronous except for transaction commits. All
+ WAL transfer can be made synchronous by setting this parameter to
+ <literal>all</>.
+ </para>
+ <para>
+ Setting <literal>synchronous_transfer</> to <literal>data_flush</> or
+ <literal>all</> makes WAL transfer synchronous, but the wait falls
+ mostly on background activities such as the bgwriter, so it adds
+ little performance overhead.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-sync-method" xreflabel="wal_sync_method">
<term><varname>wal_sync_method</varname> (<type>enum</type>)</term>
<indexterm>
@@ -2258,14 +2302,25 @@ include 'filename'
</indexterm>
<listitem>
<para>
- Specifies a comma-separated list of standby names that can support
- <firstterm>synchronous replication</>, as described in
- <xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this list
- that is both currently connected and streaming data in real-time
+ Specifies a comma-separated list of standby names. If this parameter
+ is set, the standby behaves either as a synchronous standby in
+ replication, as described in <xref linkend="synchronous-replication">,
+ or as a synchronous failback safe standby, as described in
+ <xref linkend="failback-safe">. At any time there will be at most one
+ active standby. When the standby is a synchronous standby in
+ replication, transactions waiting for commit are allowed to proceed
+ after this standby server confirms receipt of their data. When the
+ standby is a synchronous failback safe standby, data page
+ modifications as well as transaction commits are allowed to proceed
+ only after this standby server confirms receipt of their data.
+ If this parameter is empty and <xref linkend="guc-synchronous-transfer">
+ is set to <literal>data_flush</>, the standby is called an
+ asynchronous failback safe standby, and only data page modifications
+ wait until the corresponding WAL record is replicated to the standby.
+ </para>
+ <para>
+ The synchronous standby in replication will be the first standby named in
+ this list that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c8f6fa8..b2b42be 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1140,6 +1140,70 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+
+ <sect2 id="failback-safe">
+ <title>Setting up failback safe standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Setting up failback safe standby</primary>
+ </indexterm>
+
+ <para>
+ PostgreSQL Streaming Replication offers durability, but if the
+ master crashes and a particular WAL record fails to reach the
+ standby server, that WAL record is present on the master but not on
+ the standby. In such a case the master is ahead of the standby in
+ terms of WAL records and database contents, which leads to
+ file-system level inconsistency between master and standby.
+ </para>
+
+ <para>
+ Due to this inconsistency fresh backup of new master onto new standby
+ is needed to re-prepare HA cluster. Taking fresh backup can be very
+ time consuming process when database is of large size. In such a case
+ disaster recovery can take very long time if Streaming Replication is
+ used to setup the high availability cluster. The reason for this is,
+ Synchronous Replication makes WAL transfer synchronous at the time of
+ transaction commit. This will ensure the durability until the HA cluster
+ is up, but there are certain cases such as heap page update which will
+ cause inconsistency at the time of disaster if the standby is failed to
+ receive WAL record corresponding to this heap page update.
+ </para>
+
+ <para>
+ If the HA cluster is configured with a failback safe standby, the
+ master makes data page modifications wait until the corresponding
+ WAL record is replicated to the standby. The master therefore does
+ not make any file system level change until it gets confirmation
+ from the standby server, which maintains consistency and avoids the
+ need for a fresh backup.
+ </para>
+
+ <sect3 id="Failback-safe-config">
+ <title>Basic Configuration</title>
+ <para>
+ A failback safe standby can be asynchronous or synchronous in
+ nature, depending on whether the master also waits for transaction
+ commits. By default the failback safe mechanism is turned off.
+ </para>
+
+ <para>
+ The first step in configuring HA with a failback safe standby is to
+ set up streaming replication. Configuring a synchronous failback
+ safe standby requires setting <xref linkend="guc-synchronous-transfer">
+ to <literal>all</>, and <xref linkend="guc-synchronous-standby-names">
+ must be set to a non-empty value. This configuration causes each
+ commit and data page modification to wait for confirmation that the
+ standby has written the corresponding WAL record to durable storage.
+ Configuring an asynchronous failback safe standby requires only
+ setting <xref linkend="guc-synchronous-transfer"> to
+ <literal>data_flush</>. This configuration causes only data page
+ modifications to wait for confirmation that the standby has written
+ the corresponding WAL record to durable storage.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1201,12 +1265,28 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</para>
<para>
- So, switching from primary to standby server can be fast but requires
- some time to re-prepare the failover cluster. Regular switching from
- primary to standby is useful, since it allows regular downtime on
- each system for maintenance. This also serves as a test of the
- failover mechanism to ensure that it will really work when you need it.
- Written administration procedures are advised.
+ At the time of failover there is a possibility of file-system level
+ inconsistency between the old primary and the old standby, so a fresh
+ backup from the new master onto the old master is needed to configure
+ the former primary as a new standby server. Without a fresh backup,
+ even if the new standby starts, streaming replication does not start
+ successfully. Taking a backup can be fast for a small database, but a
+ large database requires more time to re-prepare the failover cluster
+ and could break the service level agreement of crash recovery. This
+ situation can arise when the HA cluster is configured with Streaming
+ Replication. The need for a fresh backup and the problem of long
+ recovery time can be solved by configuring the HA cluster with a
+ failback safe standby; see <xref linkend="failback-safe">.
+ A failback safe standby makes WAL transfer synchronous at the required
+ places, maintains file-system level consistency between master and
+ standby, and lets the former primary be easily configured as the new standby.
+ </para>
+
+ <para>
+ Regular switching from primary to standby is useful, since it allows
+ regular downtime on each system for maintenance. This also serves as
+ a test of the failover mechanism to ensure that it will really work
+ when you need it. Written administration procedures are advised.
</para>
<para>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 2af1738..da3820f 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1569,6 +1569,14 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
corruption) in case of a crash of the <emphasis>database</> alone.
</para>
</listitem>
+
+ <listitem>
+ <para>
+ Set <xref linkend="guc-synchronous-transfer"> to <literal>commit</>;
+ there is then no need to guard against data page inconsistency between
+ master and standby, and it is feasible to take a fresh backup at failback time.
+ </para>
+ </listitem>
</itemizedlist>
</para>
</sect1>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index cb95aa3..d216b2e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -37,6 +37,8 @@
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -708,8 +710,10 @@ WriteZeroPageXlogRec(int pageno)
/*
* Write a TRUNCATE xlog record
*
- * We must flush the xlog record to disk before returning --- see notes
- * in TruncateCLOG().
+ * Before returning, we must flush the xlog record to disk and,
+ * if synchronous transfer is requested, wait for the failback
+ * safe standby to receive WAL up to recptr.
+ * --- see notes in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
@@ -723,6 +727,12 @@ WriteTruncateXlogRec(int pageno)
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /*
+ * Wait for failback safe standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
}
/*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5e53593..edaee83 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,6 +54,8 @@
#include "access/slru.h"
#include "access/transam.h"
#include "access/xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
@@ -744,6 +746,12 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for the failback safe
+ * standby to receive WAL up to max_lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(max_lsn, true, true);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e975f8d..38a9e9c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1091,12 +1091,12 @@ EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
- SyncRepWaitForLSN(gxact->prepare_lsn);
+ SyncRepWaitForLSN(gxact->prepare_lsn, false, true);
records.tail = records.head = NULL;
}
@@ -2058,12 +2058,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
/*
@@ -2138,10 +2138,10 @@ RecordTransactionAbortPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0591f3f..25210df 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1189,13 +1189,13 @@ RecordTransactionCommit(void)
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
- SyncRepWaitForLSN(XactLastRecEnd);
+ SyncRepWaitForLSN(XactLastRecEnd, false, true);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
@@ -4690,8 +4690,17 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
+ {
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for the failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
+ }
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fc495d6..ef46419 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -39,8 +39,10 @@
#include "pgstat.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/barrier.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -8282,6 +8284,18 @@ CreateCheckPoint(int flags)
END_CRIT_SECTION();
/*
+ * If synchronous transfer is requested, wait for the failback safe standby
+ * to receive WAL up to the checkpoint record. Otherwise, a failure
+ * occurring before the standby receives the CHECKPOINT WAL record would
+ * cause an inconsistency between the control files of master and standby,
+ * and the master would start from a location unknown to the standby at
+ * fail-over time.
+ *
+ * There is no need to wait for a shutdown CHECKPOINT.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, !shutdown);
+
+ /*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 971a149..050a6ba 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -25,6 +25,8 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -288,6 +290,14 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn. Otherwise, we may have a situation where
+ * the heap is truncated, but the action never replayed on the standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/* Do the real work */
@@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
*/
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 8cf1346..f410b9c 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -66,6 +66,8 @@ char *SyncRepStandbyNames;
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+static int SyncTransferMode = SYNC_REP_NO_WAIT;
+int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
@@ -83,20 +85,30 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
- * Wait for synchronous replication, if requested by user.
+ * Wait for synchronous/failback safe standby, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
- * change that state to SYNC_REP_WAITING before adding ourselves
- * to the wait queue. During SyncRepWakeQueue() a WALSender changes
- * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
- * This backend then resets its state to SYNC_REP_NOT_WAITING.
+ * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
+ * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
+ * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
+ * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
+ *
+ * ForDataFlush - if TRUE, we wait before flushing a data page;
+ * otherwise we wait for the sync standby.
+ *
+ * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
+ * the standby has already made progress up to the given XactCommitLSN.
+ *
+ * Return TRUE if the synchronous standby/failback safe standby is not
+ * configured or is turned off, OR if the standby has made enough progress.
*/
-void
-SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+bool
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
- int mode = SyncRepWaitMode;
+ int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
+ bool ret;
/*
* Fast exit if user has not requested sync replication, or there are no
@@ -104,7 +116,26 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* need to be connected.
*/
if (!SyncRepRequested() || !SyncStandbysDefined())
- return;
+ return true;
+
+ /*
+ * If the caller has specified ForDataFlush, but synchronous transfer
+ * is not specified or is turned off, exit.
+ *
+ * We would like to allow the failback safe mechanism for cascaded
+ * standbys as well, but we can't really wait for the standby to catch
+ * up until we reach a consistent state, since the standbys won't even
+ * be able to connect before we reach that state (XXX Confirm)
+ */
+ if ((!SyncTransRequested()) && ForDataFlush)
+ return true;
+
+ /*
+ * If the caller has not specified ForDataFlush, but the synchronous
+ * commit wait is skipped because of the value of synchronous_transfer, exit.
+ */
+ if (IsSyncRepSkipped() && !ForDataFlush)
+ return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
@@ -120,11 +151,20 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so it's likely to
* be a low cost check.
*/
- if (!WalSndCtl->sync_standbys_defined ||
+ if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
- return;
+ return true;
+ }
+
+ /*
+ * Exit if we are told not to block on the standby.
+ */
+ if (!Wait)
+ {
+ LWLockRelease(SyncRepLock);
+ return false;
}
/*
@@ -151,6 +191,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
@@ -187,7 +229,10 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
@@ -264,6 +309,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
@@ -371,6 +418,7 @@ SyncRepReleaseWaiters(void)
volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numdataflush = 0;
int priority = 0;
int i;
@@ -438,13 +486,21 @@ SyncRepReleaseWaiters(void)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] < MyWalSnd->flush)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] = MyWalSnd->flush;
+ numdataflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_DATA_FLUSH);
+
}
LWLockRelease(SyncRepLock);
- elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
- numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
- numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+ elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to data flush %X/%X",
+ numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
+ numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush,
+ numdataflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
@@ -710,3 +766,18 @@ assign_synchronous_commit(int newval, void *extra)
break;
}
}
+
+void
+assign_synchronous_transfer(int newval, void *extra)
+{
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_DATA_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index afd559d..492e039 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1539,6 +1539,10 @@ XLogSend(bool *caughtup)
*caughtup = true;
+ elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= sentPtr(%X/%X) AND sendTimeLine",
+ (uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
+ (uint32) (sentPtr >> 32), (uint32) sentPtr);
+
elog(DEBUG1, "walsender reached end of timeline at %X/%X (sent up to %X/%X)",
(uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
(uint32) (sentPtr >> 32), (uint32) sentPtr);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f848391..7a2e285 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -41,6 +41,8 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -1975,8 +1977,14 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
+ {
XLogFlush(recptr);
-
+ /*
+ * If synchronous transfer is requested, wait for the failback safe
+ * standby to receive WAL up to recptr.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
+ }
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 18f0342..e92b607 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -48,6 +48,8 @@
#include "catalog/pg_tablespace.h"
#include "catalog/storage.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/lwlock.h"
#include "utils/inval.h"
@@ -711,6 +713,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
int fd;
RelMapFile *realmap;
char mapfilename[MAXPGPATH];
+ XLogRecPtr lsn = InvalidXLogRecPtr;
/*
* Fill in the overhead fields and update CRC.
@@ -753,7 +756,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
{
xl_relmap_update xlrec;
XLogRecData rdata[2];
- XLogRecPtr lsn;
/* now errors are fatal ... */
START_CRIT_SECTION();
@@ -775,6 +777,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* As always, WAL must hit the disk before the data update does */
XLogFlush(lsn);
+
}
errno = 0;
@@ -849,6 +852,13 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for the failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3107f9c..ccac724 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -381,6 +381,18 @@ static const struct config_enum_entry synchronous_commit_options[] = {
};
/*
+ * Although only "all", "data_flush", and "commit" are documented, we
+ * also accept "0" as a likely variant of "commit".
+ */
+static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {NULL, 0, false}
+};
+
+/*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
@@ -3300,6 +3312,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data page flush synchronization level."),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..d6603c2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -220,6 +220,8 @@
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+#synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index ed66c49..6cf3f26 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -60,6 +60,8 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
@@ -115,6 +117,18 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous transfer is requested, check whether the commit WAL
+ * record has made it to the standby before allowing hint bit updates.
+ * We should not wait for the standby to receive the WAL, since it's OK
+ * to delay hint bit updates.
+ */
+ if (SyncTransRequested())
+ {
+ if (!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
+ }
}
tuple->t_infomask |= infomask;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ac23ea6..4540625 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -19,23 +19,42 @@
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+#define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+#define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
-#define SYNC_REP_NO_WAIT -1
-#define SYNC_REP_WAIT_WRITE 0
-#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_NO_WAIT -1
+#define SYNC_REP_WAIT_WRITE 0
+#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_WAIT_DATA_FLUSH 2
-#define NUM_SYNC_REP_WAIT_MODE 2
+#define NUM_SYNC_REP_WAIT_MODE 3
/* syncRepState */
-#define SYNC_REP_NOT_WAITING 0
-#define SYNC_REP_WAITING 1
-#define SYNC_REP_WAIT_COMPLETE 2
+#define SYNC_REP_NOT_WAITING 0
+#define SYNC_REP_WAITING 1
+#define SYNC_REP_WAIT_COMPLETE 2
+
+typedef enum
+{
+ SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for data page flush */
+ SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only,
+ * not at commit */
+ SYNCHRONOUS_TRANSFER_ALL /* wait for both data page flush and commit */
+} SynchronousTransferLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+/* user-settable parameters for failback safe replication */
+extern int synchronous_transfer;
+
/* called by user backend */
-extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
+ bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
@@ -52,5 +71,6 @@ extern int SyncRepWakeQueue(bool all, int mode);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
On Tue, Sep 17, 2013 at 3:45 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
syncrep.c: In function ‘SyncRepReleaseWaiters’:
syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
[-Wunused-but-set-variable]
Sorry, I forgot to fix it.
I have attached the patch which I modified.
Attached patch combines documentation patch and source-code patch.
I set up synchronous replication with synchronous_transfer = all, and then I ran
pgbench -i and executed CHECKPOINT in the master. After that, when I executed
CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
synchronous_transfer feature.
How does synchronous_transfer work with cascade replication? If it's set to all
in the "sender-side" standby, it can resolve the data page inconsistency between
two standbys?
Regards,
--
Fujii Masao
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Sep 17, 2013 at 3:45 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
syncrep.c: In function ‘SyncRepReleaseWaiters’:
syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
[-Wunused-but-set-variable]
Sorry, I forgot to fix it.
I have attached the patch which I modified.
Attached patch combines documentation patch and source-code patch.
I set up synchronous replication with synchronous_transfer = all, and then I ran
pgbench -i and executed CHECKPOINT in the master. After that, when I executed
CHECKPOINT in the standby, it got stuck infinitely. I guess this was caused by
synchronous_transfer feature.
Did you set synchronous_standby_names in the standby server?
If so, the master server waits for the standby server which is set to
synchronous_standby_names.
Please let me know the details of this case.
How does synchronous_transfer work with cascade replication? If it's set to all
in the "sender-side" standby, it can resolve the data page inconsistency between
two standbys?
Currently the patch supports the case where two servers are set up with SYNC replication.
IOW, a failback safe standby is the same as a SYNC replication standby.
The user can set synchronous_transfer only on the master side.
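For example, a minimal master-side-only setup might look like this (the standby name here is hypothetical):

```
# postgresql.conf on the master only (standby name is hypothetical)
synchronous_standby_names = 'standby1'
synchronous_transfer = all      # or 'data_flush' for an asynchronous
                                # failback safe standby
```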
Regards,
-------
Sawada Masahiko
On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I set up synchronous replication with synchronous_transfer = all, and then I ran
pgbench -i and executed CHECKPOINT in the master. After that, when I executed
CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
synchronous_transfer feature.
Did you set synchronous_standby_names in the
Yes.
If so, the master server waits for the standby server which is set to
synchronous_standby_names.
Please let me know detail of this case.
Both master and standby have the same postgresql.conf settings as follows:
max_wal_senders = 4
wal_level = hot_standby
wal_keep_segments = 32
synchronous_standby_names = '*'
synchronous_transfer = all
How does synchronous_transfer work with cascade replication? If it's set to all
in the "sender-side" standby, it can resolve the data page inconsistency between
two standbys?
Currently patch supports the case which two servers are set up SYNC replication.
IWO, failback safe standby is the same as SYNC replication standby.
User can set synchronous_transfer in only master side.
So, it's very strange that CHECKPOINT on the standby gets stuck infinitely.
Regards,
--
Fujii Masao
On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I set up synchronous replication with synchronous_transfer = all, and then I ran
pgbench -i and executed CHECKPOINT in the master. After that, when I executed
CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
synchronous_transfer feature.
Did you set synchronous_standby_names in the
Yes.
If so, the master server waits for the standby server which is set to
synchronous_standby_names.
Please let me know detail of this case.
Both master and standby have the same postgresql.conf settings as follows:
max_wal_senders = 4
wal_level = hot_standby
wal_keep_segments = 32
synchronous_standby_names = '*'
synchronous_transfer = all
How does synchronous_transfer work with cascade replication? If it's set to all
in the "sender-side" standby, it can resolve the data page inconsistency between
two standbys?
Currently patch supports the case which two servers are set up SYNC replication.
IWO, failback safe standby is the same as SYNC replication standby.
User can set synchronous_transfer in only master side.
So, it's very strange that CHECKPOINT on the standby gets stuck infinitely.
Yes, I think so.
I was not considering the case where the user sets
synchronous_standby_names in the standby server.
It will occur.
I will fix it, considering this case.
Regards,
-------
Sawada Masahiko
On Wed, Sep 18, 2013 at 1:05 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I set up synchronous replication with synchronous_transfer = all, and then I ran
pgbench -i and executed CHECKPOINT in the master. After that, when I executed
CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
synchronous_transfer feature.
Did you set synchronous_standby_names in the
Yes.
If so, the master server waits for the standby server which is set to
synchronous_standby_names.
Please let me know detail of this case.
Both master and standby have the same postgresql.conf settings as follows:
max_wal_senders = 4
wal_level = hot_standby
wal_keep_segments = 32
synchronous_standby_names = '*'
synchronous_transfer = all
How does synchronous_transfer work with cascade replication? If it's set to all
in the "sender-side" standby, it can resolve the data page inconsistency between
two standbys?
Currently patch supports the case which two servers are set up SYNC replication.
IWO, failback safe standby is the same as SYNC replication standby.
User can set synchronous_transfer in only master side.
So, it's very strange that CHECKPOINT on the standby gets stuck infinitely.
Sorry, I sent that mail by mistake.
Yes, I think so.
It waits for the corresponding WAL to be replicated.
The behaviour of synchronous_transfer is a little similar to
synchronous_standby_names and synchronous replication.
That is, if those parameters are set but the standby server doesn't
connect to the master server,
the master server waits infinitely for the corresponding WAL to be
replicated to the standby server.
I was not considering the case where the user sets
synchronous_standby_names in the standby server.
I will fix it, considering this case.
Regards,
-------
Sawada Masahiko
On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I set up synchronous replication with synchronous_transfer = all, and then I ran
pgbench -i and executed CHECKPOINT in the master. After that, when I executed
CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
synchronous_transfer feature.
Did you set synchronous_standby_names in the
Yes.
If so, the master server waits for the standby server which is set to
synchronous_standby_names.
Please let me know detail of this case.
Both master and standby have the same postgresql.conf settings as follows:
max_wal_senders = 4
wal_level = hot_standby
wal_keep_segments = 32
synchronous_standby_names = '*'
synchronous_transfer = all
How does synchronous_transfer work with cascade replication? If it's set to all
in the "sender-side" standby, it can resolve the data page inconsistency between
two standbys?
Currently patch supports the case which two servers are set up SYNC replication.
IWO, failback safe standby is the same as SYNC replication standby.
User can set synchronous_transfer in only master side.
So, it's very strange that CHECKPOINT on the standby gets stuck infinitely.
I have attached the modified patch.
I have changed it so that if both synchronous replication and synchronous
transfer are requested,
but the server is still in recovery (i.e., it is in standby mode),
the server doesn't wait for the corresponding WAL to be replicated.
Specifically, I added a RecoveryInProgress() condition.
If both features (synchronous replication and transfer) are enabled and
the user sets up synchronous replication between two servers,
the user can execute CHECKPOINT on the standby side. It will not wait for
the corresponding WAL to be replicated.
But if both parameters are set and the user doesn't set up synchronous
replication (i.e., the master server works alone),
the master server waits infinitely when the user executes CHECKPOINT. This
behaviour is similar to synchronous replication.
Regards,
-------
Sawada Masahiko
Attachments:
synchronous_transfer_v8.patchapplication/octet-stream; name=synchronous_transfer_v8.patchDownload
*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 37,42 ****
--- 37,44 ----
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
***************
*** 708,715 **** WriteZeroPageXlogRec(int pageno)
/*
* Write a TRUNCATE xlog record
*
! * We must flush the xlog record to disk before returning --- see notes
! * in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
--- 710,719 ----
/*
* Write a TRUNCATE xlog record
*
! * Before returning we must flush the xlog record to disk
! * and if synchronous transfer is requested wait for failback
! * safe standby to receive WAL up to recptr.
! * --- see notes in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
***************
*** 723,728 **** WriteTruncateXlogRec(int pageno)
--- 727,738 ----
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /*
+ * Wait for failback safe standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
}
/*
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
***************
*** 54,59 ****
--- 54,61 ----
#include "access/slru.h"
#include "access/transam.h"
#include "access/xlog.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
***************
*** 700,705 **** SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
--- 702,714 ----
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to max_lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(max_lsn, true, true);
}
}
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1091,1102 **** EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
! SyncRepWaitForLSN(gxact->prepare_lsn);
records.tail = records.head = NULL;
}
--- 1091,1102 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
! SyncRepWaitForLSN(gxact->prepare_lsn, false, true);
records.tail = records.head = NULL;
}
***************
*** 2058,2069 **** RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr);
}
/*
--- 2058,2069 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr, false, true);
}
/*
***************
*** 2138,2147 **** RecordTransactionAbortPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr);
}
--- 2138,2147 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr, false, true);
}
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1189,1201 **** RecordTransactionCommit(void)
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
--- 1189,1201 ----
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd, false, true);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
***************
*** 4690,4697 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
--- 4690,4706 ----
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
+ {
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn,
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
+ }
}
/*
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 39,46 ****
--- 39,48 ----
#include "pgstat.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+ #include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/barrier.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
***************
*** 8286,8291 **** CreateCheckPoint(int flags)
--- 8288,8305 ----
END_CRIT_SECTION();
/*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to checkpoint WAL record. Otherwise if failure occurs
+ * before standby receives CHECKPOINT WAL record causes an inconsistency
+ * between control files of master and standby. Because of this master will
+ * start from a location which is not known to the standby at the time fail-over.
+ *
+ * There is no need to wait for shutdown CHECKPOINT.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, !shutdown);
+
+ /*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 25,30 ****
--- 25,32 ----
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
***************
*** 288,293 **** RelationTruncate(Relation rel, BlockNumber nblocks)
--- 290,303 ----
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn. Otherwise, we may have a situation where
+ * the heap is truncated, but the action never replayed on the standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/* Do the real work */
***************
*** 521,526 **** smgr_redo(XLogRecPtr lsn, XLogRecord *record)
--- 531,543 ----
*/
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 65,70 **** char *SyncRepStandbyNames;
--- 65,72 ----
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+ static int SyncTransferMode = SYNC_REP_NO_WAIT;
+ int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
***************
*** 82,101 **** static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
! * Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING before adding ourselves
! * to the wait queue. During SyncRepWakeQueue() a WALSender changes
! * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
! * This backend then resets its state to SYNC_REP_NOT_WAITING.
*/
! void
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
{
char *new_status = NULL;
const char *old_status;
! int mode = SyncRepWaitMode;
/*
* Fast exit if user has not requested sync replication, or there are no
--- 84,114 ----
*/
/*
! * Wait for synchronous/failback safe standby, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
! * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
! * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
! * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
! *
! * ForDataFlush - if TRUE, we wait before flushing data page.
! * Otherwise wait for the sync standby
! *
! * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
! * the standby has already made progressed upto the given XactCommitLSN
! *
! * Return TRUE if either the synchronous standby/failback safe standby is not
! * configured/turned off OR the standby has made enough progress
*/
! bool
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
! int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
! bool ret;
! int i;
/*
* Fast exit if user has not requested sync replication, or there are no
***************
*** 103,109 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* need to be connected.
*/
if (!SyncRepRequested() || !SyncStandbysDefined())
! return;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
--- 116,148 ----
* need to be connected.
*/
if (!SyncRepRequested() || !SyncStandbysDefined())
! return true;
!
! /*
! * If the caller has specified ForDataFlush, but synchronous transfer
! * is not specified or its turned off, exit.
! *
! * We would like to allow the failback safe mechanism even for cascaded
! * standbys as well. But we can't really wait for the standby to catch
! * up until we reach a consistent state since the standbys won't be
! * even able to connect without us reaching in that state (XXX Confirm)
! */
! if ((!SyncTransRequested()) && ForDataFlush)
! return true;
!
! /*
! * If the caller has not specified ForDataFlush, but synchronous commit
! * is skipped by values of synchronous_transfer, exit.
! */
! if (IsSyncRepSkipped() && !ForDataFlush)
! return true;
!
! /*
! * If both synchronous replication and synchronous transfer
! * are requested but the system still in recovery, exit.
! */
! if (RecoveryInProgress())
! return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
***************
*** 119,129 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
*/
! if (!WalSndCtl->sync_standbys_defined ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return;
}
/*
--- 158,177 ----
* condition but we'll be fetching that cache line anyway so its likely to
* be a low cost check.
*/
! if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return true;
! }
!
! /*
! * Exit if we are told not to block on the standby.
! */
! if (!Wait)
! {
! LWLockRelease(SyncRepLock);
! return false;
}
/*
***************
*** 150,155 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 198,205 ----
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
***************
*** 186,192 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 236,245 ----
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
***************
*** 263,268 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 316,323 ----
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
***************
*** 370,375 **** SyncRepReleaseWaiters(void)
--- 425,431 ----
volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numdataflush = 0;
int priority = 0;
int i;
***************
*** 437,449 **** SyncRepReleaseWaiters(void)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
! elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
--- 493,513 ----
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] < MyWalSnd->flush)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] = MyWalSnd->flush;
+ numdataflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_DATA_FLUSH);
+
}
LWLockRelease(SyncRepLock);
! elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to data flush %X/%X",
! numwrite , (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
! numflush , (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush,
! numdataflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
***************
*** 709,711 **** assign_synchronous_commit(int newval, void *extra)
--- 773,790 ----
break;
}
}
+
+ void
+ assign_synchronous_transfer(int newval, void *extra)
+ {
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_DATA_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+ }
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 1539,1544 **** XLogSend(bool *caughtup)
--- 1539,1548 ----
*caughtup = true;
+ elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= sentPtr(%X/%X) AND sendTImeLine",
+ (uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
+ (uint32) (sentPtr >> 32), (uint32) sentPtr);
+
elog(DEBUG1, "walsender reached end of timeline at %X/%X (sent up to %X/%X)",
(uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
(uint32) (sentPtr >> 32), (uint32) sentPtr);
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,48 ----
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
***************
*** 1975,1982 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
XLogFlush(recptr);
!
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
--- 1977,1990 ----
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
+ {
XLogFlush(recptr);
! /* If synchronous transfer is requested, wait for failback safe standby
! * to receive WAL up to recptr.
! */
! if (SyncTransRequested())
! SyncRepWaitForLSN(recptr, true, true);
! }
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
*** a/src/backend/utils/cache/relmapper.c
--- b/src/backend/utils/cache/relmapper.c
***************
*** 48,53 ****
--- 48,55 ----
#include "catalog/pg_tablespace.h"
#include "catalog/storage.h"
#include "miscadmin.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/lwlock.h"
#include "utils/inval.h"
***************
*** 711,716 **** write_relmap_file(bool shared, RelMapFile *newmap,
--- 713,719 ----
int fd;
RelMapFile *realmap;
char mapfilename[MAXPGPATH];
+ XLogRecPtr lsn=InvalidXLogRecPtr;
/*
* Fill in the overhead fields and update CRC.
***************
*** 753,759 **** write_relmap_file(bool shared, RelMapFile *newmap,
{
xl_relmap_update xlrec;
XLogRecData rdata[2];
- XLogRecPtr lsn;
/* now errors are fatal ... */
START_CRIT_SECTION();
--- 756,761 ----
***************
*** 775,780 **** write_relmap_file(bool shared, RelMapFile *newmap,
--- 777,783 ----
/* As always, WAL must hit the disk before the data update does */
XLogFlush(lsn);
+
}
errno = 0;
***************
*** 849,854 **** write_relmap_file(bool shared, RelMapFile *newmap,
--- 852,864 ----
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to recptr.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 381,386 **** static const struct config_enum_entry synchronous_commit_options[] = {
--- 381,398 ----
};
/*
+ * Although only "all", "data_flush", and "commit" are documented, we
+ * accept all the likely variants of "off".
+ */
+ static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {NULL, 0, false}
+ };
+
+ /*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
***************
*** 3290,3295 **** static struct config_enum ConfigureNamesEnum[] =
--- 3302,3317 ----
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 220,225 ****
--- 220,227 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 62,67 ****
--- 62,69 ----
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+ #include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
***************
*** 118,123 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
--- 120,137 ----
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous transfer is requested, we check if the commit WAL record
+ * has made to the standby before allowing hint bit updates. We should not
+ * wait for the standby to receive the WAL since its OK to delay hint bit
+ * updates.
+ */
+ if (SyncTransRequested())
+ {
+ if(!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
+ }
}
tuple->t_infomask |= infomask;
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 19,41 ****
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
/* SyncRepWaitMode */
! #define SYNC_REP_NO_WAIT -1
! #define SYNC_REP_WAIT_WRITE 0
! #define SYNC_REP_WAIT_FLUSH 1
! #define NUM_SYNC_REP_WAIT_MODE 2
/* syncRepState */
! #define SYNC_REP_NOT_WAITING 0
! #define SYNC_REP_WAITING 1
! #define SYNC_REP_WAIT_COMPLETE 2
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
/* called by user backend */
! extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
--- 19,60 ----
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+ #define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+ #define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
! #define SYNC_REP_NO_WAIT -1
! #define SYNC_REP_WAIT_WRITE 0
! #define SYNC_REP_WAIT_FLUSH 1
! #define SYNC_REP_WAIT_DATA_FLUSH 2
! #define NUM_SYNC_REP_WAIT_MODE 3
/* syncRepState */
! #define SYNC_REP_NOT_WAITING 0
! #define SYNC_REP_WAITING 1
! #define SYNC_REP_WAIT_COMPLETE 2
!
! typedef enum
! {
! SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for flush data page */
! SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only
! * no wait for WAL */
! SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush */
! } SynchronousTransferLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ /* user-settable parameters for failback safe replication */
+ extern int synchronous_transfer;
+
/* called by user backend */
! extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
! bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
***************
*** 52,56 **** extern int SyncRepWakeQueue(bool all, int mode);
--- 71,76 ----
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+ extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
I attached the patch which I have modified.
Thanks for updating the patch!
Here are the review comments:
I got the compiler warning:
syncrep.c:112: warning: unused variable 'i'
How does synchronous_transfer work with synchronous_commit?
+ * accept all the likely variants of "off".
This comment should be removed because synchronous_transfer
doesn't accept the value "off".
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
ISTM the third value "true" should be "false".
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
Why is this needed?
+ elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= sentPtr(%X/%X) AND sendTImeLine",
+ (uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
+ (uint32) (sentPtr >> 32), (uint32) sentPtr);
Why is this needed?
+#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_WAIT_DATA_FLUSH 2
Why do we need to separate the wait-queue for wait-data-flush
from that for wait-flush? ISTM that wait-data-flush also can
wait for the replication on the wait-queue for wait-flush, and
which would simplify the patch.
Regards,
--
Fujii Masao
On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
I attached the patch which I have modified.
Thanks for updating the patch!
Here are the review comments:
Thank you for reviewing!
I got the compiler warning:
syncrep.c:112: warning: unused variable 'i'
How does synchronous_transfer work with synchronous_commit?
In the current patch, synchronous_transfer doesn't work when
synchronous_commit is set to 'off' or 'local'.
If the user changes the synchronous_commit value within a transaction,
the checkpointer process can't see it.
Because of that, even if synchronous_commit is changed from 'on' to 'off',
synchronous_transfer doesn't work.
I'm planning to modify the patch so that synchronous_transfer is not
affected by synchronous_commit.
+ * accept all the likely variants of "off".
This comment should be removed because synchronous_transfer
doesn't accept the value "off".
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
ISTM the third value "true" should be "false".
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
Why is this needed?
+ elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= sentPtr(%X/%X) AND sendTImeLine",
+ (uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
+ (uint32) (sentPtr >> 32), (uint32) sentPtr);
Why is this needed?
They are unnecessary. I had forgot to remove unnecessary codes.
+#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_WAIT_DATA_FLUSH 2
Why do we need to separate the wait-queue for wait-data-flush
from that for wait-flush? ISTM that wait-data-flush also can
wait for the replication on the wait-queue for wait-flush, and
which would simplify the patch.
Yes, it seems unnecessary to add a new queue.
I will delete SYNC_REP_WAIT_DATA_FLUSH and the related code.
Regards,
-------
Sawada Masahiko
On Thu, Sep 19, 2013 at 7:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
I attached the patch which I have modified.
Thanks for updating the patch!
Here are the review comments:
Thank you for reviewing!
I got the compiler warning:
syncrep.c:112: warning: unused variable 'i'
How does synchronous_transfer work with synchronous_commit?
In the current patch, synchronous_transfer doesn't work when
synchronous_commit is set to 'off' or 'local'.
If the user changes the synchronous_commit value within a transaction,
the checkpointer process can't see it.
Because of that, even if synchronous_commit is changed from 'on' to 'off',
synchronous_transfer doesn't work.
I'm planning to modify the patch so that synchronous_transfer is not
affected by synchronous_commit.
Hmm... when synchronous_transfer is set to data_flush,
IMO the intuitive behaviors are:
(1) synchronous_commit = on
A data flush should wait for the corresponding WAL to be
flushed in the standby.
(2) synchronous_commit = remote_write
A data flush should wait for the corresponding WAL to be
written to the OS in the standby.
(3) synchronous_commit = local
(4) synchronous_commit = off
A data flush should wait for the corresponding WAL to be
written locally in the master.
Regards,
--
Fujii Masao
On Thu, Sep 19, 2013 at 7:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 19, 2013 at 7:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
I attached the patch which I have modified.
Thanks for updating the patch!
Here are the review comments:
Thank you for reviewing!
I got the compiler warning:
syncrep.c:112: warning: unused variable 'i'
How does synchronous_transfer work with synchronous_commit?
In the current patch, synchronous_transfer doesn't work when
synchronous_commit is set to 'off' or 'local'.
If the user changes the synchronous_commit value within a transaction,
the checkpointer process can't see it.
Because of that, even if synchronous_commit is changed from 'on' to 'off',
synchronous_transfer doesn't work.
I'm planning to modify the patch so that synchronous_transfer is not
affected by synchronous_commit.
Hmm... when synchronous_transfer is set to data_flush,
IMO the intuitive behaviors are:
(1) synchronous_commit = on
A data flush should wait for the corresponding WAL to be
flushed in the standby.
(2) synchronous_commit = remote_write
A data flush should wait for the corresponding WAL to be
written to the OS in the standby.
(3) synchronous_commit = local
(4) synchronous_commit = off
A data flush should wait for the corresponding WAL to be
written locally in the master.
That is a good idea.
So the synchronous_commit value needs to be visible to other processes.
To share synchronous_commit with other processes, I will try to put its
value into shared memory.
Is there already a GUC parameter that is shared with other processes?
I tried to find such a parameter, but there isn't one.
Regards,
-------
Sawada Masahiko
The attached patch combines the documentation patch and the source-code patch.
I have had a stab at reviewing the documentation. Have a look.
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1749,6 +1749,50 @@ include 'filename'
</listitem>
</varlistentry>
+     <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">
+      <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>
+      <indexterm>
+       <primary><varname>synchronous_transfer</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        This parameter controls the synchronous nature of WAL transfer and
+        maintains file system level consistency between the master and
+        standby servers. It specifies whether the master server will wait
+        before making a file system level change (for example, modifying a
+        data page) until the corresponding WAL records have been replicated
+        to the standby server.
+       </para>
+       <para>
+        Valid values are <literal>commit</>, <literal>data_flush</> and
+        <literal>all</>. The default value is <literal>commit</>, meaning
+        that the master will wait only for transaction commits; this is
+        equivalent to turning the <literal>synchronous_transfer</> parameter
+        off, and the standby server will behave as a <quote>synchronous
+        standby</> in Streaming Replication. For the value
+        <literal>data_flush</>, the master will wait only for data page
+        modifications but not for transaction commits, so the standby server
+        will act as an <quote>asynchronous failback safe standby</>. For the
+        value <literal>all</>, the master will wait for data page
+        modifications as well as for transaction commits, and the resulting
+        standby server will act as a <quote>synchronous failback safe
+        standby</>. The wait happens in background activities and hence will
+        not create performance overhead. To configure a synchronous failback
+        safe standby, <xref linkend="guc-synchronous-standby-names"> should
+        be set.
+       </para>
+      </listitem>
+     </varlistentry>
@@ -2258,14 +2302,25 @@ include 'filename'</indexterm>
<listitem>
<para>
- Specifies a comma-separated list of standby names that can support
- <firstterm>synchronous replication</>, as described in
- <xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous
standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this
list
- that is both currently connected and streaming data in real-time
+ Specifies a comma-separated list of standby names. If this parameter
+ is set, the standby behaves either as a synchronous standby in
+ replication, as described in <xref linkend="synchronous-replication">,
+ or as a synchronous failback safe standby, as described in
+ <xref linkend="failback-safe">. At any time there will be at most one
+ active standby; when it is a synchronous standby in replication,
+ transactions waiting for commit are allowed to proceed after this
+ standby server confirms receipt of their data. When it is a
+ synchronous failback safe standby, data page modifications as well as
+ transaction commits are allowed to proceed only after this standby
+ server confirms receipt of their data. If this parameter is set to an
+ empty value and <xref linkend="guc-synchronous-transfer"> is set to
+ <literal>data_flush</>, the standby is called an asynchronous failback
+ safe standby, and only data page modifications wait until the
+ corresponding WAL records are replicated to the standby.
+ </para>
+ <para>
+ The synchronous standby in replication will be the first standby named
+ in this list that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
+
+ <sect2 id="failback-safe">
+ <title>Setting up failback safe standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Setting up failback safe standby</primary>
+ </indexterm>
+
+ <para>
+ PostgreSQL streaming replication offers durability, but if the master
+ crashes and a particular WAL record does not reach the standby, that
+ WAL record is present on the master but not on the standby. In such
+ a case the master is ahead of the standby in terms of both WAL
+ records and data in the database. This leads to file-system level
+ inconsistency between the master and the standby server. For example,
+ a heap page update on the master might not have been reflected on
+ the standby when the master crashes.
+ </para>
+
+ <para>
+ Because of this inconsistency, a fresh backup of the new master onto
+ the new standby is needed to re-prepare the HA cluster. Taking a
+ fresh backup can be a very time consuming process when the database
+ is large, so disaster recovery can take a very long time if streaming
+ replication is used to set up the high availability cluster.
+ </para>
+
+ <para>
+ If the HA cluster is configured with a failback safe standby, this
+ fresh backup can be avoided. The <xref linkend="guc-synchronous-transfer">
+ parameter controls the relevant WAL transfers: the master makes no file
+ system level change until it gets confirmation from the standby server.
+ This maintains consistency and avoids the need for a fresh backup.
+ </para>
+
+ <sect3 id="Failback-safe-config">
+ <title>Basic Configuration</title>
+ <para>
+ A failback safe standby can be asynchronous or synchronous in nature,
+ depending on whether the master also waits for transaction commits.
+ By default the failback safe mechanism is turned off.
+ </para>
+
+ <para>
+ The first step in configuring HA with a failback safe standby is to
+ set up streaming replication. Configuring a synchronous failback
+ safe standby requires setting <xref linkend="guc-synchronous-transfer">
+ to <literal>all</> and <xref linkend="guc-synchronous-standby-names">
+ to a non-empty value. This configuration causes each commit and each
+ data page modification to wait for confirmation that the standby has
+ written the corresponding WAL record to durable storage. Configuring
+ an asynchronous failback safe standby requires only setting
+ <xref linkend="guc-synchronous-transfer"> to <literal>data_flush</>,
+ which causes only data page modifications to wait for that
+ confirmation.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
</para>
<para>
- So, switching from primary to standby server can be fast but requires
- some time to re-prepare the failover cluster. Regular switching from
- primary to standby is useful, since it allows regular downtime on
- each system for maintenance. This also serves as a test of the
- failover mechanism to ensure that it will really work when you need it.
- Written administration procedures are advised.
+ At the time of failover there is a possibility of file-system level
+ inconsistency between the old primary and the old standby, and hence
+ a fresh backup from the new master onto the old master is needed to
+ configure the old primary as a new standby. Without a fresh backup,
+ even if the new standby starts, streaming replication does not start
+ successfully. Taking this backup can be fast for a small database,
+ but for a large database it takes much longer to re-prepare the
+ failover cluster, which could break the service level agreement for
+ crash recovery. The need for a fresh backup, and the resulting long
+ recovery time, can be avoided if the HA cluster is configured with a
+ failback safe standby; see <xref linkend="failback-safe">.
+ A failback safe standby makes WAL transfer synchronous at the
+ required places while maintaining file-system level consistency
+ between master and standby, without a backup having to be taken on
+ the old master.
+ </para>
+
+ <para>
+ Regular switching from primary to standby is useful, since it allows
+ regular downtime on each system for maintenance. This also serves as
+ a test of the failover mechanism to ensure that it will really work
+ when you need it. Written administration procedures are advised.
</para>
<para>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 2af1738..da3820f 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
</para>
</listitem>
+
+ <listitem>
+ <para>
+ Set <xref linkend="guc-synchronous-transfer"> to <literal>commit</>;
+ there is no need to guard against database inconsistency between
+ master and standby server during failover.
+ </para>
+ </listitem>
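To make the wait semantics described in the documentation above concrete, here is a small standalone sketch. This is not PostgreSQL code: the single-LSN model, the global variables, and the function names are simplifications invented for illustration of when the master would block under each `synchronous_transfer` setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model: an LSN is a plain 64-bit WAL position. */
typedef uint64_t XLogRecPtr;

typedef enum
{
    SYNCHRONOUS_TRANSFER_COMMIT,     /* wait at commit only (default) */
    SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait before data-page flush only */
    SYNCHRONOUS_TRANSFER_ALL         /* wait at commit and data-page flush */
} SyncTransferLevel;

static SyncTransferLevel synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;

/* Highest LSN the standby has confirmed flushing to durable storage. */
static XLogRecPtr standby_flush_lsn = 0;

/* Must a data-page flush whose newest WAL record is at 'lsn' wait? */
static bool
must_wait_for_data_flush(XLogRecPtr lsn)
{
    if (synchronous_transfer == SYNCHRONOUS_TRANSFER_COMMIT)
        return false;               /* failback-safe mechanism is off */
    return lsn > standby_flush_lsn; /* wait only while standby is behind */
}

/* Must a transaction commit whose WAL record is at 'lsn' wait?
 * (Assumes a synchronous standby is configured; sync rep details elided.)
 */
static bool
must_wait_for_commit(XLogRecPtr lsn)
{
    if (synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
        return false;               /* asynchronous failback safe standby */
    return lsn > standby_flush_lsn;
}
```

With `all`, both checks return true while the standby lags; with `data_flush`, only the data-page flush waits; with the default `commit`, only commits wait, as in plain synchronous replication.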
On Fri, Sep 20, 2013 at 3:40 PM, Sameer Thakur <samthakur74@gmail.com>wrote:
Attached patch combines documentation patch and source-code patch.
I have had a stab at reviewing the documentation. Have a look.
Thanks.
Attached patch implements the suggestions in documentation.
But comments from Fujii-san still need to be implemented.
We will implement them soon.
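For anyone trying the patch, the two configurations described in the documentation would look roughly like this in postgresql.conf (the standby name 'standby1' is only an example):

```
# Synchronous failback safe standby: wait at commits and data-page flushes
synchronous_transfer = all
synchronous_standby_names = 'standby1'

# Asynchronous failback safe standby: wait only before data-page flushes
#synchronous_transfer = data_flush
#synchronous_standby_names = ''
```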
Attachments:
synchronous_transfer_v9.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 370aa09..86d2265 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1749,6 +1749,39 @@ include 'filename'
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">
+ <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_transfer</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the synchronous nature of WAL transfer and
+ maintains file system level consistency between the master and the
+ standby server. It specifies whether the master should wait for the
+ corresponding WAL records to be replicated to the standby before
+ making a file system level change (for example, modifying a data page).
+ </para>
+ <para>
+ Valid values are <literal>commit</>, <literal>data_flush</> and
+ <literal>all</>. The default value is <literal>commit</>, meaning
+ that the master waits only for transaction commits. This is
+ equivalent to turning the <literal>synchronous_transfer</> parameter
+ off, and the standby behaves as a <quote>synchronous standby</> in
+ streaming replication. With <literal>data_flush</>, the master waits
+ only for data page modifications but not for transaction commits,
+ so the standby acts as an <quote>asynchronous failback safe
+ standby</>. With <literal>all</>, the master waits for data page
+ modifications as well as for transaction commits, and the standby
+ acts as a <quote>synchronous failback safe standby</>. The wait
+ happens during background activities and hence does not create
+ much performance overhead. To configure a synchronous failback safe
+ standby, <xref linkend="guc-synchronous-standby-names"> should be
+ set.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-sync-method" xreflabel="wal_sync_method">
<term><varname>wal_sync_method</varname> (<type>enum</type>)</term>
<indexterm>
@@ -2258,14 +2291,25 @@ include 'filename'
</indexterm>
<listitem>
<para>
- Specifies a comma-separated list of standby names that can support
- <firstterm>synchronous replication</>, as described in
- <xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this list
- that is both currently connected and streaming data in real-time
+ Specifies a comma-separated list of standby names. If this parameter
+ is set, the standby behaves either as a synchronous standby in
+ replication, as described in <xref linkend="synchronous-replication">,
+ or as a synchronous failback safe standby, as described in
+ <xref linkend="failback-safe">. At any time there will be at most one
+ active standby; when it is a synchronous standby in replication,
+ transactions waiting for commit are allowed to proceed after this
+ standby server confirms receipt of their data. When it is a
+ synchronous failback safe standby, data page modifications as well as
+ transaction commits are allowed to proceed only after this standby
+ server confirms receipt of their data. If this parameter is set to an
+ empty value and <xref linkend="guc-synchronous-transfer"> is set to
+ <literal>data_flush</>, the standby is called an asynchronous failback
+ safe standby, and only data page modifications wait until the
+ corresponding WAL records are replicated to the standby.
+ </para>
+ <para>
+ The synchronous standby in replication will be the first standby named
+ in this list that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c8f6fa8..e551077 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1140,6 +1140,64 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+
+ <sect2 id="failback-safe">
+ <title>Setting up failback safe standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Setting up failback safe standby</primary>
+ </indexterm>
+
+ <para>
+ PostgreSQL streaming replication offers durability, but if the master
+ crashes and a particular WAL record does not reach the standby, that
+ WAL record is present on the master but not on the standby. In such
+ a case the master is ahead of the standby in terms of both WAL
+ records and data in the database. This leads to file-system level
+ inconsistency between the master and the standby server. For example,
+ a heap page update on the master might not have been reflected on
+ the standby when the master crashes.
+ </para>
+
+ <para>
+ Because of this inconsistency, a fresh backup of the new master onto
+ the new standby is needed to re-prepare the HA cluster. Taking a
+ fresh backup can be a very time consuming process when the database
+ is large, so disaster recovery can take a very long time if streaming
+ replication is used to set up the high availability cluster.
+ </para>
+
+ <para>
+ If the HA cluster is configured with a failback safe standby, this
+ fresh backup can be avoided. The <xref linkend="guc-synchronous-transfer">
+ parameter controls the relevant WAL transfers: the master makes no file
+ system level change until it gets confirmation from the standby server.
+ This maintains consistency and avoids the need for a fresh backup.
+ </para>
+
+ <sect3 id="Failback-safe-config">
+ <title>Basic Configuration</title>
+ <para>
+ A failback safe standby can be asynchronous or synchronous in nature,
+ depending on whether the master also waits for transaction commits.
+ By default the failback safe mechanism is turned off.
+ </para>
+
+ <para>
+ The first step in configuring HA with a failback safe standby is to
+ set up synchronous streaming replication. Configuring a synchronous
+ failback safe standby requires setting
+ <xref linkend="guc-synchronous-transfer"> to <literal>all</>, which
+ causes each commit and each data page modification to wait for
+ confirmation that the standby has written the corresponding WAL
+ record to durable storage. Configuring an asynchronous failback safe
+ standby requires setting <xref linkend="guc-synchronous-transfer">
+ to <literal>data_flush</>, which causes only data page modifications
+ to wait for that confirmation.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1201,12 +1259,28 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</para>
<para>
- So, switching from primary to standby server can be fast but requires
- some time to re-prepare the failover cluster. Regular switching from
- primary to standby is useful, since it allows regular downtime on
- each system for maintenance. This also serves as a test of the
- failover mechanism to ensure that it will really work when you need it.
- Written administration procedures are advised.
+ At the time of failover there is a possibility of file-system level
+ inconsistency between the old primary and the old standby, and hence
+ a fresh backup from the new master onto the old master is needed to
+ configure the old primary as a new standby. Without a fresh backup,
+ even if the new standby starts, streaming replication does not start
+ successfully. Taking this backup can be fast for a small database,
+ but for a large database it takes much longer to re-prepare the
+ failover cluster, which could break the service level agreement for
+ crash recovery. The need for a fresh backup, and the resulting long
+ recovery time, can be avoided if the HA cluster is configured with a
+ failback safe standby; see <xref linkend="failback-safe">.
+ A failback safe standby makes WAL transfer synchronous at the
+ required places and maintains file-system level consistency between
+ master and standby, so the old master can easily be configured as
+ the new standby server.
+ </para>
+
+ <para>
+ Regular switching from primary to standby is useful, since it allows
+ regular downtime on each system for maintenance. This also serves as
+ a test of the failover mechanism to ensure that it will really work
+ when you need it. Written administration procedures are advised.
</para>
<para>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 2af1738..b074a91 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1569,6 +1569,14 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
corruption) in case of a crash of the <emphasis>database</> alone.
</para>
</listitem>
+
+ <listitem>
+ <para>
+ Set <xref linkend="guc-synchronous-transfer"> to <literal>commit</>;
+ there is no need to guard against database inconsistency between
+ master and standby server during failover.
+ </para>
+ </listitem>
</itemizedlist>
</para>
</sect1>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index cb95aa3..d216b2e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -37,6 +37,8 @@
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -708,8 +710,10 @@ WriteZeroPageXlogRec(int pageno)
/*
* Write a TRUNCATE xlog record
*
- * We must flush the xlog record to disk before returning --- see notes
- * in TruncateCLOG().
+ * Before returning we must flush the xlog record to disk and,
+ * if synchronous transfer is requested, wait for the failback
+ * safe standby to receive WAL up to recptr.
+ * --- see notes in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
@@ -723,6 +727,12 @@ WriteTruncateXlogRec(int pageno)
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /*
+ * Wait for failback safe standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
}
/*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5e53593..069630b 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,6 +54,8 @@
#include "access/slru.h"
#include "access/transam.h"
#include "access/xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
@@ -744,6 +746,13 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to max_lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(max_lsn, true, true);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e975f8d..38a9e9c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1091,12 +1091,12 @@ EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
- SyncRepWaitForLSN(gxact->prepare_lsn);
+ SyncRepWaitForLSN(gxact->prepare_lsn, false, true);
records.tail = records.head = NULL;
}
@@ -2058,12 +2058,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
/*
@@ -2138,10 +2138,10 @@ RecordTransactionAbortPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0591f3f..25210df 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1189,13 +1189,13 @@ RecordTransactionCommit(void)
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
- SyncRepWaitForLSN(XactLastRecEnd);
+ SyncRepWaitForLSN(XactLastRecEnd, false, true);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
@@ -4690,8 +4690,17 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
+ {
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
+ }
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fc495d6..ef46419 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -39,8 +39,10 @@
#include "pgstat.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/barrier.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -8282,6 +8284,18 @@ CreateCheckPoint(int flags)
END_CRIT_SECTION();
/*
+ * If synchronous transfer is requested, wait for the failback safe standby
+ * to receive WAL up to the checkpoint WAL record. Otherwise, a failure
+ * before the standby receives the CHECKPOINT WAL record would cause an
+ * inconsistency between the control files of master and standby, and the
+ * master would start from a location not known to the standby at the time
+ * of fail-over.
+ *
+ * There is no need to wait for the shutdown CHECKPOINT.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, !shutdown);
+
+ /*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 971a149..050a6ba 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -25,6 +25,8 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -288,6 +290,14 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn. Otherwise, we may have a situation where
+ * the heap is truncated, but the action never replayed on the standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/* Do the real work */
@@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
*/
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 8cf1346..bbe88f9 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -66,6 +66,8 @@ char *SyncRepStandbyNames;
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+static int SyncTransferMode = SYNC_REP_NO_WAIT;
+int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
@@ -83,20 +85,31 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
- * Wait for synchronous replication, if requested by user.
+ * Wait for synchronous/failback safe standby, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
- * change that state to SYNC_REP_WAITING before adding ourselves
- * to the wait queue. During SyncRepWakeQueue() a WALSender changes
- * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
- * This backend then resets its state to SYNC_REP_NOT_WAITING.
+ * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
+ * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
+ * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
+ * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
+ *
+ * ForDataFlush - if TRUE, we wait before flushing a data page.
+ * Otherwise we wait for the sync standby.
+ *
+ * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
+ * the standby has already progressed up to the given XactCommitLSN.
+ *
+ * Return TRUE if the synchronous/failback safe standby is not configured
+ * (or is turned off), OR if the standby has made enough progress.
*/
-void
-SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+bool
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
- int mode = SyncRepWaitMode;
+ int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
+ bool ret;
+ int i;
/*
* Fast exit if user has not requested sync replication, or there are no
@@ -104,7 +117,33 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* need to be connected.
*/
if (!SyncRepRequested() || !SyncStandbysDefined())
- return;
+ return true;
+
+ /*
+ * If the caller has specified ForDataFlush, but synchronous transfer
+ * is not specified or is turned off, exit.
+ *
+ * We would like to allow the failback safe mechanism for cascaded
+ * standbys as well. But we can't really wait for the standby to catch
+ * up until we reach a consistent state, since the standbys won't even
+ * be able to connect before we reach that state (XXX Confirm)
+ */
+ if ((!SyncTransRequested()) && ForDataFlush)
+ return true;
+
+ /*
+ * If the caller has not specified ForDataFlush, but synchronous commit
+ * is skipped by values of synchronous_transfer, exit.
+ */
+ if (IsSyncRepSkipped() && !ForDataFlush)
+ return true;
+
+ /*
+ * If both synchronous replication and synchronous transfer
+ * are requested but the system is still in recovery, exit.
+ */
+ if (RecoveryInProgress())
+ return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
@@ -120,11 +159,20 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so it's likely to
* be a low cost check.
*/
- if (!WalSndCtl->sync_standbys_defined ||
+ if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
- return;
+ return true;
+ }
+
+ /*
+ * Exit if we are told not to block on the standby.
+ */
+ if (!Wait)
+ {
+ LWLockRelease(SyncRepLock);
+ return false;
}
/*
@@ -151,6 +199,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
@@ -187,7 +237,10 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
@@ -264,6 +317,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
@@ -371,6 +426,7 @@ SyncRepReleaseWaiters(void)
volatile WalSnd *syncWalSnd = NULL;
int numwrite = 0;
int numflush = 0;
+ int numdataflush = 0;
int priority = 0;
int i;
@@ -438,13 +494,21 @@ SyncRepReleaseWaiters(void)
{
walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
+
+ }
+ if (walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] < MyWalSnd->flush)
+ {
+ walsndctl->lsn[SYNC_REP_WAIT_DATA_FLUSH] = MyWalSnd->flush;
+ numdataflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_DATA_FLUSH);
+
}
LWLockRelease(SyncRepLock);
- elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
- numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
- numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
+ elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, %d procs up to data flush %X/%X",
+ numwrite , (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
+ numflush , (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush,
+ numdataflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
/*
* If we are managing the highest priority standby, though we weren't
@@ -710,3 +774,18 @@ assign_synchronous_commit(int newval, void *extra)
break;
}
}
+
+void
+assign_synchronous_transfer(int newval, void *extra)
+{
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_DATA_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index afd559d..492e039 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1539,6 +1539,10 @@ XLogSend(bool *caughtup)
*caughtup = true;
+ elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <= sentPtr(%X/%X) AND sendTimeLine",
+ (uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
+ (uint32) (sentPtr >> 32), (uint32) sentPtr);
+
elog(DEBUG1, "walsender reached end of timeline at %X/%X (sent up to %X/%X)",
(uint32) (sendTimeLineValidUpto >> 32), (uint32) sendTimeLineValidUpto,
(uint32) (sentPtr >> 32), (uint32) sentPtr);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f848391..7a2e285 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -41,6 +41,8 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -1975,8 +1977,14 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
+ {
XLogFlush(recptr);
-
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to recptr.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
+ }
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 18f0342..e92b607 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -48,6 +48,8 @@
#include "catalog/pg_tablespace.h"
#include "catalog/storage.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/lwlock.h"
#include "utils/inval.h"
@@ -711,6 +713,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
int fd;
RelMapFile *realmap;
char mapfilename[MAXPGPATH];
+ XLogRecPtr lsn = InvalidXLogRecPtr;
/*
* Fill in the overhead fields and update CRC.
@@ -753,7 +756,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
{
xl_relmap_update xlrec;
XLogRecData rdata[2];
- XLogRecPtr lsn;
/* now errors are fatal ... */
START_CRIT_SECTION();
@@ -775,6 +777,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* As always, WAL must hit the disk before the data update does */
XLogFlush(lsn);
+
}
errno = 0;
@@ -849,6 +852,13 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3107f9c..ccac724 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -381,6 +381,18 @@ static const struct config_enum_entry synchronous_commit_options[] = {
};
/*
+ * Although only "all", "data_flush", and "commit" are documented, we
+ * accept all the likely variants of "off".
+ */
+static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
+ {NULL, 0, false}
+};
+
+/*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
@@ -3300,6 +3312,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..d6603c2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -220,6 +220,8 @@
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+#synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index ed66c49..6cf3f26 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -60,6 +60,8 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
@@ -115,6 +117,18 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous transfer is requested, we check if the commit WAL record
+ * has made it to the standby before allowing hint bit updates. We should
+ * not wait for the standby to receive the WAL since it's OK to delay hint
+ * bit updates.
+ */
+ if (SyncTransRequested())
+ {
+ if (!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
+ }
}
tuple->t_infomask |= infomask;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ac23ea6..4540625 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -19,23 +19,42 @@
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+#define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+#define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
-#define SYNC_REP_NO_WAIT -1
-#define SYNC_REP_WAIT_WRITE 0
-#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_NO_WAIT -1
+#define SYNC_REP_WAIT_WRITE 0
+#define SYNC_REP_WAIT_FLUSH 1
+#define SYNC_REP_WAIT_DATA_FLUSH 2
-#define NUM_SYNC_REP_WAIT_MODE 2
+#define NUM_SYNC_REP_WAIT_MODE 3
/* syncRepState */
-#define SYNC_REP_NOT_WAITING 0
-#define SYNC_REP_WAITING 1
-#define SYNC_REP_WAIT_COMPLETE 2
+#define SYNC_REP_NOT_WAITING 0
+#define SYNC_REP_WAITING 1
+#define SYNC_REP_WAIT_COMPLETE 2
+
+typedef enum
+{
+ SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for data page flush */
+ SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only,
+ * no wait for WAL */
+ SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush and WAL */
+} SynchronousTransferLevel;
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+/* user-settable parameters for failback safe replication */
+extern int synchronous_transfer;
+
/* called by user backend */
-extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
+ bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
@@ -52,5 +71,6 @@ extern int SyncRepWakeQueue(bool all, int mode);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
On Fri, Sep 20, 2013 at 10:33 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
On Fri, Sep 20, 2013 at 3:40 PM, Sameer Thakur <samthakur74@gmail.com>
wrote:
Attached patch combines documentation patch and source-code patch.
I have had a stab at reviewing the documentation. Have a look.
Thanks.
Attached patch implements suggestions in documentation.
But comments from Fujii-san still need to be implemented.
We will implement them soon.
I have attached the patch modified based on Fujii-san's suggestions.
If synchronous_transfer is set to 'data_flush', the behaviour of
synchronous_transfer with each synchronous_commit setting is:
(1) synchronous_commit = on
A data flush should wait for the corresponding WAL to be
flushed in the standby
(2) synchronous_commit = remote_write
A data flush should wait for the corresponding WAL to be
written to OS in the standby.
(3) synchronous_commit = local
(4) synchronous_commit = off
A data flush should wait for the corresponding WAL to be
written locally in the master.
Even if a user changes the synchronous_commit value within a transaction,
other processes (e.g. the checkpointer process) can't see that change.
In the current patch, each process uses its local synchronous_commit value.
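For concreteness, the two failback safe configurations discussed in this thread might be sketched in postgresql.conf like this (the standby name is a placeholder, and synchronous_transfer is the parameter this patch proposes, not an existing PostgreSQL setting):

```
# Synchronous failback safe standby: commits AND data page
# flushes wait for the standby to receive the corresponding WAL.
synchronous_standby_names = 'standby1'   # placeholder name
synchronous_transfer = all

# Asynchronous failback safe standby: only data page flushes wait;
# commits do not.
#synchronous_standby_names = ''
#synchronous_transfer = data_flush
```

The default, synchronous_transfer = commit, adds no failback-safe waits at all.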
Regards,
-------
Sawada Masahiko
Attachments:
synchronous_transfer_v10.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 370aa09..86d2265 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1749,6 +1749,39 @@ include 'filename'
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">
+ <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_transfer</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the synchronous nature of WAL transfer and
+ maintains file system level consistency between the master and
+ standby servers. It specifies whether the master server waits for
+ file system level changes (for example, data page modifications)
+ until the corresponding WAL records reach the standby server.
+ </para>
+ <para>
+ Valid values are <literal>commit</>, <literal>data_flush</> and
+ <literal>all</>. The default value is <literal>commit</>, meaning
+ that the master waits only for transaction commits; this is
+ equivalent to turning <literal>synchronous_transfer</> off, and the
+ standby server behaves as a <quote>synchronous standby</> in
+ Streaming Replication. With <literal>data_flush</>, the master
+ waits only for data page modifications but not for transaction
+ commits, so the standby server acts as an <quote>asynchronous
+ failback safe standby</>. With <literal>all</>, the master waits
+ for data page modifications as well as for transaction commits,
+ and the standby server acts as a <quote>synchronous failback safe
+ standby</>. The wait happens in background activities and hence
+ does not create much performance overhead.
+ To configure synchronous failback safe standby
+ <xref linkend="guc-synchronous-standby-names"> should be set.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-sync-method" xreflabel="wal_sync_method">
<term><varname>wal_sync_method</varname> (<type>enum</type>)</term>
<indexterm>
@@ -2258,14 +2291,25 @@ include 'filename'
</indexterm>
<listitem>
<para>
- Specifies a comma-separated list of standby names that can support
- <firstterm>synchronous replication</>, as described in
- <xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this list
- that is both currently connected and streaming data in real-time
+ Specifies a comma-separated list of standby names. If this parameter
+ is set, a standby behaves either as a synchronous standby in
+ replication, as described in <xref linkend="synchronous-replication">,
+ or as a synchronous failback safe standby, as described in
+ <xref linkend="failback-safe">. At any time there will be at most
+ one active standby; for a synchronous standby in replication,
+ transactions waiting for commit are allowed to proceed only after
+ the standby server confirms receipt of their data. For a
+ synchronous failback safe standby, data page modifications as well
+ as transaction commits are allowed to proceed only after the
+ standby server confirms receipt of their data. If this parameter
+ is empty and <xref linkend="guc-synchronous-transfer"> is set to
+ <literal>data_flush</>, the standby is an asynchronous failback
+ safe standby, and only data page modifications wait until the
+ corresponding WAL records are replicated to the standby.
+ </para>
+ <para>
+ The synchronous standby in replication will be the first standby named
+ in this list that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c8f6fa8..e551077 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1140,6 +1140,64 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+
+ <sect2 id="failback-safe">
+ <title>Setting up failback safe standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Setting up failback safe standby</primary>
+ </indexterm>
+
+ <para>
+ PostgreSQL Streaming Replication offers durability, but if the master
+ crashes and a particular WAL record fails to reach the standby
+ server, that WAL record is present on the master but not on the
+ standby. In such a case the master is ahead of the standby in
+ terms of WAL records and database contents. This leads to
+ file-system level inconsistency between the master and the standby.
+ For example, a heap page update on the master might not have been
+ reflected on the standby when the master crashed.
+ </para>
+
+ <para>
+ Due to this inconsistency, a fresh backup of the new master onto the
+ new standby is needed to re-prepare the HA cluster. Taking a fresh
+ backup can be a very time-consuming process when the database is
+ large, so disaster recovery can take a very long time if Streaming
+ Replication is used to set up the high availability cluster.
+ </para>
+
+ <para>
+ If the HA cluster is configured with a failback safe standby, this
+ fresh backup can be avoided. The <xref linkend="guc-synchronous-transfer">
+ parameter controls WAL transfer so that no file system level change
+ is made until the master gets a confirmation from the standby server.
+ This avoids the need for a fresh backup by maintaining consistency.
+ </para>
+
+ <sect3 id="Failback-safe-config">
+ <title>Basic Configuration</title>
+ <para>
+ A failback safe standby can be asynchronous or synchronous in nature,
+ depending on whether the master also waits for transaction commits.
+ By default the failback safe mechanism is turned off.
+ </para>
+
+ <para>
+ The first step in configuring HA with a failback safe standby is to
+ set up synchronous streaming replication. Configuring a synchronous
+ failback safe standby requires setting <xref linkend="guc-synchronous-transfer">
+ to <literal>all</>. This causes each commit and data page
+ modification to wait for confirmation that the standby has written
+ the corresponding WAL records to durable storage. Configuring an
+ asynchronous failback safe standby requires setting
+ <xref linkend="guc-synchronous-transfer"> to <literal>data_flush</>.
+ This causes only data page modifications to wait for confirmation
+ that the standby has written the corresponding WAL records.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1201,12 +1259,28 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</para>
<para>
- So, switching from primary to standby server can be fast but requires
- some time to re-prepare the failover cluster. Regular switching from
- primary to standby is useful, since it allows regular downtime on
- each system for maintenance. This also serves as a test of the
- failover mechanism to ensure that it will really work when you need it.
- Written administration procedures are advised.
+ At the time of failover there is a possibility of file-system level
+ inconsistency between the old primary and the old standby, hence a
+ fresh backup from the new master onto the old master is needed to
+ configure the old master as a new standby server. Without a fresh
+ backup, even if the new standby starts, streaming replication does
+ not start successfully. Taking the backup can be fast for a small
+ database, but a large database requires more time to re-prepare the
+ failover cluster and could break the service level agreement for
+ crash recovery. The need for a fresh backup and the problem of long
+ recovery time can be solved if the HA cluster is configured with a
+ failback safe standby; see <xref linkend="failback-safe">.
+ A failback safe standby makes WAL transfer synchronous at the
+ required places and maintains file-system level consistency between
+ master and standby, so the old master server can easily be
+ configured as the new standby server.
+ </para>
+
+ <para>
+ Regular switching from primary to standby is useful, since it allows
+ regular downtime on each system for maintenance. This also serves as
+ a test of the failover mechanism to ensure that it will really work
+ when you need it. Written administration procedures are advised.
</para>
<para>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 2af1738..b074a91 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1569,6 +1569,14 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
corruption) in case of a crash of the <emphasis>database</> alone.
</para>
</listitem>
+
+ <listitem>
+ <para>
+ Set <xref linkend="guc-synchronous-transfer"> to <literal>commit</>;
+ there is no need to guard against inconsistency between master and
+ standby servers during failover.
+ </para>
+ </listitem>
</itemizedlist>
</para>
</sect1>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index cb95aa3..d216b2e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -37,6 +37,8 @@
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -708,8 +710,10 @@ WriteZeroPageXlogRec(int pageno)
/*
* Write a TRUNCATE xlog record
*
- * We must flush the xlog record to disk before returning --- see notes
- * in TruncateCLOG().
+ * Before returning we must flush the xlog record to disk
+ * and if synchronous transfer is requested wait for failback
+ * safe standby to receive WAL up to recptr.
+ * --- see notes in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
@@ -723,6 +727,12 @@ WriteTruncateXlogRec(int pageno)
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /*
+ * Wait for failback safe standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
}
/*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5e53593..069630b 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,6 +54,8 @@
#include "access/slru.h"
#include "access/transam.h"
#include "access/xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
@@ -744,6 +746,13 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to max_lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(max_lsn, true, true);
}
}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e975f8d..38a9e9c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1091,12 +1091,12 @@ EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
- SyncRepWaitForLSN(gxact->prepare_lsn);
+ SyncRepWaitForLSN(gxact->prepare_lsn, false, true);
records.tail = records.head = NULL;
}
@@ -2058,12 +2058,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
/*
@@ -2138,10 +2138,10 @@ RecordTransactionAbortPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
- SyncRepWaitForLSN(recptr);
+ SyncRepWaitForLSN(recptr, false, true);
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0591f3f..25210df 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1189,13 +1189,13 @@ RecordTransactionCommit(void)
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
- * Wait for synchronous replication, if required.
+ * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
- SyncRepWaitForLSN(XactLastRecEnd);
+ SyncRepWaitForLSN(XactLastRecEnd, false, true);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
@@ -4690,8 +4690,17 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
+ {
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
+ }
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fc495d6..ef46419 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -39,8 +39,10 @@
#include "pgstat.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/barrier.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
@@ -8282,6 +8284,18 @@ CreateCheckPoint(int flags)
END_CRIT_SECTION();
/*
+ * If synchronous transfer is requested, wait for the failback safe standby
+ * to receive WAL up to the checkpoint WAL record. Otherwise, a failure
+ * before the standby receives the CHECKPOINT WAL record causes an
+ * inconsistency between the control files of master and standby, and the
+ * master will restart from a location not known to the standby at failover.
+ *
+ * There is no need to wait for shutdown CHECKPOINT.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, !shutdown);
+
+ /*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 971a149..050a6ba 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -25,6 +25,8 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -288,6 +290,14 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn. Otherwise, we may have a situation where
+ * the heap is truncated, but the action never replayed on the standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/* Do the real work */
@@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
*/
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 8cf1346..6814886 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -67,6 +67,8 @@ static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
+
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
@@ -83,20 +85,30 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
- * Wait for synchronous replication, if requested by user.
+ * Wait for synchronous/failback safe standby, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
- * change that state to SYNC_REP_WAITING before adding ourselves
- * to the wait queue. During SyncRepWakeQueue() a WALSender changes
- * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
- * This backend then resets its state to SYNC_REP_NOT_WAITING.
+ * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
+ * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
+ * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
+ * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
+ *
+ * ForDataFlush - if TRUE, we wait before flushing a data page;
+ * otherwise we wait for the sync standby.
+ *
+ * Wait - if FALSE, we don't actually wait, but tell the caller whether or
+ * not the standby has already progressed up to the given XactCommitLSN.
+ *
+ * Returns TRUE if either the synchronous/failback safe standby is not
+ * configured (or turned off) OR the standby has made enough progress.
*/
-void
-SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+bool
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
int mode = SyncRepWaitMode;
+ bool ret;
/*
* Fast exit if user has not requested sync replication, or there are no
@@ -104,7 +116,33 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* need to be connected.
*/
if (!SyncRepRequested() || !SyncStandbysDefined())
- return;
+ return true;
+
+ /*
+ * If the caller has specified ForDataFlush, but synchronous transfer
+ * is not specified or it's turned off, exit.
+ *
+ * We would like to allow the failback safe mechanism for cascaded
+ * standbys as well. But we can't really wait for the standby to catch
+ * up until we reach a consistent state, since the standbys won't even
+ * be able to connect before we reach that state (XXX Confirm)
+ */
+ if (!SyncTransRequested() && ForDataFlush)
+ return true;
+
+ /*
+ * If the caller has not specified ForDataFlush, but synchronous commit
+ * is skipped due to the value of synchronous_transfer, exit.
+ */
+ if (IsSyncRepSkipped() && !ForDataFlush)
+ return true;
+
+ /*
+ * If both synchronous replication and synchronous transfer
+ * are requested but the system is still in recovery, exit.
+ */
+ if (RecoveryInProgress())
+ return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
@@ -120,11 +158,20 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so it's likely to
* be a low cost check.
*/
- if (!WalSndCtl->sync_standbys_defined ||
+ if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
- return;
+ return true;
+ }
+
+ /*
+ * Exit if we are told not to block on the standby.
+ */
+ if (!Wait)
+ {
+ LWLockRelease(SyncRepLock);
+ return false;
}
/*
@@ -151,6 +198,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
@@ -187,7 +236,10 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
@@ -264,6 +316,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f848391..7a2e285 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -41,6 +41,8 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -1975,8 +1977,14 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
+ {
XLogFlush(recptr);
-
+ /* If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to recptr.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
+ }
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 18f0342..6d0aa69 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -48,6 +48,8 @@
#include "catalog/pg_tablespace.h"
#include "catalog/storage.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/lwlock.h"
#include "utils/inval.h"
@@ -711,6 +713,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
int fd;
RelMapFile *realmap;
char mapfilename[MAXPGPATH];
+ XLogRecPtr lsn = InvalidXLogRecPtr;
/*
* Fill in the overhead fields and update CRC.
@@ -753,7 +756,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
{
xl_relmap_update xlrec;
XLogRecData rdata[2];
- XLogRecPtr lsn;
/* now errors are fatal ... */
START_CRIT_SECTION();
@@ -849,6 +851,13 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3107f9c..2e08f74 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -381,6 +381,16 @@ static const struct config_enum_entry synchronous_commit_options[] = {
};
/*
+ * Only "all", "data_flush", and "commit" are accepted values.
+ */
+static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, false},
+ {NULL, 0, false}
+};
+
+/*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
@@ -3300,6 +3310,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, NULL, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..d6603c2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -220,6 +220,8 @@
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+#synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index ed66c49..6cf3f26 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -60,6 +60,8 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
@@ -115,6 +117,18 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous transfer is requested, we check if the commit WAL record
+ * has made it to the standby before allowing hint bit updates. We should
+ * not wait for the standby to receive the WAL since it's OK to delay hint
+ * bit updates.
+ */
+ if (SyncTransRequested())
+ {
+ if (!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
+ }
}
tuple->t_infomask |= infomask;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ac23ea6..4e42eba 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -19,6 +19,12 @@
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+#define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+#define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
#define SYNC_REP_NO_WAIT -1
#define SYNC_REP_WAIT_WRITE 0
@@ -31,11 +37,23 @@
#define SYNC_REP_WAITING 1
#define SYNC_REP_WAIT_COMPLETE 2
+typedef enum
+{
+ SYNCHRONOUS_TRANSFER_COMMIT, /* do not wait for data page flush */
+ SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only,
+ * not for WAL at commit */
+ SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush and WAL */
+} SynchronousTransferLevel;
+
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+/* user-settable parameters for failback safe replication */
+extern int synchronous_transfer;
+
/* called by user backend */
-extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
+ bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Hmm... when synchronous_transfer is set to data_flush,
IMO the intuitive behaviors are
(1) synchronous_commit = on
    A data flush should wait for the corresponding WAL to be
    flushed in the standby
(2) synchronous_commit = remote_write
    A data flush should wait for the corresponding WAL to be
    written to OS in the standby.
(3) synchronous_commit = local
(4) synchronous_commit = off
    A data flush should wait for the corresponding WAL to be
    written locally in the master.
I thought synchronous_commit and synchronous_transfer are kind of
orthogonal to each other. synchronous_commit only controls whether and how
to wait for the standby when a transaction commits.
synchronous_transfer, OTOH, tells how to interpret the standbys listed in
the synchronous_standby_names parameter. If set to "commit", they are
synchronous standbys (like today). If set to "data_flush", they are
asynchronous failback-safe standbys, and if set to "all", they are
synchronous failback-safe standbys. Well, it's confusing :-(
So IMHO, in the current state of things, the synchronous_transfer GUC
cannot be changed at a session/transaction level, since all backends,
including background workers, must honor the setting to guarantee failback
safety. synchronous_commit still works the same way, but is ignored if
synchronous_transfer is set to "data_flush", because that effectively tells
us that the standbys listed under synchronous_standby_names are really
*async* standbys with failback safety.
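Concretely, under this interpretation the same standby list serves all three modes. A hypothetical postgresql.conf sketch (the name 'standby1' is an assumed application_name, not something from the patch):

```
synchronous_standby_names = 'standby1'
synchronous_transfer = commit        # standby1 is a plain synchronous standby (as today)
#synchronous_transfer = data_flush   # standby1 becomes an asynchronous failback-safe standby
#synchronous_transfer = all          # standby1 becomes a synchronous failback-safe standby
```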
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Thu, Sep 26, 2013 at 8:54 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Thank you for the comment. I think it is a good, simple idea.
In your opinion, if synchronous_transfer is set to 'all' and
synchronous_commit is set to 'on',
the master waits for the data flush even if the user sets
synchronous_commit to 'local' or 'off'.
For example, when a user wants a transaction to complete quickly, they
can't do so. Do we leave such a situation as a constraint?
Regards,
-------
Sawada Masahiko
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
No, the user can still override the wait at the transaction commit point. So if
synchronous_transfer is set to "all":
- If synchronous_commit is ON - wait at all points
- If synchronous_commit is OFF - wait only at buffer flush (and other
failback-safety related) points
If synchronous_transfer is set to "data_flush":
- If synchronous_commit is either ON or OFF - do not wait at commit points,
but wait at all other points
If synchronous_transfer is set to "commit":
- If synchronous_commit is ON - wait at commit point
- If synchronous_commit is OFF - do not wait at any point
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
Thank you for the explanation. Understood.
If synchronous_transfer is set to 'all' and the user changes
synchronous_commit to 'off' (or 'local') in a transaction,
the master server waits at buffer flush, but doesn't wait at commit
points. Right?
In the current patch, synchronous_transfer works in cooperation with
synchronous_commit.
But if the user changes synchronous_commit in a transaction, they are no
longer in cooperation.
So, your idea might be better than the current behaviour of synchronous_transfer.
Regards,
-------
Sawada Masahiko
On Fri, Sep 27, 2013 at 6:44 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
I attached the v11 patch, which fixes the following:
- synchronous_transfer controls the wait only at the data flush level;
synchronous_commit controls the wait at the commit level. (Based on
Pavan's suggestion.)
- If there are no sync replication standby names, neither
synchronous_commit nor synchronous_transfer
takes effect.
- Fixed that we didn't support a failback-safe standby.
The previous patch could not support a failback-safe standby, because
it didn't wait in FlushBuffer,
which is called by autovacuum.
So, if the user wants transactions to complete quickly temporarily, they
need to change the synchronous_transfer value and reload
postgresql.conf.
Regards,
-------
Sawada Masahiko
Attachments:
synchronous_transfer_v11.patch (application/octet-stream)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1749,1754 **** include 'filename'
--- 1749,1787 ----
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">
+ <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_transfer</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the synchronous nature of WAL transfer and
+ maintains file system level consistency between the master and standby
+ servers. It specifies whether the master server will wait, before making
+ a file system level change (for example, modifying a data page), until
+ the corresponding WAL records have been replicated to the standby server.
+ </para>
+ <para>
+ Valid values are <literal>commit</>, <literal>data_flush</> and
+ <literal>all</>. The default value is <literal>commit</>, meaning
+ that the master only waits for transaction commits; this is equivalent
+ to turning off the <literal>synchronous_transfer</> parameter, and the
+ standby server will behave as a <quote>synchronous standby</> in
+ Streaming Replication. For the value <literal>data_flush</>, the master
+ will wait only for data page modifications but not for transaction
+ commits, hence the standby server will act as an <quote>asynchronous
+ failback safe standby</>. For the value <literal>all</>, the master will
+ wait for data page modifications as well as for transaction commits, and
+ the resultant standby server will act as a <quote>synchronous failback
+ safe standby</>. The waits happen during background activities and hence
+ will not create much performance overhead.
+ To configure a synchronous failback safe standby,
+ <xref linkend="guc-synchronous-standby-names"> should be set.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-sync-method" xreflabel="wal_sync_method">
<term><varname>wal_sync_method</varname> (<type>enum</type>)</term>
<indexterm>
***************
*** 2258,2271 **** include 'filename'
</indexterm>
<listitem>
<para>
! Specifies a comma-separated list of standby names that can support
! <firstterm>synchronous replication</>, as described in
! <xref linkend="synchronous-replication">.
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
--- 2291,2315 ----
</indexterm>
<listitem>
<para>
! Specifies a comma-separated list of standby names. If this parameter
! is set, the standby will behave as a synchronous standby in replication,
! as described in <xref linkend="synchronous-replication">, or a synchronous
! failback safe standby, as described in <xref linkend="failback-safe">.
! At any time there will be at most one active standby; when the standby is
! a synchronous standby in replication, transactions waiting for commit
! will be allowed to proceed after this standby server confirms receipt
! of their data. But when the standby is a synchronous failback safe standby,
! data page modifications as well as transaction commits will be allowed
! to proceed only after this standby server confirms receipt of their data.
! If this parameter is set to an empty value and
! <xref linkend="guc-synchronous-transfer"> is set to <literal>data_flush</>,
! then the standby is called an asynchronous failback safe standby, and only
! data page modifications will wait until the corresponding WAL record is
! replicated to the standby.
! </para>
! <para>
! The synchronous standby in replication will be the first standby named in
! this list that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1140,1145 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 1140,1203 ----
</sect3>
</sect2>
+
+ <sect2 id="failback-safe">
+ <title>Setting up failback safe standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Setting up failback safe standby</primary>
+ </indexterm>
+
+ <para>
+ PostgreSQL Streaming Replication offers durability, but if the master
+ crashes and a particular WAL record fails to reach the standby
+ server, then that WAL record is present on the master server but not
+ on the standby server. In such a case the master is ahead of the standby
+ server in terms of WAL records and data in the database. This leads to
+ file-system level inconsistency between the master and standby servers.
+ For example, a heap page update on the master might not have been
+ reflected on the standby when the master crashed.
+ </para>
+
+ <para>
+ Due to this inconsistency, a fresh backup of the new master onto the new
+ standby is needed to re-prepare the HA cluster. Taking a fresh backup can
+ be a very time consuming process when the database is large. In such a
+ case, disaster recovery can take a very long time if Streaming Replication
+ is used to set up the high availability cluster.
+ </para>
+
+ <para>
+ If the HA cluster is configured with a failback safe standby, this fresh
+ backup can be avoided. The <xref linkend="guc-synchronous-transfer">
+ parameter controls all WAL transfers: the master will not make any file
+ system level change until it gets a confirmation from the standby server.
+ This avoids the need for a fresh backup by maintaining consistency.
+ </para>
+
+ <sect3 id="Failback-safe-config">
+ <title>Basic Configuration</title>
+ <para>
+ A failback safe standby can be asynchronous or synchronous in nature,
+ depending upon whether the master waits for transaction commits
+ or not. By default the failback safe mechanism is turned off.
+ </para>
+
+ <para>
+ The first step in configuring HA with a failback safe standby is to set up
+ synchronous streaming replication. Configuring a synchronous failback
+ safe standby requires setting <xref linkend="guc-synchronous-transfer"> to
+ <literal>all</>. This configuration will cause each commit and data
+ page modification to wait for confirmation that the standby has written
+ the corresponding WAL record to durable storage. Configuring an asynchronous
+ failback safe standby requires setting <xref linkend="guc-synchronous-transfer">
+ to <literal>data_flush</>. This configuration will cause only data
+ page modifications to wait for confirmation that the standby has written
+ the corresponding WAL record to durable storage.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
***************
*** 1201,1212 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</para>
<para>
! So, switching from primary to standby server can be fast but requires
! some time to re-prepare the failover cluster. Regular switching from
! primary to standby is useful, since it allows regular downtime on
! each system for maintenance. This also serves as a test of the
! failover mechanism to ensure that it will really work when you need it.
! Written administration procedures are advised.
</para>
<para>
--- 1259,1286 ----
</para>
<para>
! At the time of failover there is a possibility of file-system level
! inconsistency between the old primary and the old standby server; hence
! a fresh backup from the new master onto the old master is needed to
! configure the old master server as a new standby server. Without a fresh
! backup, even if the new standby starts, streaming replication does not
! start successfully. Taking the backup can be fast for a smaller
! database, but a large database requires more time to re-prepare the
! failover cluster and could break the service level agreement for crash
! recovery. The need for a fresh backup and the problem of long
! recovery time can be solved if the HA cluster is configured with a
! failback safe standby; see <xref linkend="failback-safe">.
! A failback safe standby makes WAL transfer synchronous at the required
! places and maintains file-system level consistency between the
! master and standby server, so the old master server can be easily
! configured as a new standby server.
! </para>
!
! <para>
! Regular switching from primary to standby is useful, since it allows
! regular downtime on each system for maintenance. This also serves as
! a test of the failover mechanism to ensure that it will really work
! when you need it. Written administration procedures are advised.
</para>
<para>
*** a/doc/src/sgml/perform.sgml
--- b/doc/src/sgml/perform.sgml
***************
*** 1569,1574 **** SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
--- 1569,1582 ----
corruption) in case of a crash of the <emphasis>database</> alone.
</para>
</listitem>
+
+ <listitem>
+ <para>
+ Set <xref linkend="guc-synchronous-transfer"> to <literal>commit</>; there
+ is no need to guard against database inconsistency between the master and
+ standby servers during failover.
+ </para>
+ </listitem>
</itemizedlist>
</para>
</sect1>
*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 37,42 ****
--- 37,44 ----
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
***************
*** 708,715 **** WriteZeroPageXlogRec(int pageno)
/*
* Write a TRUNCATE xlog record
*
! * We must flush the xlog record to disk before returning --- see notes
! * in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
--- 710,719 ----
/*
* Write a TRUNCATE xlog record
*
! * Before returning we must flush the xlog record to disk
! * and, if synchronous transfer is requested, wait for the
! * failback safe standby to receive WAL up to recptr.
! * --- see notes in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
***************
*** 723,728 **** WriteTruncateXlogRec(int pageno)
--- 727,738 ----
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /*
+ * Wait for failback safe standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, true);
}
/*
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
***************
*** 54,59 ****
--- 54,61 ----
#include "access/slru.h"
#include "access/transam.h"
#include "access/xlog.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
***************
*** 744,749 **** SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
--- 746,758 ----
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to max_lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(max_lsn, true, true);
}
}
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1091,1102 **** EndPrepare(GlobalTransaction gxact)
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
! SyncRepWaitForLSN(gxact->prepare_lsn);
records.tail = records.head = NULL;
}
--- 1091,1102 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked the prepare, but still show as
* running in the procarray (twice!) and continue to hold locks.
*/
! SyncRepWaitForLSN(gxact->prepare_lsn, false, true);
records.tail = records.head = NULL;
}
***************
*** 2058,2069 **** RecordTransactionCommitPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr);
}
/*
--- 2058,2069 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr, false, true);
}
/*
***************
*** 2138,2147 **** RecordTransactionAbortPrepared(TransactionId xid,
END_CRIT_SECTION();
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr);
}
--- 2138,2147 ----
END_CRIT_SECTION();
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
! SyncRepWaitForLSN(recptr, false, true);
}
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1189,1201 **** RecordTransactionCommit(void)
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
! * Wait for synchronous replication, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
--- 1189,1201 ----
latestXid = TransactionIdLatest(xid, nchildren, children);
/*
! * Wait for synchronous/synchronous failback safe standby, if required.
*
* Note that at this stage we have marked clog, but still show as running
* in the procarray and continue to hold locks.
*/
if (wrote_xlog)
! SyncRepWaitForLSN(XactLastRecEnd, false, true);
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd = 0;
***************
*** 4690,4697 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
--- 4690,4706 ----
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
+ {
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
+ }
}
/*
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 39,44 ****
--- 39,45 ----
#include "pgstat.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+ #include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/barrier.h"
***************
*** 8282,8287 **** CreateCheckPoint(int flags)
--- 8283,8300 ----
END_CRIT_SECTION();
/*
+ * If synchronous transfer is requested, wait for the failback safe standby
+ * to receive WAL up to the checkpoint WAL record. Otherwise, a failure
+ * occurring before the standby receives the CHECKPOINT WAL record causes
+ * an inconsistency between the control files of the master and standby:
+ * the master will restart from a location not known to the standby at failover.
+ *
+ * There is no need to wait for a shutdown CHECKPOINT.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(recptr, true, !shutdown);
+
+ /*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 25,30 ****
--- 25,32 ----
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
***************
*** 288,293 **** RelationTruncate(Relation rel, BlockNumber nblocks)
--- 290,303 ----
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn. Otherwise, we may have a situation where
+ * the heap is truncated, but the action never replayed on the standby.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/* Do the real work */
***************
*** 521,526 **** smgr_redo(XLogRecPtr lsn, XLogRecord *record)
--- 531,543 ----
*/
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
***************
*** 85,90 ****
--- 85,92 ----
#include "postmaster/autovacuum.h"
#include "postmaster/fork_process.h"
#include "postmaster/postmaster.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/latch.h"
***************
*** 1591,1598 **** AutoVacWorkerMain(int argc, char *argv[])
* Force synchronous replication off to allow regular maintenance even if
* we are waiting for standbys to connect. This is important to ensure we
* aren't blocked from performing anti-wraparound tasks.
*/
! if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
SetConfigOption("synchronous_commit", "local",
PGC_SUSET, PGC_S_OVERRIDE);
--- 1593,1602 ----
* Force synchronous replication off to allow regular maintenance even if
* we are waiting for standbys to connect. This is important to ensure we
* aren't blocked from performing anti-wraparound tasks.
+ * Note that if synchronous transfer is requested, we can't perform regular
+ * maintenance until the standbys connect.
*/
! if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH && !SyncTransRequested())
SetConfigOption("synchronous_commit", "local",
PGC_SUSET, PGC_S_OVERRIDE);
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 67,72 **** static bool announce_next_takeover = true;
--- 67,74 ----
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+ int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
+
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
***************
*** 83,110 **** static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
! * Wait for synchronous replication, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING before adding ourselves
! * to the wait queue. During SyncRepWakeQueue() a WALSender changes
! * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
! * This backend then resets its state to SYNC_REP_NOT_WAITING.
*/
! void
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
{
char *new_status = NULL;
const char *old_status;
int mode = SyncRepWaitMode;
/*
! * Fast exit if user has not requested sync replication, or there are no
! * sync replication standby names defined. Note that those standbys don't
! * need to be connected.
*/
! if (!SyncRepRequested() || !SyncStandbysDefined())
! return;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
--- 85,151 ----
*/
/*
! * Wait for synchronous/failback safe standby, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
! * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
! * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
! * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
! * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
! *
! * ForDataFlush - if TRUE, we wait before flushing a data page;
! * otherwise wait for the sync standby.
! *
! * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
! * the standby has already progressed up to the given XactCommitLSN
! *
! * Return TRUE if either the synchronous standby/failback safe standby is not
! * configured/turned off OR the standby has made enough progress
*/
! bool
! SyncRepWaitForLSN(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
char *new_status = NULL;
const char *old_status;
int mode = SyncRepWaitMode;
+ bool ret;
/*
! * Fast exit if there are no sync replication standby names defined.
! * Note that those standbys don't need to be connected.
*/
! if (!SyncStandbysDefined())
! return true;
!
! /* if user has not requested sync replication, exit */
! if (!SyncRepRequested() && !ForDataFlush)
! return true;
!
! /*
! * If the caller has specified ForDataFlush, but synchronous transfer
! * is not specified or it's turned off, exit.
! *
! * We would like to allow the failback safe mechanism even for cascaded
! * standbys as well. But we can't really wait for the standby to catch
! * up until we reach a consistent state since the standbys won't be
! * even able to connect without us reaching in that state (XXX Confirm)
! */
! if (!SyncTransRequested() && ForDataFlush)
! return true;
!
! /*
! * If the caller has not specified ForDataFlush, but synchronous commit
! * is skipped by values of synchronous_transfer, exit.
! */
! if (IsSyncRepSkipped() && !ForDataFlush)
! return true;
!
! /*
! * If both synchronous replication and synchronous transfer
! * are requested but the system is still in recovery, exit.
! */
! if (RecoveryInProgress())
! return true;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
***************
*** 120,130 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so it's likely to
* be a low cost check.
*/
! if (!WalSndCtl->sync_standbys_defined ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return;
}
/*
--- 161,180 ----
* condition but we'll be fetching that cache line anyway so it's likely to
* be a low cost check.
*/
! if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
! return true;
! }
!
! /*
! * Exit if we are told not to block on the standby.
! */
! if (!Wait)
! {
! LWLockRelease(SyncRepLock);
! return false;
}
/*
***************
*** 151,156 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 201,208 ----
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
***************
*** 187,193 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 239,248 ----
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
***************
*** 264,269 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 319,326 ----
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
}
/*
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,48 ----
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
***************
*** 1975,1982 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
XLogFlush(recptr);
!
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
--- 1977,1990 ----
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
+ {
XLogFlush(recptr);
! /* If synchronous transfer is requested, wait for failback safe standby
! * to receive WAL up to recptr.
! */
! if (SyncTransRequested())
! SyncRepWaitForLSN(recptr, true, true);
! }
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
*** a/src/backend/utils/cache/relmapper.c
--- b/src/backend/utils/cache/relmapper.c
***************
*** 48,53 ****
--- 48,55 ----
#include "catalog/pg_tablespace.h"
#include "catalog/storage.h"
#include "miscadmin.h"
+ #include "replication/syncrep.h"
+ #include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/lwlock.h"
#include "utils/inval.h"
***************
*** 711,716 **** write_relmap_file(bool shared, RelMapFile *newmap,
--- 713,719 ----
int fd;
RelMapFile *realmap;
char mapfilename[MAXPGPATH];
+ XLogRecPtr lsn=InvalidXLogRecPtr;
/*
* Fill in the overhead fields and update CRC.
***************
*** 753,759 **** write_relmap_file(bool shared, RelMapFile *newmap,
{
xl_relmap_update xlrec;
XLogRecData rdata[2];
- XLogRecPtr lsn;
/* now errors are fatal ... */
START_CRIT_SECTION();
--- 756,761 ----
***************
*** 849,854 **** write_relmap_file(bool shared, RelMapFile *newmap,
--- 851,863 ----
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 381,386 **** static const struct config_enum_entry synchronous_commit_options[] = {
--- 381,396 ----
};
/*
+ * Options for synchronous_transfer: "all", "data_flush", and "commit"
+ */
+ static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, false},
+ {NULL, 0, false}
+ };
+
+ /*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
***************
*** 3300,3305 **** static struct config_enum ConfigureNamesEnum[] =
--- 3310,3325 ----
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, NULL, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 220,225 ****
--- 220,227 ----
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+ #synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 60,65 ****
--- 60,67 ----
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+ #include "replication/walsender.h"
+ #include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
***************
*** 115,120 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer,
--- 117,134 ----
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous transfer is requested, we check if the commit WAL record
+ * has made it to the standby before allowing hint bit updates. We should not
+ * wait for the standby to receive the WAL since it's OK to delay hint bit
+ * updates.
+ */
+ if (SyncTransRequested())
+ {
+ if(!SyncRepWaitForLSN(commitLSN, true, false))
+ return;
+ }
}
tuple->t_infomask |= infomask;
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 19,24 ****
--- 19,30 ----
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+ #define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+ #define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
#define SYNC_REP_NO_WAIT -1
#define SYNC_REP_WAIT_WRITE 0
***************
*** 31,41 ****
#define SYNC_REP_WAITING 1
#define SYNC_REP_WAIT_COMPLETE 2
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
/* called by user backend */
! extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
--- 37,59 ----
#define SYNC_REP_WAITING 1
#define SYNC_REP_WAIT_COMPLETE 2
+ typedef enum
+ {
+ SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for flush data page */
+ SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only
+ * no wait for WAL */
+ SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush and WAL*/
+ } SynchronousTransferLevel;
+
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+ /* user-settable parameters for failback safe replication */
+ extern int synchronous_transfer;
+
/* called by user backend */
! extern bool SyncRepWaitForLSN(XLogRecPtr XactCommitLSN,
! bool ForDataFlush, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
On Fri, Oct 4, 2013 at 1:46 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Sep 27, 2013 at 6:44 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>>> On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>> Thank you for the comment. I think it is a good, simple idea.
>>>> In your opinion, if synchronous_transfer is set to 'all' and
>>>> synchronous_commit is set to 'on', the master waits for data flush even
>>>> if the user sets synchronous_commit to 'local' or 'off'.
>>>> For example, when the user wants a fast transaction, they can't have one.
>>>> Do we leave such a situation as a constraint?
>>>
>>> No, the user can still override the transaction commit point wait. So if
>>> synchronous_transfer is set to "all":
>>> - If synchronous_commit is ON - wait at all points
>>> - If synchronous_commit is OFF - wait only at buffer flush (and other
>>>   points related to failback safety)
>>> If synchronous_transfer is set to "data_flush":
>>> - Whether synchronous_commit is ON or OFF - do not wait at commit points,
>>>   but wait at all other points
>>> If synchronous_transfer is set to "commit":
>>> - If synchronous_commit is ON - wait at commit point
>>> - If synchronous_commit is OFF - do not wait at any point
>>
>> Thank you for explaining. Understood.
>> So if synchronous_transfer is set to 'all' and the user changes
>> synchronous_commit to 'off' (or 'local') for a transaction, the master
>> waits at buffer flush but doesn't wait at commit points. Right?
>
> In the current patch, synchronous_transfer works in cooperation with
> synchronous_commit. But if the user changes synchronous_commit within a
> transaction, they are not in cooperation.
> So your idea might be better than the current behaviour of
> synchronous_transfer.
>
> I attached the v11 patch which fixes the above.
You added several checks into SyncRepWaitForLSN() so that it can handle both
synchronous_transfer=data_flush and =commit. This change made the source code
of the function very complicated, I'm afraid. To simplify the source code,
what about just adding new wait-for-lsn function for data_flush instead of
changing SyncRepWaitForLSN()? Obviously that new function and
SyncRepWaitForLSN() would have a common part. I think that it should be
extracted as a separate function.
+ * Note that if sync transfer is requested, we can't regular maintenance until
+ * standbys to connect.
+ */
- if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+ if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH &&
+     !SyncTransRequested())
Per discussion with Pavan, ISTM we don't need to avoid setting
synchronous_commit
to local even if synchronous_transfer is data_flush. But you did that here. Why?
When synchronous_transfer = data_flush, anti-wraparound vacuum can be blocked.
Is this safe?
+#synchronous_transfer = commit # data page synchronization level
+ # commit, data_flush or all
This comment seems confusing. I think that this parameter specifies when to
wait for replication.
+typedef enum
+{
+ SYNCHRONOUS_TRANSFER_COMMIT, /* no wait for flush data page */
+ SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only
+ * no wait for WAL */
+ SYNCHRONOUS_TRANSFER_ALL /* wait for data page flush and WAL*/
+} SynchronousTransferLevel;
These comments also seem confusing. For example, I think that the meaning of
SYNCHRONOUS_TRANSFER_COMMIT is something like "wait for replication on
transaction commit".
@@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
*/
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepWaitForLSN(lsn, true, true);
If smgr_redo() is called only during recovery, SyncRepWaitForLSN() doesn't need
to be called here.
Regards,
--
Fujii Masao
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
You added several checks into SyncRepWaitForLSN() so that it can handle both
synchronous_transfer=data_flush and =commit. This change made the source code
of the function very complicated, I'm afraid. To simplify the source code,
what about just adding new wait-for-lsn function for data_flush instead of
changing SyncRepWaitForLSN()? Obviously that new function and
SyncRepWaitForLSN()
have the common part. I think that it should be extracted as separate function.
Thank you for reviewing and commenting!
Yes, I agree with you.
I attached the v12 patch, which I modified based on the above suggestions.
- Added new functions SyncRepTransferWaitForLSN() and SyncRepWait().
  SyncRepTransferWaitForLSN() is called on data page flush. OTOH,
  SyncRepWaitForLSN() is called on transaction commit.
  Both functions call SyncRepWait() after checking whether sync
  commit/transfer is requested.
  In practice the server waits in SyncRepWait().
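The call structure described above can be sketched as follows. XLogRecPtr and the GUC checks are stubbed for illustration; only the delegation pattern (two entry points sharing one common wait routine) matches the attached patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Stand-ins for SyncRepRequested() / SyncTransRequested() in the patch */
static bool sync_rep_requested = true;
static bool sync_trans_requested = true;

/* Common part: block until a walsender confirms lsn (body stubbed here) */
static bool
SyncRepWait(XLogRecPtr lsn, bool forDataFlush, bool wait)
{
    (void) lsn; (void) forDataFlush; (void) wait;
    return true;                     /* pretend the standby caught up */
}

/* Entry point used at transaction commit */
static void
SyncRepWaitForLSN(XLogRecPtr commitLSN)
{
    if (!sync_rep_requested)
        return;
    (void) SyncRepWait(commitLSN, false, true);
}

/* Entry point used at data page flush; may be non-blocking (wait == false) */
static bool
SyncRepTransferWaitForLSN(XLogRecPtr lsn, bool wait)
{
    if (!sync_trans_requested)
        return true;                 /* feature off: report "caught up" */
    return SyncRepWait(lsn, true, wait);
}
```

This addresses Fujii's review comment: the two entry points do their own fast-exit checks, and the complicated shared logic lives in one place.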
+ * Note that if sync transfer is requested, we can't regular maintenance until
+ * standbys to connect.
+ */
- if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+ if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH &&
+     !SyncTransRequested())

Per discussion with Pavan, ISTM we don't need to avoid setting
synchronous_commit to local even if synchronous_transfer is data_flush.
But you did that here. Why?
I made a mistake. I have removed it.
When synchronous_transfer = data_flush, anti-wraparound vacuum can be blocked.
Is this safe?
In the new version of the patch, when synchronous_transfer = data_flush/all
AND synchronous_standby_names is set, vacuum is blocked.
This behaviour of synchronous_transfer is similar to synchronous_commit.
Should we allow anti-wraparound vacuum to proceed even if
synchronous_transfer = data_flush/all?
If so, should we also allow flushing the data page while doing vacuum?
+#synchronous_transfer = commit # data page synchronization level
+                               # commit, data_flush or all

This comment seems confusing. I think that this parameter specifies when to
wait for replication.

+typedef enum
+{
+    SYNCHRONOUS_TRANSFER_COMMIT,     /* no wait for flush data page */
+    SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* wait for data page flush only
+                                      * no wait for WAL */
+    SYNCHRONOUS_TRANSFER_ALL         /* wait for data page flush and WAL*/
+} SynchronousTransferLevel;

These comments also seem confusing. For example, I think that the meaning of
SYNCHRONOUS_TRANSFER_COMMIT is something like "wait for replication on
transaction commit".
Those comments are modified in the new patch.
@@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
      */
     XLogFlush(lsn);

+    /*
+     * If synchronous transfer is requested, wait for failback safe standby
+     * to receive WAL up to lsn.
+     */
+    if (SyncTransRequested())
+        SyncRepWaitForLSN(lsn, true, true);

If smgr_redo() is called only during recovery, SyncRepWaitForLSN() doesn't
need to be called here.
Thank you for the info.
I have removed it from smgr_redo().
Regards,
-------
Sawada Masahiko
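For anyone trying the attached patch, a minimal master-side configuration for the synchronous failback safe mode discussed in this thread might look like this ('standby1' is a placeholder for the standby's application_name):

```
# postgresql.conf on the master (sketch)
synchronous_standby_names = 'standby1'
synchronous_transfer = all          # wait at commit and at data page flush
#synchronous_transfer = data_flush  # asynchronous failback safe variant
```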
Attachments:
synchronous_transfer_v12.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 370aa09..86d2265 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1749,6 +1749,39 @@ include 'filename'
</listitem>
</varlistentry>
+ <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">
+ <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_transfer</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ This parameter controls the synchronous nature of WAL transfer and
+ maintains file system level consistency between master server and
+ standby server. Specifies whether master server will wait for file
+ system level change (for example : modifying data page) before
+ corresponding WAL records are replicated to the standby server.
+ </para>
+ <para>
+ Valid values are <literal>commit</>, <literal>data_flush</> and
+ <literal>all</>. The default value is <literal>commit</>, meaning
+ that the master only waits for transaction commits; this is equivalent
+ to turning the <literal>synchronous_transfer</> parameter off, and the
+ standby server will behave as a <quote>synchronous standby</> in
+ Streaming Replication. For the value <literal>data_flush</>, the master
+ will wait only for data page modifications but not for transaction
+ commits, hence the standby server will act as an <quote>asynchronous
+ failback safe standby</>. For the value <literal>all</>, the master will
+ wait for data page modifications as well as for transaction commits, and
+ the resultant standby server will act as a <quote>synchronous failback
+ safe standby</>. The wait is on background activities and hence will not
+ create much performance overhead.
+ To configure a synchronous failback safe standby,
+ <xref linkend="guc-synchronous-standby-names"> should be set.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-wal-sync-method" xreflabel="wal_sync_method">
<term><varname>wal_sync_method</varname> (<type>enum</type>)</term>
<indexterm>
@@ -2258,14 +2291,25 @@ include 'filename'
</indexterm>
<listitem>
<para>
- Specifies a comma-separated list of standby names that can support
- <firstterm>synchronous replication</>, as described in
- <xref linkend="synchronous-replication">.
- At any one time there will be at most one active synchronous standby;
- transactions waiting for commit will be allowed to proceed after
- this standby server confirms receipt of their data.
- The synchronous standby will be the first standby named in this list
- that is both currently connected and streaming data in real-time
+ Specifies a comma-separated list of standby names. If this parameter
+ is set, the standby will behave as a synchronous standby in replication,
+ as described in <xref linkend="synchronous-replication">, or as a synchronous
+ failback safe standby, as described in <xref linkend="failback-safe">.
+ At any time there will be at most one active standby; when the standby is
+ a synchronous standby in replication, transactions waiting for commit
+ will be allowed to proceed after this standby server confirms receipt
+ of their data. When the standby is a synchronous failback safe standby,
+ data page modifications as well as transaction commits will be allowed
+ to proceed only after this standby server confirms receipt of their data.
+ If this parameter is set to an empty value and
+ <xref linkend="guc-synchronous-transfer"> is set to <literal>data_flush</>,
+ the standby is called an asynchronous failback safe standby, and only
+ data page modifications will wait for the corresponding WAL record to be
+ replicated to the standby.
+ </para>
+ <para>
+ The synchronous standby in replication will be the first standby named in
+ this list that is both currently connected and streaming data in real-time
(as shown by a state of <literal>streaming</literal> in the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view).
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index c8f6fa8..e551077 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1140,6 +1140,64 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+
+ <sect2 id="failback-safe">
+ <title>Setting up failback safe standby</title>
+
+ <indexterm zone="high-availability">
+ <primary>Setting up failback safe standby</primary>
+ </indexterm>
+
+ <para>
+ PostgreSQL Streaming Replication offers durability, but if the master
+ crashes and a particular WAL record is unable to reach the standby
+ server, then that WAL record is present on the master server but not
+ on the standby server. In such a case the master is ahead of the standby
+ in terms of WAL records and data in the database. This leads to
+ file-system level inconsistency between master and standby server.
+ For example, a heap page update on the master might not have been
+ reflected on the standby when the master crashes.
+ </para>
+
+ <para>
+ Due to this inconsistency, a fresh backup of the new master onto the new
+ standby is needed to re-prepare the HA cluster. Taking a fresh backup can
+ be a very time consuming process when the database is large. In such a
+ case disaster recovery can take a very long time if Streaming Replication
+ is used to set up the high availability cluster.
+ </para>
+
+ <para>
+ If the HA cluster is configured with a failback safe standby, this fresh
+ backup can be avoided. The <xref linkend="guc-synchronous-transfer">
+ parameter controls all WAL transfers, and the master will not make any
+ file system level change until it gets a confirmation from the standby
+ server. This avoids the need for a fresh backup by maintaining consistency.
+ </para>
+
+ <sect3 id="Failback-safe-config">
+ <title>Basic Configuration</title>
+ <para>
+ A failback safe standby can be asynchronous or synchronous in nature,
+ depending on whether the master waits for transaction commit
+ or not. By default the failback safe mechanism is turned off.
+ </para>
+
+ <para>
+ The first step to configure HA with a failback safe standby is to set up
+ synchronous streaming replication. Configuring a synchronous failback
+ safe standby requires setting <xref linkend="guc-synchronous-transfer"> to
+ <literal>all</>. This configuration will cause each commit and data
+ page modification to wait for confirmation that the standby has written
+ the corresponding WAL record to durable storage. Configuring an asynchronous
+ failback safe standby requires setting <xref linkend="guc-synchronous-transfer">
+ to <literal>data_flush</>. This configuration will cause only data
+ page modifications to wait for confirmation that the standby has written
+ the corresponding WAL record to durable storage.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1201,12 +1259,28 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</para>
<para>
- So, switching from primary to standby server can be fast but requires
- some time to re-prepare the failover cluster. Regular switching from
- primary to standby is useful, since it allows regular downtime on
- each system for maintenance. This also serves as a test of the
- failover mechanism to ensure that it will really work when you need it.
- Written administration procedures are advised.
+ At the time of failover there is a possibility of file-system level
+ inconsistency between the old primary and the old standby server; hence
+ a fresh backup from the new master onto the old master is needed to
+ configure the old master as a new standby server. Without taking a fresh
+ backup, even if the new standby starts, streaming replication does not
+ start successfully. Taking a backup can be fast for a smaller
+ database, but a large database requires more time to re-prepare the
+ failover cluster and could break the service level agreement for crash
+ recovery. The need for a fresh backup and the problem of long
+ recovery time can be solved if the HA cluster is configured with a
+ failback safe standby; see <xref linkend="failback-safe">.
+ A failback safe standby makes WAL transfer synchronous at the required
+ places and maintains file-system level consistency between
+ master and standby server, so the old master server can be easily
+ configured as a new standby server.
+ </para>
+
+ <para>
+ Regular switching from primary to standby is useful, since it allows
+ regular downtime on each system for maintenance. This also serves as
+ a test of the failover mechanism to ensure that it will really work
+ when you need it. Written administration procedures are advised.
</para>
<para>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 2af1738..b074a91 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1569,6 +1569,14 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
corruption) in case of a crash of the <emphasis>database</> alone.
</para>
</listitem>
+
+ <listitem>
+ <para>
+ Set <xref linkend="guc-synchronous-transfer"> to <literal>commit</> if
+ there is no need to guard against database inconsistency between master
+ and standby server during failover.
+ </para>
+ </listitem>
</itemizedlist>
</para>
</sect1>
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index cb95aa3..5759791 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -37,6 +37,8 @@
#include "access/transam.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -708,8 +710,10 @@ WriteZeroPageXlogRec(int pageno)
/*
* Write a TRUNCATE xlog record
*
- * We must flush the xlog record to disk before returning --- see notes
- * in TruncateCLOG().
+ * Before returning we must flush the xlog record to disk and,
+ * if synchronous transfer is requested, wait for the failback
+ * safe standby to receive WAL up to recptr.
+ * --- see notes in TruncateCLOG().
*/
static void
WriteTruncateXlogRec(int pageno)
@@ -723,6 +727,12 @@ WriteTruncateXlogRec(int pageno)
rdata.next = NULL;
recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
XLogFlush(recptr);
+
+ /*
+ * Wait for failback safe standby.
+ */
+ if (SyncTransRequested())
+ SyncRepTransferWaitForLSN(recptr, true);
}
/*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5e53593..b01b3b6 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,6 +54,8 @@
#include "access/slru.h"
#include "access/transam.h"
#include "access/xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/shmem.h"
#include "miscadmin.h"
@@ -744,6 +746,13 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
START_CRIT_SECTION();
XLogFlush(max_lsn);
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to max_lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepTransferWaitForLSN(max_lsn, true);
}
}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0591f3f..0a0a49c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4690,8 +4690,17 @@ xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
* for any user that requested ForceSyncCommit().
*/
if (XactCompletionForceSyncCommit(xinfo))
+ {
XLogFlush(lsn);
+ /*
+ * If synchronous transfer is requested, wait for failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested())
+ SyncRepTransferWaitForLSN(lsn, true);
+
+ }
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fc495d6..4c1d2ac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -39,6 +39,7 @@
#include "pgstat.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/barrier.h"
@@ -8282,6 +8283,18 @@ CreateCheckPoint(int flags)
END_CRIT_SECTION();
/*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to the checkpoint WAL record. Otherwise, a failure
+ * before the standby receives the CHECKPOINT WAL record causes an
+ * inconsistency between the control files of master and standby, and the
+ * master will start from a location not known to the standby at failover time.
+ *
+ * There is no need to wait for shutdown CHECKPOINT.
+ */
+ if (SyncTransRequested())
+ SyncRepTransferWaitForLSN(recptr, !shutdown);
+
+ /*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
smgrpostckpt();
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 971a149..70595ca 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -25,6 +25,8 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
#include "utils/memutils.h"
@@ -288,6 +290,14 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
*/
if (fsm || vm)
XLogFlush(lsn);
+
+ /*
+ * If synchronous transfer is requested, wait for failback safe standby
+ * to receive WAL up to lsn. Otherwise, we may have a situation where
+ * the heap is truncated, but the action never replayed on the standby.
+ */
+ if (SyncTransRequested())
+ SyncRepTransferWaitForLSN(lsn, true);
}
/* Do the real work */
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 8cf1346..fc18883 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -66,6 +66,8 @@ char *SyncRepStandbyNames;
static bool announce_next_takeover = true;
static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
+static int SyncTransferMode = SYNC_REP_NO_WAIT;
+int synchronous_transfer = SYNCHRONOUS_TRANSFER_COMMIT;
static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
@@ -83,28 +85,24 @@ static bool SyncRepQueueIsOrderedByLSN(int mode);
*/
/*
- * Wait for synchronous replication, if requested by user.
+ * Wait for synchronous standby, if requested by user.
*
* Initially backends start in state SYNC_REP_NOT_WAITING and then
- * change that state to SYNC_REP_WAITING before adding ourselves
- * to the wait queue. During SyncRepWakeQueue() a WALSender changes
- * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
- * This backend then resets its state to SYNC_REP_NOT_WAITING.
+ * change that state to SYNC_REP_WAITING/SYNC_REP_WAITING_FOR_DATA_FLUSH
+ * before adding ourselves to the wait queue. During SyncRepWakeQueue() a
+ * WALSender changes the state to SYNC_REP_WAIT_COMPLETE once replication is
+ * confirmed. This backend then resets its state to SYNC_REP_NOT_WAITING.
+ *
+ * ForDataFlush - if TRUE, we wait before flushing data page.
+ * Otherwise wait for the sync standby
*/
-void
-SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+static bool
+SyncRepWait(XLogRecPtr XactCommitLSN, bool ForDataFlush, bool Wait)
{
- char *new_status = NULL;
- const char *old_status;
- int mode = SyncRepWaitMode;
-
- /*
- * Fast exit if user has not requested sync replication, or there are no
- * sync replication standby names defined. Note that those standbys don't
- * need to be connected.
- */
- if (!SyncRepRequested() || !SyncStandbysDefined())
- return;
+ char *new_status = NULL;
+ const char *old_status;
+ int mode = !ForDataFlush ? SyncRepWaitMode : SyncTransferMode;
+ bool ret;
Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
Assert(WalSndCtl != NULL);
@@ -120,11 +118,20 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* condition but we'll be fetching that cache line anyway so it's likely to
* be a low cost check.
*/
- if (!WalSndCtl->sync_standbys_defined ||
+ if ((!ForDataFlush && !WalSndCtl->sync_standbys_defined) ||
XactCommitLSN <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
- return;
+ return true;
+ }
+
+ /*
+ * Exit if we are told not to block on the standby.
+ */
+ if (!Wait)
+ {
+ LWLockRelease(SyncRepLock);
+ return false;
}
/*
@@ -151,6 +158,8 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
new_status[len] = '\0'; /* truncate off " waiting ..." */
}
+ ret = false;
+
/*
* Wait for specified LSN to be confirmed.
*
@@ -187,7 +196,10 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
LWLockRelease(SyncRepLock);
}
if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ {
+ ret = true;
break;
+ }
/*
* If a wait for synchronous replication is pending, we can neither
@@ -264,6 +276,65 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
set_ps_display(new_status, false);
pfree(new_status);
}
+
+ return ret;
+}
+
+/*
+ * Wait for synchronous standby on transaction commit, if requested by user.
+ */
+void
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ /*
+ * Fast exit if there are no sync replication standby names defined,
+ * or sync replication is not requested. Note that those standbys don't
+ * need to be connected.
+ */
+ if (!SyncStandbysDefined() || !SyncRepRequested())
+ return;
+
+ /*
+ * If synchronous_transfer is set to 'data_flush', exit.
+ * Note that this applies regardless of the value of synchronous_commit.
+ */
+ if (IsSyncRepSkipped())
+ return;
+
+ /* Wait for replication on transaction commit, ForDataFlush is false */
+ SyncRepWait(XactCommitLSN, false, true);
+}
+
+/*
+ * Wait for synchronous standby on data page flush, if requested by user.
+ *
+ * Wait - if FALSE, we don't actually wait, but tell the caller whether or not
+ * the standby has already made progress up to the given XactCommitLSN.
+ *
+ * Return TRUE if either the synchronous standby/failback safe standby is not
+ * configured (or is turned off), OR the standby has made enough progress.
+ */
+bool
+SyncRepTransferWaitForLSN(XLogRecPtr XactCommitLSN, bool Wait)
+{
+ bool ret;
+
+ /*
+ * Fast exit if there are no sync replication standby names defined,
+ * or sync transfer is not requested. Note that those standbys don't
+ * need to be connected.
+ */
+ if (!SyncStandbysDefined() || !SyncTransRequested())
+ return true;
+
+ /* System still in recovery, exit */
+ if (RecoveryInProgress())
+ return true;
+
+ /* Wait for replication on data page flush, ForDataFlush is true */
+ ret = SyncRepWait(XactCommitLSN, true, Wait);
+
+ return ret;
}
/*
@@ -710,3 +781,18 @@ assign_synchronous_commit(int newval, void *extra)
break;
}
}
+
+void
+assign_synchronous_transfer(int newval, void *extra)
+{
+ switch (newval)
+ {
+ case SYNCHRONOUS_TRANSFER_ALL:
+ case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
+ SyncTransferMode = SYNC_REP_WAIT_FLUSH;
+ break;
+ default:
+ SyncTransferMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f848391..262509b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -41,6 +41,8 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
@@ -1975,8 +1977,14 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
* skip the flush if the buffer isn't permanent.
*/
if (buf->flags & BM_PERMANENT)
+ {
XLogFlush(recptr);
-
+ /*
+ * If synchronous transfer is requested, wait for the failback safe
+ * standby to receive WAL up to recptr.
+ */
+ if (SyncTransRequested())
+ SyncRepTransferWaitForLSN(recptr, true);
+ }
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 18f0342..9e81edb 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -48,6 +48,8 @@
#include "catalog/pg_tablespace.h"
#include "catalog/storage.h"
#include "miscadmin.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/lwlock.h"
#include "utils/inval.h"
@@ -711,6 +713,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
int fd;
RelMapFile *realmap;
char mapfilename[MAXPGPATH];
+ XLogRecPtr lsn = InvalidXLogRecPtr;
/*
* Fill in the overhead fields and update CRC.
@@ -753,7 +756,6 @@ write_relmap_file(bool shared, RelMapFile *newmap,
{
xl_relmap_update xlrec;
XLogRecData rdata[2];
- XLogRecPtr lsn;
/* now errors are fatal ... */
START_CRIT_SECTION();
@@ -849,6 +851,13 @@ write_relmap_file(bool shared, RelMapFile *newmap,
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
+
+ /*
+ * If synchronous transfer is requested, wait for the failback safe
+ * standby to receive WAL up to lsn.
+ */
+ if (SyncTransRequested() && write_wal)
+ SyncRepTransferWaitForLSN(lsn, true);
}
/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3107f9c..a92dba4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -381,6 +381,16 @@ static const struct config_enum_entry synchronous_commit_options[] = {
};
/*
+ * Options for synchronous_transfer: "all", "data_flush" and "commit".
+ */
+static const struct config_enum_entry synchronous_transfer_options[] = {
+ {"all", SYNCHRONOUS_TRANSFER_ALL, false},
+ {"data_flush", SYNCHRONOUS_TRANSFER_DATA_FLUSH, false},
+ {"commit", SYNCHRONOUS_TRANSFER_COMMIT, false},
+ {NULL, 0, false}
+};
+
+/*
* Options for enum values stored in other modules
*/
extern const struct config_enum_entry wal_level_options[];
@@ -3300,6 +3310,16 @@ static struct config_enum ConfigureNamesEnum[] =
},
{
+ {"synchronous_transfer", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Sets the data flush synchronization level"),
+ NULL
+ },
+ &synchronous_transfer,
+ SYNCHRONOUS_TRANSFER_COMMIT, synchronous_transfer_options,
+ NULL, assign_synchronous_transfer, NULL
+ },
+
+ {
{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
gettext_noop("Enables logging of recovery-related debugging information."),
gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d69a02b..ababb81 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -220,6 +220,8 @@
#synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
+#synchronous_transfer = commit # specifies when to wait for replication
+ # commit, data_flush or all
#vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
# - Standby Servers -
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index ed66c49..050b6c9 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -60,6 +60,8 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/xact.h"
+#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "utils/tqual.h"
@@ -115,6 +117,18 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
return; /* not flushed yet, so don't set hint */
+
+ /*
+ * If synchronous transfer is requested, we check whether the commit WAL
+ * record has made it to the standby before allowing hint bit updates. We
+ * should not wait for the standby to receive the WAL, since it's OK to
+ * delay hint bit updates.
+ */
+ if (SyncTransRequested())
+ {
+ if (!SyncRepTransferWaitForLSN(commitLSN, false))
+ return;
+ }
}
tuple->t_infomask |= infomask;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index ac23ea6..b7b9e6d 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -19,6 +19,12 @@
#define SyncRepRequested() \
(max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+#define SyncTransRequested() \
+ (max_wal_senders > 0 && synchronous_transfer > SYNCHRONOUS_TRANSFER_COMMIT)
+
+#define IsSyncRepSkipped() \
+ (max_wal_senders > 0 && synchronous_transfer == SYNCHRONOUS_TRANSFER_DATA_FLUSH)
+
/* SyncRepWaitMode */
#define SYNC_REP_NO_WAIT -1
#define SYNC_REP_WAIT_WRITE 0
@@ -31,11 +37,23 @@
#define SYNC_REP_WAITING 1
#define SYNC_REP_WAIT_COMPLETE 2
+typedef enum
+{
+ SYNCHRONOUS_TRANSFER_COMMIT, /* Wait for replication on transaction commit */
+ SYNCHRONOUS_TRANSFER_DATA_FLUSH, /* Wait for replication on data page flush */
+ SYNCHRONOUS_TRANSFER_ALL /* Wait for replication on transaction commit and
+ * data page flush */
+} SynchronousTransferLevel;
+
/* user-settable parameters for synchronous replication */
extern char *SyncRepStandbyNames;
+/* user-settable parameters for failback safe replication */
+extern int synchronous_transfer;
+
/* called by user backend */
extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+extern bool SyncRepTransferWaitForLSN(XLogRecPtr XactCommitLSN, bool Wait);
/* called at backend exit */
extern void SyncRepCleanupAtProcExit(void);
@@ -52,5 +70,6 @@ extern int SyncRepWakeQueue(bool all, int mode);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
extern void assign_synchronous_commit(int newval, void *extra);
+extern void assign_synchronous_transfer(int newval, void *extra);
#endif /* _SYNCREP_H */
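To make the intended semantics concrete, here is a small self-contained
sketch (a standalone model, not server code; the names mirror the patch but
none of this is the actual PostgreSQL implementation) of how the
synchronous_transfer GUC maps to a wait mode and when a data page flush must
block:

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the SynchronousTransferLevel enum added by the patch */
typedef enum
{
	SYNCHRONOUS_TRANSFER_COMMIT,		/* wait only on transaction commit */
	SYNCHRONOUS_TRANSFER_DATA_FLUSH,	/* wait only on data page flush */
	SYNCHRONOUS_TRANSFER_ALL			/* wait on both */
} SynchronousTransferLevel;

#define SYNC_REP_NO_WAIT	(-1)
#define SYNC_REP_WAIT_FLUSH	2

/* Model of assign_synchronous_transfer(): only "all" and "data_flush"
 * make data page flushes wait for the standby's flush position. */
static int
transfer_wait_mode(SynchronousTransferLevel level)
{
	switch (level)
	{
		case SYNCHRONOUS_TRANSFER_ALL:
		case SYNCHRONOUS_TRANSFER_DATA_FLUSH:
			return SYNC_REP_WAIT_FLUSH;
		default:
			return SYNC_REP_NO_WAIT;
	}
}

/* Model of the fast-exit check in SyncRepWait(): a flush must block only
 * while the standby's confirmed LSN is still behind the WAL we need. */
static bool
needs_wait(unsigned long long target_lsn, unsigned long long standby_lsn)
{
	return target_lsn > standby_lsn;
}
```

So with synchronous_transfer = data_flush, ordinary commits don't wait
(IsSyncRepSkipped() makes SyncRepWaitForLSN exit early), but FlushBuffer()
blocks until the standby has confirmed WAL up to the page's recptr.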
On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote:
On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I attached the v12 patch, which I have modified based on the above suggestions.
There are still some parts of this design/patch which I am concerned about.
1. The design clubs synchronous standby and failback safe standby rather
very tightly. IIRC this is based on the feedback you received early, so my
apologies for raising it again so late.
a. GUC synchronous_standby_names is used to name synchronous as well as
failback safe standbys. I don't know if that will confuse users.
b. synchronous_commit's value will also control whether a sync/async
failback safe standby wait for remote write or flush. Is that reasonable ?
Or should there be a different way to configure the failback safe standby's
WAL safety ?
2. With the current design/implementation, user can't configure a
synchronous and an async failback safe standby at the same time. I think we
discussed this earlier and there was an agreement on the limitation. Just
wanted to get that confirmed again.
3. SyncRepReleaseWaiters() does not know whether its waking up backends
waiting for sync rep or failback safe rep. Is that ok ? For example, I
found that the elog() message announcing next takeover emitted by the
function may look bad. Since changing synchronous_transfer requires server
restart, we can teach SyncRepReleaseWaiters() to look at that parameter to
figure out whether the standby is sync and/or failback safe standby.
4. The documentation still need more work to clearly explain the use case.
5. Have we done any sort of stress testing of the patch ? If there is a
bug, the data corruption at the master can go unnoticed. So IMHO we need
many crash recovery tests to ensure that the patch is functionally correct.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote:
On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote:
On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I attached the v12 patch which have modified based on above suggestions.
There are still some parts of this design/patch which I am concerned about.
1. The design clubs synchronous standby and failback safe standby rather
very tightly. IIRC this is based on the feedback you received early, so my
apologies for raising it again so late.
It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund <andres@2ndquadrant.com>wrote:
It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.
IIRC you and Tom were particularly skeptical about the approach. But do you
see a technical flaw or a show stopper with the approach ? Heikki has
written pg_rewind which is really very cool. But it fails to handle the
hint bit updates which are not WAL logged unless of course checksums are
turned on. We can have a GUC controlled option to turn WAL logging on for
hint bit updates and then use pg_rewind for the purpose. But I did not see
any agreement on that either. Performance implication of WAL logging every
hint bit update could be huge.
Simon has raised usability concerns that Sawada-san and Samrat have tried
to address by following his suggestions. I am not fully convinced, though,
that we have got that right. But then there is hardly any feedback on that
aspect
lately.
In general, from the discussion it seems that the patch is trying to solve
a real problem. Even though Tom and you feel that rsync is probably good
enough and more trustworthy than any other approach, my feeling is that
many including Fujii-san still disagree with that argument based on real
user feedback. So where do we go from here ? I think it will really help
Sawada-san and Samrat if we can provide them some solid feedback and
approach to take.
Lately, I was thinking if we could do something else to track file system
updates without relying on WAL inspection and then use pg_rewind to solve
this problem. Some sort of preload library mechanism is one such
possibility. But haven't really thought through this entirely.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 08.10.2013 13:00, Pavan Deolasee wrote:
On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund<andres@2ndquadrant.com>wrote:
It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.
IIRC you and Tom were particularly skeptical about the approach. But do you
see a technical flaw or a show stopper with the approach ? Heikki has
written pg_rewind which is really very cool. But it fails to handle the
hint bit updates which are not WAL logged unless of course checksums are
turned on. We can have a GUC controlled option to turn WAL logging on for
hint bit updates and then use pg_rewind for the purpose. But I did not see
any agreement on that either. Performance implication of WAL logging every
hint bit update could be huge.
Yeah, I definitely think we should work on the pg_rewind approach
instead of this patch. It's a lot more flexible. The performance hit of
WAL-logging hint bit updates is the price you have to pay, but a lot of
people were OK with that to get page checksum, so I think a lot of
people would be OK with it for this purpose too. As long as it's
optional, of course. And anyone using page checksums are already paying
that price.
- Heikki
On Tue, Oct 8, 2013 at 6:37 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I attached the v12 patch which have modified based on above suggestions.
There are still some parts of this design/patch which I am concerned about.
1. The design clubs synchronous standby and failback safe standby rather
very tightly. IIRC this is based on the feedback you received early, so my
apologies for raising it again so late.
a. GUC synchrnous_standby_names is used to name synchronous as well as
failback safe standbys. I don't know if that will confuse users.
With the current patch, the user can specify the failback safe standby
and the sync replication standby for the same server, in
synchronous_standby_names.
So I was thinking that it will not confuse users.
b. synchronous_commit's value will also control whether a sync/async
failback safe standby wait for remote write or flush. Is that reasonable ?
Or should there be a different way to configure the failback safe standby's
WAL safety ?
synchronous_commit's value cannot control the wait level of a
sync/async failback safe standby.
On data page flush, the failback safe standby waits only for flush.
Should we also allow waiting for remote write?
2. With the current design/implementation, user can't configure a
synchronous and an async failback safe standby at the same time. I think we
discussed this earlier and there was an agreement on the limitation. Just
wanted to get that confirmed again.
Yes, the user can't configure a sync standby and an async failback safe
standby at the same time.
The current patch supports the following cases:
- sync standby and make same as failback safe standby
- async standby and make same as failback safe standby
3. SyncRepReleaseWaiters() does not know whether its waking up backends
waiting for sync rep or failback safe rep. Is that ok ? For example, I found
that the elog() message announcing next takeover emitted by the function may
look bad. Since changing synchronous_transfer requires server restart, we
can teach SyncRepReleaseWaiters() to look at that parameter to figure out
whether the standby is sync and/or failback safe standby.
I agree with you.
Are you saying about following comment?
if (announce_next_takeover)
{
announce_next_takeover = false;
ereport(LOG,
(errmsg("standby \"%s\" is now the synchronous standby
with priority %u",
application_name,
MyWalSnd->sync_standby_priority)));
}
4. The documentation still need more work to clearly explain the use case.
Understood. we will more work to clearly explain the use case.
5. Have we done any sort of stress testing of the patch ? If there is a bug,
the data corruption at the master can go unnoticed. So IMHO we need many
crash recovery tests to ensure that the patch is functionally correct.
I have done several tests of the patch, and I have confirmed that a
data page is not flushed to disk when the master server has not
received the reply from the standby server. I used pg_filedump.
To ensure that the patch is functionally correct, what test should we do?
Regards,
-------
Sawada Masahiko
On Tue, Oct 8, 2013 at 6:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote:
On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote:
On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
I attached the v12 patch which have modified based on above suggestions.
There are still some parts of this design/patch which I am concerned about.
1. The design clubs synchronous standby and failback safe standby rather
very tightly. IIRC this is based on the feedback you received early, so my
apologies for raising it again so late.
It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.
Thank you for comment.
We are thinking that this approach can solve the real problem.
We have actually confirmed the effect of this approach: the master
server flushes a data page to disk only after it has received the
reply from the standby server.
If you have a concern or doubt on the technical side, could you tell me
about it?
Regards,
-------
Sawada Masahiko
On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund <andres@2ndquadrant.com>wrote:
On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote:
On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com
wrote:On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:
I attached the v12 patch which have modified based on above
suggestions.
There are still some parts of this design/patch which I am concerned
about.
1. The design clubs synchronous standby and failback safe standby rather
very tightly. IIRC this is based on the feedback you received early, somy
apologies for raising it again so late.
It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.
Listing down all objections and their solutions:
Major Objection on the proposal:
*Tom Lane*
# Additional complexity to the code will cause performance overhead - on
average it causes 0.5 - 1% performance overhead for a fast-transaction
workload, as the wait is mostly in the backend process. The latest
refactored code looks less complex.
# Use of rsync with checksum - but many pages on the two servers may differ
in their binary values because of hint bits
*Heikki:*
# Use pg_rewind to do the same:
It has well known problem of hint bit updates.
If we use this we need enable checksums or explicitly WAL log hint bits
which leads to performance overhead
*Amit Kapila*
# How to take care of extra WAL on the old master during recovery?
We can solve this by deleting all WAL files on the old master before it
starts as the new standby.
*Simon Riggs*
# Renaming patch - done
# remove extra set of parameters - done
# performance drop - On an average it causes 0.5 - 1% performance overhead
for fast transaction workload, as the wait is mostly on backend process.
# The way of configuring standby - with synchronous_transfer parameter we
can configure 4 types of standby servers depending on the need.
*Fujii Masao*
# how patch interacts with cascaded standby - patch works same as
synchronous replication
# CHECKPOINT in the standby, it got stuck infinitely. - fixed this
# Complicated conditions in SyncRepWaitForLSN() – code has been refactored
in v11
# Improve source code comments - done
*Pavan Deolasee*
# Interaction of synchronous_commit with synchronous_transfer
- Now synchronous_commit only controls whether and how to wait for the
standby when a transaction commits. synchronous_transfer OTOH tells how
to interpret the standbys listed in synchronous_standby_names.
# Further Improvements in the documentation - we will do that
# More stress testing - we will do that
Any inputs on stress testing would help.
On Wed, Oct 9, 2013 at 4:54 AM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
The point is that when there are at least four senior community
members expressing serious objections to a concept, three of whom are
committers, we shouldn't be considering committing it until at least
some of those people have withdrawn their objections. Nearly all patch
submitters are in favor of their own patches; that does not entitle
them to have those patches committed, even if there is a committer who
agrees with them. There needs to be a real consensus on the path
forward. If that policy ever changes, I have my own list of things
that are on the cutting-room floor that I'll be happy to resurrect.
Personally, I don't have a strong opinion on this patch because I have
not followed it closely enough. But if Tom, Heikki, Simon, and Andres
are all unconvinced that this is a good direction, then put me down
for a -1 vote as well.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 8, 2013 at 9:22 PM, Heikki Linnakangas
<hlinnakangas@vmware.com>wrote:
Yeah, I definitely think we should work on the pg_rewind approach instead
of this patch. It's a lot more flexible. The performance hit of WAL-logging
hint bit updates is the price you have to pay, but a lot of people were OK
with that to get page checksum, so I think a lot of people would be OK with
it for this purpose too. As long as it's optional, of course. And anyone
using page checksums are already paying that price.
Not that I can find any flaw in the OP's patch, but given the major
objections and my own nervousness about documenting this new "failback
safe" standby mode, I am also inclined to improve pg_rewind or whatever it
takes to get it working. Clearly at first we need to have an optional
mechanism to WAL log hint bit updates. There seems to be two ways to do
that:
a. Add a new GUC which can be turned on/off and requires a server restart to
take effect
b. Add another option for wal_level setting.
(b) looks better, but I am not sure if we want to support this new level
with and without hot standby. If latter, we will need multiple new levels
to differentiate all those cases. I am OK with supporting it only with hot
standby which is probably what most people do with streaming replication
anyway.
The other issue is how to optimally WAL log hint bit updates:
a. Should we have separate WAL records just for the purpose or should we
piggyback them on heap update/delete/prune etc WAL records ? Of course,
there will be occasions when a simple SELECT also updates hint bits, so
most likely we will need a separate WAL record anyhow.
b. Does it make sense to try to set all hint bits in a page if we are WAL
logging it anyway? I think we have discussed this idea even before, just
to minimize the number of writes a heap page receives when hint bits of
different tuples are set at different times, each update triggering a fresh
write. I don't remember what the consensus was on that, but it might be
worthwhile to reconsider that option if we are WAL logging the hint bit
updates.
We will definitely need some amount of performance benchmarks even if this
is optional. But are there other things to worry about ? Any strong
objections to this idea or any other show stopper for pg_rewind itself ?
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Thu, Oct 10, 2013 at 1:41 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
I agree with you.
If writing FPW is not a large performance degradation, one idea is
to write FPW with the same timing as when checksums are enabled.
I.e., if we support a new wal_level, the system writes FPW when a simple
SELECT updates hint bits, but the checksum function stays disabled.
Thoughts?
We will definitely need some amount of performance benchmarks even if this
is optional. But are there other things to worry about ? Any strong
objections to this idea or any other show stopper for pg_rewind itself ?
--
Regards,
-------
Sawada Masahiko
On Mon, Oct 21, 2013 at 7:10 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote:
I agree with you.
If writing FPW is not a large performance degradation, one idea is
to write FPW with the same timing as when checksums are enabled.
I.e., if we support a new wal_level, the system writes FPW when a simple
SELECT updates hint bits, but the checksum function stays disabled.
Thoughts?
I wonder if it's too much for this purpose. In fact, we just need a way to
know that a block could have been written on the master which the standby
never saw. So even WAL logging just the block id should be good enough for
pg_rewind to be able to detect and later copy that block from the new
master. Having said that, I don't know if there is general advantage of WAL
logging the exact hint bit update operation for other reasons.
Another difference AFAICS is that checksum feature needs the block to be
backed up only after the first time a hint bit is updated after checkpoint.
But for something like pg_rewind to work, we will need to WAL log every
hint bit update on a page. So we would want to keep it as short as possible.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 24.10.2013 13:02, Pavan Deolasee wrote:
Another difference AFAICS is that checksum feature needs the block to be
backed up only after the first time a hint bit is updated after checkpoint.
But for something like pg_rewind to work, we will need to WAL log every
hint bit update on a page. So we would want to keep it as short as possible.
To fix that, pg_rewind could always start the rewinding process from the
last checkpoint before the point that the histories diverge, instead of
the exact point of divergence. That would make the rewinding more
expensive as it needs to read through a lot more WAL, but I think it
would still be OK.
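Heikki's suggestion above amounts to a small selection step: instead of starting the scan at the exact divergence LSN, back up to the newest checkpoint at or before it. The following is an illustrative standalone sketch with invented names, not pg_rewind's actual code:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: given the LSNs of checkpoint records on the old
 * master's timeline and the LSN where the two histories diverged, pick the
 * starting point for the rewind scan -- the last checkpoint at or before
 * the divergence point, not the divergence point itself.  That way the scan
 * also sees hint-bit records that are only WAL-logged once per checkpoint
 * cycle.
 */
static uint64_t
rewind_scan_start(const uint64_t *ckpt_lsns, size_t nckpts,
                  uint64_t divergence_lsn)
{
    uint64_t start = 0;         /* fall back to the very beginning */

    for (size_t i = 0; i < nckpts; i++)
    {
        /* keep the newest checkpoint that is not past the divergence */
        if (ckpt_lsns[i] <= divergence_lsn && ckpt_lsns[i] > start)
            start = ckpt_lsns[i];
    }
    return start;
}
```

The cost Heikki mentions comes from the gap between `start` and the divergence LSN: all WAL in between must be read through as well.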
- Heikki
On Thu, Oct 24, 2013 at 4:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
To fix that, pg_rewind could always start the rewinding process from the
last checkpoint before the point that the histories diverge, instead of the
exact point of divergence.
Is that something required even if someone plans to use pg_rewind for a
cluster with checksums enabled ? I mean since only first update after
checkpoint is WAL logged, pg_rewind will break if another update happens
after standby forks. Or would the recovery logic apply first WAL without
looking at the page lsn ? (Sorry, may be I should read the code instead of
asking you)
If we do what you are suggesting, it seems like a single line patch to me.
In XLogSaveBufferForHint(), we probably need to look at this additional GUC
to decide whether or not to backup the block.
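As a rough illustration of the one-line change described above, the backup decision in XLogSaveBufferForHint() would become a disjunction of the checksum check and the new GUC. The names below (a `wal_log_hints` GUC, `DataChecksumsEnabled` as a plain flag) are stand-ins for illustration, not the actual PostgreSQL source:

```c
#include <stdbool.h>

/*
 * Sketch only -- invented names, not the real code.  The idea: back up the
 * block on a hint-bit update not only when data checksums are enabled, but
 * also when a (hypothetical) GUC asks for it, so that pg_rewind can later
 * detect every block the old master may have dirtied.
 */
static bool DataChecksumsEnabled = false;   /* stand-in for the real check */
static bool wal_log_hints = false;          /* the proposed GUC */

static bool
hint_bit_update_needs_backup(void)
{
    return DataChecksumsEnabled || wal_log_hints;
}
```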
That would make the rewinding more expensive as it needs to read through a
lot more WAL, but I think it would still be OK.
Yeah, probably you are right. Though the amount of additional work could be
significantly higher and some testing might be warranted.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Thu, Oct 24, 2013 at 4:45 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
Or would the recovery logic apply first WAL without looking at the page
lsn ? (Sorry, may be I should read the code instead of asking you)
Never mind. I realized it has to. That's the whole purpose of backing it up
in the first place.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 24.10.2013 14:15, Pavan Deolasee wrote:
On Thu, Oct 24, 2013 at 4:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
To fix that, pg_rewind could always start the rewinding process from the
last checkpoint before the point that the histories diverge, instead of the
exact point of divergence.
Is that something required even if someone plans to use pg_rewind for a
cluster with checksums enabled ? I mean since only first update after
checkpoint is WAL logged, pg_rewind will break if another update happens
after standby forks.
Yes. It's broken as it is, even when checksums are enabled - good catch.
I'll go change it to read all the WAL in the target starting from the
last checkpoint before the point of divergence.
Or would the recovery logic apply first WAL without
looking at the page lsn ? (Sorry, may be I should read the code instead of
asking you)
WAL recovery does apply all the full-page images without looking at the
page LSN, but that doesn't help in this case. pg_rewind copies over the
blocks from the source server (= promoted standby) that were changed in
the target server (= old master), after the standby's history diverged
from it. In other words, it reverts the blocks that were changed in the
old master, by copying them over from the promoted standby. After that,
WAL recovery is performed, using the WAL from the promoted standby, to
apply all the changes from the promoted standby that were not present in
the old master. But it never replays any WAL from the old master. It
reads it through, to construct the list of blocks that were modified,
but it doesn't apply them.
If we do what you are suggesting, it seems like a single line patch to me.
In XLogSaveBufferForHint(), we probably need to look at this additional GUC
to decide whether or not to backup the block.
Yeah, it's trivial to add such a guc. Will just have to figure out what
we want the user interface to be like; should it be a separate guc, or
somehow cram it into wal_level?
- Heikki
On Thu, Oct 24, 2013 at 5:45 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Will just have to figure out what we want the user interface to be like;
should it be a separate guc, or somehow cram it into wal_level?
Yeah, I had brought up a similar idea upthread. Right now wal_level is
nicely ordered. But with this additional logic, I am not sure if we would
need multiple new levels and also break that ordering (I don't know if its
important). For example, one may want to set up streaming replication
with/without this feature or hot standby with/without the feature. I don't
have a good idea about how to capture them in wal_level. May be something
like: minimal, archive, archive_with_this_new_feature, hot_standby and
hot_standby_with_this_new_feature.
Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
Pavan Deolasee escribió:
Yeah, I had brought up a similar idea upthread. Right now wal_level is
nicely ordered. But with this additional logic, I am not sure if we would
need multiple new levels and also break that ordering (I don't know if its
important). For example, one may want to set up streaming replication
with/without this feature or hot standby with/without the feature. I don't
have a good idea about how to capture them in wal_level. May be something
like: minimal, archive, archive_with_this_new_feature, hot_standby and
hot_standby_with_this_new_feature.
That's confusing. A separate GUC sounds better.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
If we do what you are suggesting, it seems like a single line patch to me.
In XLogSaveBufferForHint(), we probably need to look at this additional GUC
to decide whether or not to backup the block.
Wait, what? Why are we having an additional GUC?
I'm opposed to the idea of having a GUC to enable failback. When would
anyone using replication ever want to disable that?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 24.10.2013 20:39, Josh Berkus wrote:
On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
If we do what you are suggesting, it seems like a single line patch to me.
In XLogSaveBufferForHint(), we probably need to look at this additional GUC
to decide whether or not to backup the block.
Wait, what? Why are we having an additional GUC?
I'm opposed to the idea of having a GUC to enable failback. When would
anyone using replication ever want to disable that?
For example, if you're not replicating for high availability purposes,
but to keep a reporting standby up-to-date.
- Heikki
On 10/24/2013 11:12 AM, Heikki Linnakangas wrote:
On 24.10.2013 20:39, Josh Berkus wrote:
On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
If we do what you are suggesting, it seems like a single line patch
to me.
In XLogSaveBufferForHint(), we probably need to look at this
additional GUC
to decide whether or not to backup the block.
Wait, what? Why are we having an additional GUC?
I'm opposed to the idea of having a GUC to enable failback. When would
anyone using replication ever want to disable that?
For example, if you're not replicating for high availability purposes,
but to keep a reporting standby up-to-date.
What kind of overhead are we talking about here? You probably said, but
I've had a mail client meltdown and lost a lot of my -hackers emails.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 24.10.2013 23:07, Josh Berkus wrote:
On 10/24/2013 11:12 AM, Heikki Linnakangas wrote:
On 24.10.2013 20:39, Josh Berkus wrote:
On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
If we do what you are suggesting, it seems like a single line patch
to me.
In XLogSaveBufferForHint(), we probably need to look at this
additional GUC
to decide whether or not to backup the block.
Wait, what? Why are we having an additional GUC?
I'm opposed to the idea of having a GUC to enable failback. When would
anyone using replication ever want to disable that?
For example, if you're not replicating for high availability purposes,
but to keep a reporting standby up-to-date.
What kind of overhead are we talking about here? You probably said, but
I've had a mail client meltdown and lost a lot of my -hackers emails.
One extra WAL record whenever a hint bit is set on a page, for the first
time after a checkpoint. In other words, a WAL record needs to be
written in the same circumstances as with page checksums, but the WAL
records are much smaller as they don't need to contain a full page
image, just the block number of the changed block.
Or maybe we'll write the full page image after all, like with page
checksums, just without calculating the checksums. It might be tricky to
skip the full-page image, because then a subsequent change of the page
(which isn't just a hint-bit update) needs to somehow know it needs to
take a full page image even though a WAL record for it was already written.
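To put rough numbers on the size difference described here, compare the payload of a record that only identifies the changed block with one that carries a full-page image. The struct names are invented for illustration; the real WAL record layout differs:

```c
#include <stdint.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

/*
 * Illustrative only: a hint-bit record that carries just the block's
 * identity is a handful of bytes, while one that carries a full-page image
 * (as with page checksums) is dominated by BLCKSZ.
 */
typedef struct HintBlockRef
{
    uint32_t spcNode;           /* tablespace */
    uint32_t dbNode;            /* database */
    uint32_t relNode;           /* relation */
    uint32_t blkno;             /* block number within the relation */
} HintBlockRef;                 /* 16 bytes of payload */

typedef struct HintFullPage
{
    HintBlockRef ref;
    char         page[BLCKSZ];  /* the full-page image */
} HintFullPage;                 /* ~8 KB of payload */
```

So a block-number-only record is roughly 500x smaller per hinted page, which is why Heikki wants to avoid the full-page image if the bookkeeping problem he describes can be solved.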
- Heikki
On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
One extra WAL record whenever a hint bit is set on a page, for the first
time after a checkpoint. In other words, a WAL record needs to be
written in the same circumstances as with page checksums, but the WAL
records are much smaller as they don't need to contain a full page
image, just the block number of the changed block.
Or maybe we'll write the full page image after all, like with page
checksums, just without calculating the checksums. It might be tricky to
skip the full-page image, because then a subsequent change of the page
(which isn't just a hint-bit update) needs to somehow know it needs to
take a full page image even though a WAL record for it was already written.
I think it would be worth estimating what this actually looks like in
terms of log write quantity. My inclination is to say that if it
increases log writes less than 10%, we don't need to provide an option
to turn it off.
The reasons I don't want to provide a disabling GUC are:
a) more GUCs
b) confusing users
c) causing users to disable rewind *until they need it*, at which point
it's too late to enable it.
So if there's any way we can avoid having a GUC for this, I'm for it.
And if we do have a GUC, failback should be enabled by default.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Thu, Oct 24, 2013 at 10:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
One extra WAL record whenever a hint bit is set on a page, for the first
time after a checkpoint. In other words, a WAL record needs to be
written in the same circumstances as with page checksums, but the WAL
records are much smaller as they don't need to contain a full page
image, just the block number of the changed block.
Or maybe we'll write the full page image after all, like with page
checksums, just without calculating the checksums. It might be tricky to
skip the full-page image, because then a subsequent change of the page
(which isn't just a hint-bit update) needs to somehow know it needs to
take a full page image even though a WAL record for it was already written.
I think it would be worth estimating what this actually looks like in
terms of log write quantity. My inclination is to say that if it
increases log writes less than 10%, we don't need to provide an option
to turn it off.
The reasons I don't want to provide a disabling GUC are:
a) more GUCs
b) confusing users
c) causing users to disable rewind *until they need it*, at which point
it's too late to enable it.
So if there's any way we can avoid having a GUC for this, I'm for it.
And if we do have a GUC, failback should be enabled by default.
+1 on the principle.
In fact I've been considering suggesting we might want to retire the
difference between archive and hot_standby as wal_level, because the
difference is usually so small. And the advantage of hot_standby is in
almost every case worth it. Even in the archive recovery mode, being
able to do pause_at_recovery_target is extremely useful. And as you
say in (c) above, many users don't realize that until it's too late.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Fri, Oct 25, 2013 at 5:57 AM, Magnus Hagander <magnus@hagander.net> wrote:
On Thu, Oct 24, 2013 at 10:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
I think it would be worth estimating what this actually looks like in
terms of log write quantity. My inclination is to say that if it
increases log writes less than 10%, we don't need to provide an option
to turn it off.
The reasons I don't want to provide a disabling GUC are:
a) more GUCs
b) confusing users
c) causing users to disable rewind *until they need it*, at which point
it's too late to enable it.
So if there's any way we can avoid having a GUC for this, I'm for it.
And if we do have a GUC, failback should be enabled by default.
+1 on the principle.
In fact I've been considering suggesting we might want to retire the
difference between archive and hot_standby as wal_level, because the
difference is usually so small. And the advantage of hot_standby is in
almost every case worth it. Even in the archive recovery mode, being
able to do pause_at_recovery_target is extremely useful. And as you
say in (c) above, many users don't realize that until it's too late.
+1.
Many users would not realize it until it's too late if we provide it
as an additional GUC.
And I agree with writing a WAL record that includes the block number of
the changed block; I don't think writing it leads to a huge overhead increase.
Are those WAL records replicated to the standby synchronously (when
sync replication is configured, of course)?
I am concerned that it could add performance overhead to operations such
as SELECT or autovacuum, especially when the two servers are in
distant locations.
Regards,
-------
Sawada Masahiko
On Fri, Oct 25, 2013 at 5:57 AM, Magnus Hagander <magnus@hagander.net> wrote:
In fact I've been considering suggesting we might want to retire the
difference between archive and hot_standby as wal_level, because the
difference is usually so small. And the advantage of hot_standby is in
almost every case worth it. Even in the archive recovery mode, being
able to do pause_at_recovery_target is extremely useful. And as you
say in (c) above, many users don't realize that until it's too late.
+1 on removing archive from wal_level. Having both archive and
hot_standby for wal_level is confusing, and if I recall correctly
hot_standby and archive have been kept as possible settings only to
protect people from bugs that the newly-introduced hot_standby could
introduce due to the few WAL records it adds. But it has been a couple
of releases since there have been no such bugs, no?
--
Michael
On 2013-10-24 13:51:52 -0700, Josh Berkus wrote:
On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
One extra WAL record whenever a hint bit is set on a page, for the first
time after a checkpoint. In other words, a WAL record needs to be
written in the same circumstances as with page checksums, but the WAL
records are much smaller as they don't need to contain a full page
image, just the block number of the changed block.
Or maybe we'll write the full page image after all, like with page
checksums, just without calculating the checksums. It might be tricky to
skip the full-page image, because then a subsequent change of the page
(which isn't just a hint-bit update) needs to somehow know it needs to
take a full page image even though a WAL record for it was already written.
I think it would be worth estimating what this actually looks like in
terms of log write quantity. My inclination is to say that if it
increases log writes less than 10%, we don't need to provide an option
to turn it off.
It entirely depends on your workload. If it happens to be something
like:
INSERT INTO table (lots_of_data);
CHECKPOINT;
SELECT * FROM TABLE;
i.e. there's a checkpoint between loading the data and reading it - not
exactly all that uncommon - we'll need to log something for every
page. That can be rather noticeable. Especially as I think it will be
rather hard to log anything but a real FPI.
I really don't think everyone will want this. I am absolutely not
against providing an option to log enough information to make pg_rewind
work, but I think providing a command to do *safe* *planned* failover
will help in many more cases.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-24 22:57:29 +0200, Magnus Hagander wrote:
In fact I've been considering suggesting we might want to retire the
difference between archive and hot_standby as wal_level, because the
difference is usually so small. And the advantage of hot_standby is in
almost every case worth it. Even in the archive recovery mode, being
able to do pause_at_recovery_target is extremely useful. And as you
say in (c) above, many users don't realize that until it's too late.
+1.
On 2013-10-25 15:16:30 +0900, Michael Paquier wrote:
But it has been a couple of releases since there have been no such
bugs, no?
One 'no' too much? Anyway, I think there have been more recent ones, but
it's infrequent enough that we can remove the level anyway.
FWIW, I've wondered if we shouldn't remove most of the EnableHotStandby
checks in xlog.c. There are way too many difference how StartupXLOG
behaves depending on HS.
E.g. I quite dislike that we do stuff like StartupCLOG at entirely
different times during recovery.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Oct 25, 2013 at 8:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-10-24 13:51:52 -0700, Josh Berkus wrote:
It entirely depends on your workload. If it happens to be something
like:
INSERT INTO table (lots_of_data);
CHECKPOINT;
SELECT * FROM TABLE;
i.e. there's a checkpoint between loading the data and reading it - not
exactly all that uncommon - we'll need to log something for every
page. That can be rather noticeable. Especially as I think it will be
rather hard to log anything but a real FPI.
I really don't think everyone will want this. I am absolutely not
against providing an option to log enough information to make pg_rewind
work, but I think providing a command to do *safe* *planned* failover
will help in many more cases.
I think it is better to provide an option to log enough information,
such as a new wal_level.
For users who don't realize it until it's too late, could such information
be contained in the checkpoint record? For example, if the checkpoint
record contained the wal_level, we could use that to inform the user.
BTW, is this information useful only for pg_rewind, or is there
anything else it could serve?
(Sorry, it might have already been discussed.)
Regards,
-------
Sawada Masahiko
On Thu, Oct 24, 2013 at 11:14:14PM +0300, Heikki Linnakangas wrote:
On 24.10.2013 23:07, Josh Berkus wrote:
On 10/24/2013 11:12 AM, Heikki Linnakangas wrote:
On 24.10.2013 20:39, Josh Berkus wrote:
On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
If we do what you are suggesting, it seems like a single line patch
to me.
In XLogSaveBufferForHint(), we probably need to look at this
additional GUC
to decide whether or not to backup the block.
Wait, what? Why are we having an additional GUC?
I'm opposed to the idea of having a GUC to enable failback. When would
anyone using replication ever want to disable that?
For example, if you're not replicating for high availability purposes,
but to keep a reporting standby up-to-date.
What kind of overhead are we talking about here? You probably said, but
I've had a mail client meltdown and lost a lot of my -hackers emails.
One extra WAL record whenever a hint bit is set on a page, for the
first time after a checkpoint. In other words, a WAL record needs to
be written in the same circumstances as with page checksums, but the
WAL records are much smaller as they don't need to contain a full
page image, just the block number of the changed block.
Or maybe we'll write the full page image after all, like with page
checksums, just without calculating the checksums. It might be
tricky to skip the full-page image, because then a subsequent change
of the page (which isn't just a hint-bit update) needs to somehow
know it needs to take a full page image even though a WAL record for
it was already written.
Sorry to be replying late to this, but while I am not worried about the
additional WAL volume, does this change require the transaction to now
wait for a WAL sync to disk before continuing? I thought that was the
down-side to WAL logging hint bits, not the WAL volume itself.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Bruce Momjian escribió:
On Thu, Oct 24, 2013 at 11:14:14PM +0300, Heikki Linnakangas wrote:
On 24.10.2013 23:07, Josh Berkus wrote:
What kind of overhead are we talking about here?
One extra WAL record whenever a hint bit is set on a page, for the
first time after a checkpoint. In other words, a WAL record needs to
be written in the same circumstances as with page checksums, but the
WAL records are much smaller as they don't need to contain a full
page image, just the block number of the changed block.
Or maybe we'll write the full page image after all, like with page
checksums, just without calculating the checksums. It might be
tricky to skip the full-page image, because then a subsequent change
of the page (which isn't just a hint-bit update) needs to somehow
know it needs to take a full page image even though a WAL record for
it was already written.
Sorry to be replying late to this, but while I am not worried about the
additional WAL volume, does this change require the transaction to now
wait for a WAL sync to disk before continuing?
I don't think so. There's extra WAL written, but there's no
flush-and-wait until end of transaction (as has always been).
I thought that was the down-side to WAL logging hint bits, not the WAL
volume itself.
I don't think this is true either.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 21, 2013 at 12:31 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Bruce Momjian escribió:
On Thu, Oct 24, 2013 at 11:14:14PM +0300, Heikki Linnakangas wrote:
On 24.10.2013 23:07, Josh Berkus wrote:
What kind of overhead are we talking about here?
One extra WAL record whenever a hint bit is set on a page, for the
first time after a checkpoint. In other words, a WAL record needs to
be written in the same circumstances as with page checksums, but the
WAL records are much smaller as they don't need to contain a full
page image, just the block number of the changed block.
Or maybe we'll write the full page image after all, like with page
checksums, just without calculating the checksums. It might be
tricky to skip the full-page image, because then a subsequent change
of the page (which isn't just a hint-bit update) needs to somehow
know it needs to take a full page image even though a WAL record for
it was already written.
Sorry to be replying late to this, but while I am not worried about the
additional WAL volume, does this change require the transaction to now
wait for a WAL sync to disk before continuing?
I don't think so. There's extra WAL written, but there's no
flush-and-wait until end of transaction (as has always been).
But if the transaction would not have otherwise generated WAL (i.e. a
select that did not have to do any HOT pruning, or an update with zero rows
matching the where condition), doesn't it now have to flush and wait when
it would otherwise not?
Cheers,
Jeff
On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
But if the transaction would not have otherwise generated WAL (i.e. a
select that did not have to do any HOT pruning, or an update with zero rows
matching the where condition), doesn't it now have to flush and wait when
it would otherwise not?
We short circuit that if there's no xid assigned. Check
RecordTransactionCommit().
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 21, 2013 at 11:43:34PM +0100, Andres Freund wrote:
On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
But if the transaction would not have otherwise generated WAL (i.e. a
select that did not have to do any HOT pruning, or an update with zero rows
matching the where condition), doesn't it now have to flush and wait when
it would otherwise not?
We short circuit that if there's no xid assigned. Check
RecordTransactionCommit().
OK, that was my question, now answered. Thanks.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Thu, Nov 21, 2013 at 2:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
But if the transaction would not have otherwise generated WAL (i.e. a
select that did not have to do any HOT pruning, or an update with zero rows
matching the where condition), doesn't it now have to flush and wait when
it would otherwise not?
We short circuit that if there's no xid assigned. Check
RecordTransactionCommit().
It looks like that only short-circuits the flush if there is both no xid
assigned and !wrote_xlog (line 1054 of xact.c).
I do see stalls on fdatasync on flush from select statements which had no
xid, but did generate xlog due to HOT pruning, I don't see why WAL logging
hint bits would be different.
Cheers,
Jeff
On 2014-01-16 09:25:51 -0800, Jeff Janes wrote:
On Thu, Nov 21, 2013 at 2:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
But if the transaction would not have otherwise generated WAL (i.e. a
select that did not have to do any HOT pruning, or an update with zero rows
matching the where condition), doesn't it now have to flush and wait when
it would otherwise not?
We short circuit that if there's no xid assigned. Check
RecordTransactionCommit().
It looks like that only short-circuits the flush if there is both no xid
assigned and !wrote_xlog (line 1054 of xact.c).
Hm. Indeed. Why don't we just always use the async commit behaviour for
that? I don't really see any significant dangers from doing so?
It's also rather odd to use the sync rep mechanisms in such
scenarios... The if() really should test markXidCommitted instead of
wrote_xlog.
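The two conditions being contrasted can be written out as a simplified sketch (not the real xact.c logic; function names invented for illustration):

```c
#include <stdbool.h>

/*
 * Current behaviour (simplified): the synchronous flush at commit is
 * skipped only when the transaction neither assigned an xid nor wrote any
 * WAL -- so hint/pruning WAL alone triggers a flush-and-wait.
 */
static bool
must_flush_current(bool markXidCommitted, bool wrote_xlog)
{
    return markXidCommitted || wrote_xlog;
}

/*
 * Andres' suggestion (simplified): test only whether the commit record
 * itself must be durable; read-only transactions that merely emitted
 * hint-bit or pruning WAL take the async-commit path instead of waiting.
 */
static bool
must_flush_proposed(bool markXidCommitted, bool wrote_xlog)
{
    (void) wrote_xlog;          /* intentionally ignored */
    return markXidCommitted;
}
```

The interesting case is `markXidCommitted == false` with `wrote_xlog == true` (e.g. a SELECT that did HOT pruning): the current test flushes, the proposed one does not.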
I do see stalls on fdatasync on flush from select statements which had no
xid, but did generate xlog due to HOT pruning, I don't see why WAL logging
hint bits would be different.
Are the stalls at commit or while the select is running? If wal_buffers
is filled too fast, which can easily happen if loads of pages are hinted
and wal logged, that will happen independently from
RecordTransactionCommit().
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jan 16, 2014 at 9:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-01-16 09:25:51 -0800, Jeff Janes wrote:
On Thu, Nov 21, 2013 at 2:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
But if the transaction would not have otherwise generated WAL (i.e. a
select that did not have to do any HOT pruning, or an update with zero rows
matching the where condition), doesn't it now have to flush and wait
when it would otherwise not?
We short circuit that if there's no xid assigned. Check
RecordTransactionCommit().
It looks like that only short-circuits the flush if there is both no xid
assigned and !wrote_xlog (line 1054 of xact.c).
Hm. Indeed. Why don't we just always use the async commit behaviour for
that? I don't really see any significant dangers from doing so?
I think the argument is that drawing the next value from a sequence can
generate xlog that needs to be flushed, but doesn't assign an xid.
I would think the sequence should flush that record before it hands out the
value, not before the commit, but...
It's also rather odd to use the sync rep mechanisms in such
scenarios... The if() really should test markXidCommitted instead of
wrote_xlog.
I do see stalls on fdatasync on flush from select statements which had no
xid, but did generate xlog due to HOT pruning; I don't see why WAL logging
hint bits would be different.
Are the stalls at commit or while the select is running? If wal_buffers
is filled too fast, which can easily happen if loads of pages are hinted
and wal logged, that will happen independently from
RecordTransactionCommit().
In the real world, I'm not sure what the distribution is.
But in my present test case, they are coming almost exclusively from
RecordTransactionCommit.
I use "pgbench -T10" in a loop to generate dirty data and checkpoints (with
synchronous_commit on but with a BBU), and then to probe the consequences I
use:
pgbench -T10 -S -n --startup='set synchronous_commit='$f
(where --startup is an extension to pgbench proposed a few months ago)
Running the select-only query with synchronous_commit off almost completely
isolates it from the checkpoint drama that otherwise has a massive effect
on it. With synchronous_commit=on, it goes from 6000 tps normally to 30
tps during the checkpoint sync; with synchronous_commit=off it might dip to
4000 or so during the worst of it.
(To be clear, this is about the pruning, not the logging of the hint bits)
Cheers,
Jeff
On 2014-01-16 11:01:29 -0800, Jeff Janes wrote:
I think the argument is that drawing the next value from a sequence can
generate xlog that needs to be flushed, but doesn't assign an xid.
Then that should assign an xid. Which would yield correct behaviour with
async commit where it's currently *not* causing a WAL flush at all
unless a page boundary is crossed.
I've tried arguing that way before...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Jeff Janes <jeff.janes@gmail.com> writes:
I think the argument is that drawing the next value from a sequence can
generate xlog that needs to be flushed, but doesn't assign an xid.
I would think the sequence should flush that record before it hands out the
value, not before the commit, but...
IIRC the argument was that we'd flush WAL before any use of the value
could make it to disk. Which is true if you're just inserting it into
a table; perhaps less so if the client is doing something external to
the database with it. (But it'd be reasonable to say that clients
who want a guaranteed-good serial for such purposes should have to
commit the transaction that created the value.)
regards, tom lane