Cascading replication and recovery_target_timeline='latest'
When a cascading standby launches a new walsender, it fetches the
current recovery timeline:
/*
* Use the recovery target timeline ID during recovery
*/
if (am_cascading_walsender)
ThisTimeLineID = GetRecoveryTargetTLI();
GetRecoveryTargetTLI() does this:
/* RecoveryTargetTLI doesn't change so we need no lock to copy it */
return XLogCtl->RecoveryTargetTLI;
That comment is not true. RecoveryTargetTLI can change during recovery,
if you set recovery_target_timeline='latest'. In 'latest' mode, when the
(apparent) end of WAL is reached, the archive is scanned for any new
timeline history files that may have appeared. If a new timeline is
found, RecoveryTargetTLI is updated, and recovery is continued on the
new timeline.
Aside from the missing locking, I wonder what that does to a cascaded
standby. If there is an active walsender running while RecoveryTargetTLI
is changed, I think what will happen is that the walsender will continue
to stream WAL from the old timeline, but because the startup process is
now actually replaying from a different timeline, the walsender will
send bogus WAL to the standby.
When a standby ends recovery, creates a new timeline, and switches to
normal operation, postmaster terminates all walsenders because of the
timeline change. But don't we have a race condition there, with similar
effect? It might take a while for a walsender to die, and in that
window, it might send bogus WAL to the cascaded standby.
- Heikki
On Fri, Aug 31, 2012 at 5:03 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
When a cascading standby launches a new walsender, it fetches the current
recovery timeline:

/*
 * Use the recovery target timeline ID during recovery
 */
if (am_cascading_walsender)
	ThisTimeLineID = GetRecoveryTargetTLI();

GetRecoveryTargetTLI() does this:

/* RecoveryTargetTLI doesn't change so we need no lock to copy it */
return XLogCtl->RecoveryTargetTLI;

That comment is not true. RecoveryTargetTLI can change during recovery, if
you set recovery_target_timeline='latest'. In 'latest' mode, when the
(apparent) end of WAL is reached, the archive is scanned for any new
timeline history files that may have appeared. If a new timeline is found,
RecoveryTargetTLI is updated, and recovery is continued on the new timeline.
Right. We need a lock there for now.
Aside from the missing locking, I wonder what that does to a cascaded
standby. If there is an active walsender running while RecoveryTargetTLI is
changed, I think what will happen is that the walsender will continue to
stream WAL from the old timeline, but because the startup process is now
actually replaying from a different timeline, the walsender will send bogus
WAL to the standby.
Good catch! That's a real problem. To address it, should we terminate
all cascading walsenders when a timeline history file is read and
the recovery target timeline is changed?
When a standby ends recovery, creates a new timeline, and switches to normal
operation, postmaster terminates all walsenders because of the timeline
change. But don't we have a race condition there, with similar effect? It
might take a while for a walsender to die, and in that window, it might send
bogus WAL to the cascaded standby.
I don't think so. The cascading walsender can send only WAL that the standby
already has safely on disk, i.e., up to the furthest of replay_location,
receive_location, and restore_location. IOW, it sends only replayed,
received (i.e., streamed), and restored WAL. Those WAL files belong to the
old timeline, so ISTM that the cascading walsender cannot send any
new-timeline WAL files.
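For reference, that limit is currently computed in GetStandbyFlushRecPtr()
roughly like this (a simplified copy of the existing code, for
illustration):

XLogRecPtr
GetStandbyFlushRecPtr(void)
{
	XLogRecPtr	receivePtr;	/* received (streamed) and fsync'd */
	XLogRecPtr	replayPtr;	/* replayed by the startup process */
	XLogRecPtr	restorePtr;	/* end of last segment restored from archive */

	receivePtr = GetWalRcvWriteRecPtr(NULL);
	replayPtr = GetXLogReplayRecPtr(&restorePtr);

	/* return the furthest of the three positions */
	if (XLByteLT(receivePtr, replayPtr))
		return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
	else
		return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;
}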
Regards,
--
Fujii Masao
On Sat, Sep 1, 2012 at 2:32 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Aug 31, 2012 at 5:03 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
When a cascading standby launches a new walsender, it fetches the current
recovery timeline:

/*
 * Use the recovery target timeline ID during recovery
 */
if (am_cascading_walsender)
	ThisTimeLineID = GetRecoveryTargetTLI();

GetRecoveryTargetTLI() does this:

/* RecoveryTargetTLI doesn't change so we need no lock to copy it */
return XLogCtl->RecoveryTargetTLI;

That comment is not true. RecoveryTargetTLI can change during recovery, if
you set recovery_target_timeline='latest'. In 'latest' mode, when the
(apparent) end of WAL is reached, the archive is scanned for any new
timeline history files that may have appeared. If a new timeline is found,
RecoveryTargetTLI is updated, and recovery is continued on the new timeline.

Right. We need a lock there for now.

Aside from the missing locking, I wonder what that does to a cascaded
standby. If there is an active walsender running while RecoveryTargetTLI is
changed, I think what will happen is that the walsender will continue to
stream WAL from the old timeline, but because the startup process is now
actually replaying from a different timeline, the walsender will send bogus
WAL to the standby.

Good catch! That's a real problem. To address it, should we terminate
all cascading walsenders when a timeline history file is read and
the recovery target timeline is changed?
That's not the right fix, though. After the cascading walsenders are told
to terminate, it might take them some time to exit, and during that window
they might still send bogus WAL from the old timeline. Currently there is
no safeguard against sending bogus WAL from the old timeline. To implement
such a safeguard, a cascading walsender would need to know when the
timeline is updated and which is the last valid WAL file of the old
timeline, as the startup process does. IOW, we would need to change
cascading walsenders so that they also read and understand the timeline
history files. That is not an easy fix at this stage (9.2.0 is about to be
released...).

So, as one idea, I'm thinking of just forbidding cascading replication when
recovery_target_timeline is set to 'latest'. Thoughts?
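Just to illustrate, the check could be as small as this at walsender
startup (an untested sketch; GetRecoveryTargetIsLatest() is a hypothetical
accessor, since recoveryTargetIsLatest is currently local to xlog.c):

	/* hypothetical: refuse to cascade when following the latest timeline */
	if (am_cascading_walsender && GetRecoveryTargetIsLatest())
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("cascading replication is not supported when recovery_target_timeline is set to \"latest\"")));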
Regards,
--
Fujii Masao
On 03.09.2012 10:43, Fujii Masao wrote:
On Sat, Sep 1, 2012 at 2:32 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
On Fri, Aug 31, 2012 at 5:03 PM, Heikki Linnakangas<hlinnaka@iki.fi> wrote:
Aside from the missing locking, I wonder what that does to a cascaded
standby. If there is an active walsender running while RecoveryTargetTLI is
changed, I think what will happen is that the walsender will continue to
stream WAL from the old timeline, but because the startup process is now
actually replaying from a different timeline, the walsender will send bogus
WAL to the standby.

Good catch! That's a real problem. To address it, should we terminate
all cascading walsenders when a timeline history file is read and
the recovery target timeline is changed?

That's not the right fix, though. After the cascading walsenders are told
to terminate, it might take them some time to exit, and during that window
they might still send bogus WAL from the old timeline. Currently there is
no safeguard against sending bogus WAL from the old timeline. To implement
such a safeguard, a cascading walsender would need to know when the
timeline is updated and which is the last valid WAL file of the old
timeline, as the startup process does. IOW, we would need to change
cascading walsenders so that they also read and understand the timeline
history files. That is not an easy fix at this stage (9.2.0 is about to be
released...).

So, as one idea, I'm thinking of just forbidding cascading replication when
recovery_target_timeline is set to 'latest'. Thoughts?
Hmm, I was thinking that when the walsender gets the position it can send
the WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close the
window between requesting walsenders to terminate and them actually
terminating.
- Heikki
On Tue, Sep 4, 2012 at 7:07 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 03.09.2012 10:43, Fujii Masao wrote:
On Sat, Sep 1, 2012 at 2:32 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
On Fri, Aug 31, 2012 at 5:03 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Aside from the missing locking, I wonder what that does to a cascaded
standby. If there is an active walsender running while RecoveryTargetTLI is
changed, I think what will happen is that the walsender will continue to
stream WAL from the old timeline, but because the startup process is now
actually replaying from a different timeline, the walsender will send bogus
WAL to the standby.

Good catch! That's a real problem. To address it, should we terminate
all cascading walsenders when a timeline history file is read and
the recovery target timeline is changed?

That's not the right fix, though. After the cascading walsenders are told
to terminate, it might take them some time to exit, and during that window
they might still send bogus WAL from the old timeline. Currently there is
no safeguard against sending bogus WAL from the old timeline. To implement
such a safeguard, a cascading walsender would need to know when the
timeline is updated and which is the last valid WAL file of the old
timeline, as the startup process does. IOW, we would need to change
cascading walsenders so that they also read and understand the timeline
history files. That is not an easy fix at this stage (9.2.0 is about to be
released...).

So, as one idea, I'm thinking of just forbidding cascading replication when
recovery_target_timeline is set to 'latest'. Thoughts?

Hmm, I was thinking that when the walsender gets the position it can send
the WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close the
window between requesting walsenders to terminate and them actually
terminating.
Yeah, sounds good. Could you implement the patch? If you don't have time,
I will....
Regards,
--
Fujii Masao
On 03.09.2012 16:25, Fujii Masao wrote:
On Tue, Sep 4, 2012 at 7:07 AM, Heikki Linnakangas<hlinnaka@iki.fi> wrote:
Hmm, I was thinking that when the walsender gets the position it can send
the WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close the
window between requesting walsenders to terminate and them actually
terminating.

Yeah, sounds good. Could you implement the patch? If you don't have time,
I will....
I'll give it a shot..
- Heikki
On 03.09.2012 16:26, Heikki Linnakangas wrote:
On 03.09.2012 16:25, Fujii Masao wrote:
On Tue, Sep 4, 2012 at 7:07 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm, I was thinking that when the walsender gets the position it can send
the WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close the
window between requesting walsenders to terminate and them actually
terminating.

Yeah, sounds good. Could you implement the patch? If you don't have time,
I will....

I'll give it a shot..
So, this is what I came up with, please review.
- Heikki
Attachments:
disconnect-walsenders-on-target-tli-change.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a7762ea..c00b277 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -407,7 +407,6 @@ typedef struct XLogCtlData
XLogRecPtr *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
int XLogCacheBlck; /* highest allocated xlog buffer index */
TimeLineID ThisTimeLineID;
- TimeLineID RecoveryTargetTLI;
/*
* archiveCleanupCommand is read from recovery.conf but needs to be in
@@ -456,6 +455,8 @@ typedef struct XLogCtlData
XLogRecPtr recoveryLastRecPtr;
/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
TimestampTz recoveryLastXTime;
+ /* current effective recovery target timeline. */
+ TimeLineID RecoveryTargetTLI;
/*
* timestamp of when we started replaying the current chunk of WAL data,
@@ -4470,12 +4471,17 @@ rescanLatestTimeLine(void)
ThisTimeLineID)));
else
{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
/* Switch target */
recoveryTargetTLI = newtarget;
list_free(expectedTLIs);
expectedTLIs = newExpectedTLIs;
- XLogCtl->RecoveryTargetTLI = recoveryTargetTLI;
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->RecoveryTargetTLI = recoveryTargetTLI;
+ SpinLockRelease(&xlogctl->info_lck);
ereport(LOG,
(errmsg("new target timeline is %u",
@@ -7518,8 +7524,15 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
TimeLineID
GetRecoveryTargetTLI(void)
{
- /* RecoveryTargetTLI doesn't change so we need no lock to copy it */
- return XLogCtl->RecoveryTargetTLI;
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ TimeLineID result;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ result = xlogctl->RecoveryTargetTLI;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return result;
}
/*
@@ -8309,7 +8322,7 @@ CreateRestartPoint(int flags)
XLogRecPtr endptr;
/* Get the current (or recent) end of xlog */
- endptr = GetStandbyFlushRecPtr();
+ endptr = GetStandbyFlushRecPtr(NULL);
KeepLogSeg(endptr, &_logSegNo);
_logSegNo--;
@@ -9819,13 +9832,14 @@ do_pg_abort_backup(void)
* Get latest redo apply position.
*
* Optionally, returns the end byte position of the last restored
- * WAL segment. Callers not interested in that value may pass
- * NULL for restoreLastRecPtr.
+ * WAL segment, and the current recovery target timeline. Callers not
+ * interested in those values may pass NULL for restoreLastRecPtr and
+ * targetTLI.
*
* Exported to allow WALReceiver to read the pointer directly.
*/
XLogRecPtr
-GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr)
+GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr, TimeLineID *targetTli)
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
@@ -9835,6 +9849,8 @@ GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr)
recptr = xlogctl->recoveryLastRecPtr;
if (restoreLastRecPtr)
*restoreLastRecPtr = xlogctl->restoreLastRecPtr;
+ if (targetTli)
+ *targetTli = xlogctl->RecoveryTargetTLI;
SpinLockRelease(&xlogctl->info_lck);
return recptr;
@@ -9843,16 +9859,18 @@ GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr)
/*
* Get current standby flush position, ie, the last WAL position
* known to be fsync'd to disk in standby.
+ *
+ * If 'tli' is not NULL, it's set to the current recovery target timeline.
*/
XLogRecPtr
-GetStandbyFlushRecPtr(void)
+GetStandbyFlushRecPtr(TimeLineID *targetTLI)
{
XLogRecPtr receivePtr;
XLogRecPtr replayPtr;
XLogRecPtr restorePtr;
receivePtr = GetWalRcvWriteRecPtr(NULL);
- replayPtr = GetXLogReplayRecPtr(&restorePtr);
+ replayPtr = GetXLogReplayRecPtr(&restorePtr, targetTLI);
if (XLByteLT(receivePtr, replayPtr))
return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index d345761..b7c864e 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -247,7 +247,7 @@ pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
XLogRecPtr recptr;
char location[MAXFNAMELEN];
- recptr = GetXLogReplayRecPtr(NULL);
+ recptr = GetXLogReplayRecPtr(NULL, NULL);
if (recptr == 0)
PG_RETURN_NULL();
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8694570..3ce9eb7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -278,7 +278,8 @@ WalReceiverMain(void)
DisableWalRcvImmediateExit();
/* Initialize LogstreamResult, reply_message and feedback_message */
- LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
+ LogstreamResult.Write = LogstreamResult.Flush =
+ GetXLogReplayRecPtr(NULL, NULL);
MemSet(&reply_message, 0, sizeof(reply_message));
MemSet(&feedback_message, 0, sizeof(feedback_message));
@@ -654,7 +655,7 @@ XLogWalRcvSendReply(void)
/* Construct a new message */
reply_message.write = LogstreamResult.Write;
reply_message.flush = LogstreamResult.Flush;
- reply_message.apply = GetXLogReplayRecPtr(NULL);
+ reply_message.apply = GetXLogReplayRecPtr(NULL, NULL);
reply_message.sendTime = now;
elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 9eba180..68d0d6f 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -258,7 +258,7 @@ GetReplicationApplyDelay(void)
receivePtr = walrcv->receivedUpto;
SpinLockRelease(&walrcv->mutex);
- replayPtr = GetXLogReplayRecPtr(NULL);
+ replayPtr = GetXLogReplayRecPtr(NULL, NULL);
if (XLByteEQ(receivePtr, replayPtr))
return 0;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 38f7a3f..ae508da 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -303,7 +303,7 @@ IdentifySystem(void)
GetSystemIdentifier());
snprintf(tli, sizeof(tli), "%u", ThisTimeLineID);
- logptr = am_cascading_walsender ? GetStandbyFlushRecPtr() : GetInsertRecPtr();
+ logptr = am_cascading_walsender ? GetStandbyFlushRecPtr(NULL) : GetInsertRecPtr();
snprintf(xpos, sizeof(xpos), "%X/%X", (uint32) (logptr >> 32), (uint32) logptr);
@@ -1137,7 +1137,31 @@ XLogSend(char *msgbuf, bool *caughtup)
* subsequently crashes and restarts, slaves must not have applied any WAL
* that gets lost on the master.
*/
- SendRqstPtr = am_cascading_walsender ? GetStandbyFlushRecPtr() : GetFlushRecPtr();
+ if (am_cascading_walsender)
+ {
+ TimeLineID currentTargetTLI;
+ SendRqstPtr = GetStandbyFlushRecPtr(&currentTargetTLI);
+
+ /*
+ * If the recovery target timeline changed, bail out. It's a bit
+ * unfortunate that we have to just disconnect, but there is no way
+ * to tell the client that the timeline changed. We also don't know
+ * exactly where the switch happened, so we cannot safely try to send
+ * up to the switchover point before disconnecting.
+ */
+ if (currentTargetTLI != ThisTimeLineID)
+ {
+ if (!walsender_ready_to_stop)
+ ereport(LOG,
+ (errmsg("terminating walsender process to force cascaded standby "
+ "to update timeline and reconnect")));
+ walsender_ready_to_stop = true;
+ *caughtup = true;
+ return;
+ }
+ }
+ else
+ SendRqstPtr = GetFlushRecPtr();
/* Quick exit if nothing to do */
if (XLByteLE(SendRqstPtr, sentPtr))
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ec79870..771fac9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -285,8 +285,9 @@ extern bool RecoveryInProgress(void);
extern bool HotStandbyActive(void);
extern bool XLogInsertAllowed(void);
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
-extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
-extern XLogRecPtr GetStandbyFlushRecPtr(void);
+extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr,
+ TimeLineID *targetTli);
+extern XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *targetTLI);
extern XLogRecPtr GetXLogInsertRecPtr(void);
extern XLogRecPtr GetXLogWriteRecPtr(void);
extern bool RecoveryIsPaused(void);
Heikki Linnakangas <hlinnaka@iki.fi> writes:
Hmm, I was thinking that when walsender gets the position it can send the
WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the current
recovery timeline. If it has changed, refuse to send the new WAL and
terminate. That would be a fairly small change, it would just close the
window between requesting walsenders to terminate and them actually
terminating.
It looks to me like a bug fix that also applies to the non-cascading
situation. Is that right?
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Guys,
Is this a patch for 9.3?
I.e., do we need to delay the release for this?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 04.09.2012 03:02, Dimitri Fontaine wrote:
Heikki Linnakangas<hlinnaka@iki.fi> writes:
Hmm, I was thinking that when the walsender gets the position it can send
the WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close the
window between requesting walsenders to terminate and them actually
terminating.

It looks to me like a bug fix that also applies to the non-cascading
situation. Is that right?
No, only cascading replication is affected. In the non-cascading situation,
the timeline never changes in the master. It's only in cascading mode that
you have a problem, where the standby can cross timelines while it's
replaying the WAL and also sending it over to a cascading standby.
- Heikki
On 04.09.2012 15:41, Josh Berkus wrote:
Guys,
Is this a patch for 9.3?
I.e., do we need to delay the release for this?
It is for 9.2. I'll do a little bit more testing, and barring any
issues, commit the patch. What exactly is the schedule? Do we need to do
an RC2 because of this?
- Heikki
Heikki,
It is for 9.2. I'll do a little bit more testing, and barring any
issues, commit the patch. What exactly is the schedule? Do we need to do
an RC2 because of this?
We're currently scheduled to release next week. If we need to do an
RC2, we're going to have to do some fast rescheduling; we've already
started the publicity machine.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
Heikki,
It is for 9.2. I'll do a little bit more testing, and barring any
issues, commit the patch. What exactly is the schedule? Do we need to do
an RC2 because of this?
We're currently scheduled to release next week. If we need to do an
RC2, we're going to have to do some fast rescheduling; we've already
started the publicity machine.
At this point I would argue that the only thing that should abort the
launch is a bad regression. Minor bugs in new features (and this must
be minor if it wasn't noticed before) don't qualify.
Having said that, it'd be good to get it fixed if we can. The schedule
says to wrap 9.2.0 Thursday evening --- Heikki, can you get this fixed
tomorrow (Wednesday)?
regards, tom lane
On 03.09.2012 17:40, Heikki Linnakangas wrote:
On 03.09.2012 16:26, Heikki Linnakangas wrote:
On 03.09.2012 16:25, Fujii Masao wrote:
On Tue, Sep 4, 2012 at 7:07 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm, I was thinking that when the walsender gets the position it can send
the WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close the
window between requesting walsenders to terminate and them actually
terminating.

Yeah, sounds good. Could you implement the patch? If you don't have time,
I will....

I'll give it a shot..

So, this is what I came up with, please review.
While testing, I bumped into another related bug: When a WAL segment is
restored from the archive, we let a walsender send that whole WAL
segment to a cascading standby. However, there's no guarantee that the
restored WAL segment is complete. In particular, if a timeline changes
within that segment, e.g 000000010000000000000004, that segment will be
only partially full, and the WAL continues at segment
000000020000000000000004, at the next timeline. This can also happen if
you copy a partial WAL segment to the archive, for example from a
crashed master server. Or if you have set up record-based WAL shipping
not using streaming replication, per
http://www.postgresql.org/docs/devel/static/log-shipping-alternative.html#WARM-STANDBY-RECORD.
That manual page says you can only deal with whole WAL files that way,
but I think with standby_mode='on', that's actually no longer true.
So all in all, it seems like a shaky assumption that once you've
restored a WAL file from the archive, you're free to stream it to a
cascading slave. I think it would be more robust to limit it to
streaming the file only up to the point that it's been replayed - and
thus verified - in the 1st standby. If everyone is OK with that change
in behavior, the fix is simple.
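Concretely, that would mean dropping restorePtr from GetStandbyFlushRecPtr(),
something like this sketch:

XLogRecPtr
GetStandbyFlushRecPtr(void)
{
	XLogRecPtr	receivePtr;
	XLogRecPtr	replayPtr;

	receivePtr = GetWalRcvWriteRecPtr(NULL);
	replayPtr = GetXLogReplayRecPtr(NULL);

	/* send only WAL that was streamed in, or replayed (and thus verified) */
	return XLByteLT(receivePtr, replayPtr) ? replayPtr : receivePtr;
}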
- Heikki
On 04.09.2012 16:50, Tom Lane wrote:
Josh Berkus<josh@agliodbs.com> writes:
Heikki,
It is for 9.2. I'll do a little bit more testing, and barring any
issues, commit the patch. What exactly is the schedule? Do we need to do
an RC2 because of this?

We're currently scheduled to release next week. If we need to do an
RC2, we're going to have to do some fast rescheduling; we've already
started the publicity machine.

At this point I would argue that the only thing that should abort the
launch is a bad regression. Minor bugs in new features (and this must
be minor if it wasn't noticed before) don't qualify.

Having said that, it'd be good to get it fixed if we can. The schedule
says to wrap 9.2.0 Thursday evening --- Heikki, can you get this fixed
tomorrow (Wednesday)?
The attached patch fixes it for me. It fixes the original problem by
adding the missing locking and terminating walsenders on a target
timeline change, and also changes the behavior wrt. WAL segments
restored from the archive, as I just suggested in another email
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00206.php).
The test case I've been using is a master and two standbys. The first
standby is set up to connect to the master with streaming replication,
and the other standby is set up to connect to the 1st standby, i.e. it's
a cascading slave. In addition, the master is set up to do WAL archiving
to a directory, and both standbys have a restore_command to read from
that archive, and recovery_target_timeline='latest'. After the master and
both standbys are running, I create a dummy recovery.conf file in
the master's data directory, with just "restore_command='/bin/false'" in it,
and restart the master. That forces a timeline change in the master.
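For concreteness, both standbys' recovery.conf looks roughly like this
(the host name and archive path are placeholders; the 2nd standby's
primary_conninfo points at the 1st standby instead of the master):

standby_mode = 'on'
primary_conninfo = 'host=master port=5432'
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_timeline = 'latest'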
With the patch, the 1st standby will notice the new timeline in the
archive, switch to that, and reconnect to the master. The cascading
connection to the 2nd standby is terminated because of the timeline
change; the 2nd standby will also scan the archive and pick up the new
timeline, reconnect to the 1st standby, and be in sync again.
- Heikki
Attachments:
disconnect-walsenders-on-target-tli-change-2.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b75dab5..be01d63 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -406,7 +406,6 @@ typedef struct XLogCtlData
XLogRecPtr *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
int XLogCacheBlck; /* highest allocated xlog buffer index */
TimeLineID ThisTimeLineID;
- TimeLineID RecoveryTargetTLI;
/*
* archiveCleanupCommand is read from recovery.conf but needs to be in
@@ -455,6 +454,8 @@ typedef struct XLogCtlData
XLogRecPtr recoveryLastRecPtr;
/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
TimestampTz recoveryLastXTime;
+ /* current effective recovery target timeline */
+ TimeLineID RecoveryTargetTLI;
/*
* timestamp of when we started replaying the current chunk of WAL data,
@@ -4467,12 +4468,17 @@ rescanLatestTimeLine(void)
ThisTimeLineID)));
else
{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
/* Switch target */
recoveryTargetTLI = newtarget;
list_free(expectedTLIs);
expectedTLIs = newExpectedTLIs;
- XLogCtl->RecoveryTargetTLI = recoveryTargetTLI;
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->RecoveryTargetTLI = recoveryTargetTLI;
+ SpinLockRelease(&xlogctl->info_lck);
ereport(LOG,
(errmsg("new target timeline is %u",
@@ -7519,13 +7525,20 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
}
/*
- * GetRecoveryTargetTLI - get the recovery target timeline ID
+ * GetRecoveryTargetTLI - get the current recovery target timeline ID
*/
TimeLineID
GetRecoveryTargetTLI(void)
{
- /* RecoveryTargetTLI doesn't change so we need no lock to copy it */
- return XLogCtl->RecoveryTargetTLI;
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ TimeLineID result;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ result = xlogctl->RecoveryTargetTLI;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return result;
}
/*
@@ -8321,7 +8334,7 @@ CreateRestartPoint(int flags)
XLogRecPtr endptr;
/* Get the current (or recent) end of xlog */
- endptr = GetStandbyFlushRecPtr();
+ endptr = GetStandbyFlushRecPtr(NULL);
KeepLogSeg(endptr, &_logId, &_logSeg);
PrevLogSeg(_logId, _logSeg);
@@ -9837,14 +9850,13 @@ do_pg_abort_backup(void)
/*
* Get latest redo apply position.
*
- * Optionally, returns the end byte position of the last restored
- * WAL segment. Callers not interested in that value may pass
- * NULL for restoreLastRecPtr.
+ * Optionally, returns the current recovery target timeline. Callers not
+ * interested in that may pass NULL for targetTLI.
*
* Exported to allow WALReceiver to read the pointer directly.
*/
XLogRecPtr
-GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr)
+GetXLogReplayRecPtr(TimeLineID *targetTli)
{
/* use volatile pointer to prevent code rearrangement */
volatile XLogCtlData *xlogctl = XLogCtl;
@@ -9852,8 +9864,8 @@ GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr)
SpinLockAcquire(&xlogctl->info_lck);
recptr = xlogctl->recoveryLastRecPtr;
- if (restoreLastRecPtr)
- *restoreLastRecPtr = xlogctl->restoreLastRecPtr;
+ if (targetTli)
+ *targetTli = xlogctl->RecoveryTargetTLI;
SpinLockRelease(&xlogctl->info_lck);
return recptr;
@@ -9862,21 +9874,23 @@ GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr)
/*
* Get current standby flush position, ie, the last WAL position
* known to be fsync'd to disk in standby.
+ *
+ * If 'targetTli' is not NULL, it's set to the current recovery target
+ * timeline.
*/
XLogRecPtr
-GetStandbyFlushRecPtr(void)
+GetStandbyFlushRecPtr(TimeLineID *targetTLI)
{
XLogRecPtr receivePtr;
XLogRecPtr replayPtr;
- XLogRecPtr restorePtr;
receivePtr = GetWalRcvWriteRecPtr(NULL);
- replayPtr = GetXLogReplayRecPtr(&restorePtr);
+ replayPtr = GetXLogReplayRecPtr(targetTLI);
if (XLByteLT(receivePtr, replayPtr))
- return XLByteLT(replayPtr, restorePtr) ? restorePtr : replayPtr;
+ return replayPtr;
else
- return XLByteLT(receivePtr, restorePtr) ? restorePtr : receivePtr;
+ return receivePtr;
}
/*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3f060b8..064ddd5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -299,7 +299,7 @@ IdentifySystem(void)
GetSystemIdentifier());
snprintf(tli, sizeof(tli), "%u", ThisTimeLineID);
- logptr = am_cascading_walsender ? GetStandbyFlushRecPtr() : GetInsertRecPtr();
+ logptr = am_cascading_walsender ? GetStandbyFlushRecPtr(NULL) : GetInsertRecPtr();
snprintf(xpos, sizeof(xpos), "%X/%X",
logptr.xlogid, logptr.xrecoff);
@@ -1144,7 +1144,31 @@ XLogSend(char *msgbuf, bool *caughtup)
* subsequently crashes and restarts, slaves must not have applied any WAL
* that gets lost on the master.
*/
- SendRqstPtr = am_cascading_walsender ? GetStandbyFlushRecPtr() : GetFlushRecPtr();
+ if (am_cascading_walsender)
+ {
+ TimeLineID currentTargetTLI;
+ SendRqstPtr = GetStandbyFlushRecPtr(&currentTargetTLI);
+
+ /*
+ * If the recovery target timeline changed, bail out. It's a bit
+ * unfortunate that we have to just disconnect, but there is no way
+ * to tell the client that the timeline changed. We also don't know
+ * exactly where the switch happened, so we cannot safely try to send
+ * up to the switchover point before disconnecting.
+ */
+ if (currentTargetTLI != ThisTimeLineID)
+ {
+ if (!walsender_ready_to_stop)
+ ereport(LOG,
+ (errmsg("terminating walsender process to force cascaded standby "
+ "to update timeline and reconnect")));
+ walsender_ready_to_stop = true;
+ *caughtup = true;
+ return;
+ }
+ }
+ else
+ SendRqstPtr = GetFlushRecPtr();
/* Quick exit if nothing to do */
if (XLByteLE(SendRqstPtr, sentPtr))
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index df5f232..682bbe6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -286,8 +286,8 @@ extern bool RecoveryInProgress(void);
extern bool HotStandbyActive(void);
extern bool XLogInsertAllowed(void);
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
-extern XLogRecPtr GetXLogReplayRecPtr(XLogRecPtr *restoreLastRecPtr);
-extern XLogRecPtr GetStandbyFlushRecPtr(void);
+extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *targetTli);
+extern XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *targetTLI);
extern XLogRecPtr GetXLogInsertRecPtr(void);
extern XLogRecPtr GetXLogWriteRecPtr(void);
extern bool RecoveryIsPaused(void);
On 04.09.2012 17:34, Heikki Linnakangas wrote:
On 04.09.2012 16:50, Tom Lane wrote:
Josh Berkus<josh@agliodbs.com> writes:
Heikki,
It is for 9.2. I'll do a little bit more testing, and barring any
issues, commit the patch. What exactly is the schedule? Do we need to do
an RC2 because of this?

We're currently scheduled to release next week. If we need to do an
RC2, we're going to have to do some fast rescheduling; we've already
started the publicity machine.

At this point I would argue that the only thing that should abort the
launch is a bad regression. Minor bugs in new features (and this must
be minor if it wasn't noticed before) don't qualify.

Having said that, it'd be good to get it fixed if we can. The schedule
says to wrap 9.2.0 Thursday evening --- Heikki, can you get this fixed
tomorrow (Wednesday)?

The attached patch fixes it for me. It fixes the original problem by
adding the missing locking and terminating walsenders on a target
timeline change, and also changes the behavior wrt. WAL segments
restored from the archive, as I just suggested in another email
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00206.php).
Committed that.
- Heikki
On Tue, 2012-09-04 at 19:34 -0700, Heikki Linnakangas wrote:
The attached patch fixes it for me. It fixes the original problem by
adding the missing locking and terminating walsenders on a target
timeline change, and also changes the behavior wrt. WAL segments
restored from the archive, as I just suggested in another email
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00206.php).

Committed that.
New compiler warnings:
xlog.c: In function ‘XLogFileRead’:
xlog.c:2785:14: error: unused variable ‘endptr’ [-Werror=unused-variable]
xlog.c:2784:25: error: unused variable ‘xlogctl’ [-Werror=unused-variable]
On 04.09.2012 21:56, Peter Eisentraut wrote:
On Tue, 2012-09-04 at 19:34 -0700, Heikki Linnakangas wrote:
The attached patch fixes it for me. It fixes the original problem by
adding the missing locking and terminating walsenders on a target
timeline change, and also changes the behavior wrt. WAL segments
restored from the archive, as I just suggested in another email
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00206.php).

Committed that.
New compiler warnings:
xlog.c: In function ‘XLogFileRead’:
xlog.c:2785:14: error: unused variable ‘endptr’ [-Werror=unused-variable]
xlog.c:2784:25: error: unused variable ‘xlogctl’ [-Werror=unused-variable]
Fixed, thanks.
- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
On 04.09.2012 03:02, Dimitri Fontaine wrote:
Heikki Linnakangas<hlinnaka@iki.fi> writes:
Hmm, I was thinking that when the walsender gets the position it can send
the WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close the
window between requesting walsenders to terminate and them actually
terminating.

No, only cascading replication is affected. In the non-cascading situation,
the timeline never changes in the master. It's only in cascading mode that
you have a problem, where the standby can cross timelines while it's
replaying the WAL and also sending it over to a cascading standby.
It seems to me that it applies to connecting a standby to a newly
promoted standby too, since the timeline changes in that case too.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 05.09.2012 01:03, Dimitri Fontaine wrote:
Heikki Linnakangas<hlinnaka@iki.fi> writes:
On 04.09.2012 03:02, Dimitri Fontaine wrote:
Heikki Linnakangas<hlinnaka@iki.fi> writes:
Hmm, I was thinking that when the walsender gets the position it can send
the WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close the
window between requesting walsenders to terminate and them actually
terminating.

No, only cascading replication is affected. In the non-cascading situation,
the timeline never changes in the master. It's only in cascading mode that
you have a problem, where the standby can cross timelines while it's
replaying the WAL and also sending it over to a cascading standby.

It seems to me that it applies to connecting a standby to a newly
promoted standby too, since the timeline changes in that case too.
I was worried about that too at first, but Fujii pointed out that's OK:
see the last paragraph at
http://archives.postgresql.org/pgsql-hackers/2012-08/msg01203.php.

If you connect to a standby that was already promoted to be the new master,
it's no different from connecting to a master in general. It works. If
you connect just before a standby is promoted, it works because a
cascading standby pays attention to the recovery target timeline and to
the pointer to the last replayed WAL record. Promoting a standby doesn't
change the recovery target timeline or the last replayed WAL record; it
only sets XLogCtl->ThisTimeLineID. So the walsender in cascading mode will
send the WAL up to where the promotion happened, but will stop there until
it's terminated by the signal.
- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
I was worried about that too at first, but Fujii pointed out that's OK: see
last paragraph at
http://archives.postgresql.org/pgsql-hackers/2012-08/msg01203.php.
Mmm, ok.
I'm worried about a master-standby-standby setup where the master
disappears, we promote a standby, and the second standby now feeds from
the newly promoted standby. Well, we have to reconnect manually in this
case, but don't we need some similar stopgaps?
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 05.09.2012 07:55, Dimitri Fontaine wrote:
Heikki Linnakangas<hlinnaka@iki.fi> writes:
I was worried about that too at first, but Fujii pointed out that's OK:
see the last paragraph at
http://archives.postgresql.org/pgsql-hackers/2012-08/msg01203.php.

Mmm, ok.

I'm worried about a master-standby-standby setup where the master
disappears, we promote a standby, and the second standby now feeds from
the newly promoted standby. Well, we have to reconnect manually in this
case, but don't we need some similar stopgaps?
The second standby will have to reconnect, but it will happen automatically.
- Heikki