Teaching pg_receivexlog to follow timeline switches

Started by Heikki Linnakangasalmost 13 years ago18 messages
#1Heikki Linnakangas
hlinnakangas@vmware.com
1 attachment(s)

Now that a standby server can follow timeline switches through streaming
replication, we should do teach pg_receivexlog to do the same. Patch
attached.

I made one change to the way START_STREAMING command works, to better
support this. When a standby server reaches the timeline it's streaming
from the master, it stops streaming, fetches any missing timeline
history files, and parses the history file of the latest timeline to
figure out where to continue. However, I don't want to parse timeline
history files in pg_receivexlog. Better to keep it simple. So instead, I
modified the server-side code for START_STREAMING to return the next
timeline's ID at the end, and used that in pg_receivexlog. I also
modifed BASE_BACKUP to return not only the start XLogRecPtr, but also
the corresponding timeline ID. Otherwise we might try to start streaming
from wrong timeline if you issue a BASE_BACKUP at the same moment the
server switches to a new timeline.

When pg_receivexlog switches timeline, what to do with the partial file
on the old timeline? When the timeline changes in the middle of a WAL
segment, the segment old the old timeline is only half-filled. For
example, when timeline changes from 1 to 2, you'll have this in pg_xlog:

000000010000000000000006
000000010000000000000007
000000010000000000000008
000000020000000000000008
00000002.history

The segment 000000010000000000000008 is only half-filled, as the
timeline changed in the middle of that segment. The beginning portion of
that file is duplicated in 000000020000000000000008, with the
timeline-changing checkpoint record right after the duplicated portion.

When we stream that with pg_receivexlog, and hit the timeline switch,
we'll have this situation in the client:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial

What to do with the partial file? One option is to rename it to
000000010000000000000008. However, if you then kill pg_receivexlog
before it has finished streaming a full segment from the new timeline,
on restart it will try to begin streaming WAL segment
000000010000000000000009, because it sees that segment
000000010000000000000008 is already completed. That'd be wrong.

The best option seems to be to just leave the .partial file in place, so
as streaming progresses, you end up with:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial
000000020000000000000008
000000020000000000000009
00000002000000000000000A.partial

It feels a bit confusing to have that old partial file there, but that
seems like the most correct solution. That file is indeed partial. This
also ensures that if the server running on timeline 1 continues to
generate new WAL, and it fills 000000010000000000000008, we won't
confuse the partial segment with that name with a full one.

- Heikki

Attachments:

teach-receivexlog-to-switch-timelines-1.patchtext/x-diff; name=teach-receivexlog-to-switch-timelines-1.patchDownload
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index e14627c..baae59d 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1418,8 +1418,10 @@ The commands accepted in walsender mode are:
      <para>
       After streaming all the WAL on a timeline that is not the latest one,
       the server will end streaming by exiting the COPY mode. When the client
-      acknowledges this by also exiting COPY mode, the server responds with a
-      CommandComplete message, and is ready to accept a new command.
+      acknowledges this by also exiting COPY mode, the server sends a
+      single-row, single-column result set indicating the next timeline in
+      this server's history. That is followed by a CommandComplete message,
+      and the server is ready to accept a new command.
      </para>
 
      <para>
@@ -1784,7 +1786,9 @@ The commands accepted in walsender mode are:
      </para>
      <para>
       The first ordinary result set contains the starting position of the
-      backup, given in XLogRecPtr format as a single column in a single row.
+      backup, in a single row with two columns. The first column contains
+      the start position given in XLogRecPtr format, and the second column
+      contains the corresponding timeline ID.
      </para>
      <para>
       The second ordinary result set has one row for each tablespace.
@@ -1827,7 +1831,9 @@ The commands accepted in walsender mode are:
       <quote>ustar interchange format</> specified in the POSIX 1003.1-2008
       standard) dump of the tablespace contents, except that the two trailing
       blocks of zeroes specified in the standard are omitted.
-      After the tar data is complete, a final ordinary result set will be sent.
+      After the tar data is complete, a final ordinary result set will be sent,
+      containing the WAL end position of the backup, in the same format as
+      the start position.
      </para>
 
      <para>
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 46379bb..ad4f316 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -545,22 +545,26 @@ tliOfPointInHistory(XLogRecPtr ptr, List *history)
 }
 
 /*
- * Returns the point in history where we branched off the given timeline.
- * Returns InvalidXLogRecPtr if the timeline is current (= we have not
- * branched off from it), and throws an error if the timeline is not part of
- * this server's history.
+ * Returns the point in history where we branched off the given timeline,
+ * and the timeline we branched to (*nextTLI). Returns InvalidXLogRecPtr if
+ * the timeline is current, ie. we have not branched off from it, and throws
+ * an error if the timeline is not part of this server's history.
  */
 XLogRecPtr
-tliSwitchPoint(TimeLineID tli, List *history)
+tliSwitchPoint(TimeLineID tli, List *history, TimeLineID *nextTLI)
 {
 	ListCell   *cell;
 
+	if (nextTLI)
+		*nextTLI = 0;
 	foreach (cell, history)
 	{
 		TimeLineHistoryEntry *tle = (TimeLineHistoryEntry *) lfirst(cell);
 
 		if (tle->tli == tli)
 			return tle->end;
+		if (nextTLI)
+			*nextTLI = tle->tli;
 	}
 
 	ereport(ERROR,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 51a515a..14e6c2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5450,7 +5450,7 @@ StartupXLOG(void)
 		 * tliSwitchPoint will throw an error if the checkpoint's timeline
 		 * is not in expectedTLEs at all.
 		 */
-		switchpoint = tliSwitchPoint(ControlFile->checkPointCopy.ThisTimeLineID, expectedTLEs);
+		switchpoint = tliSwitchPoint(ControlFile->checkPointCopy.ThisTimeLineID, expectedTLEs, NULL);
 		ereport(FATAL,
 				(errmsg("requested timeline %u is not a child of this server's history",
 						recoveryTargetTLI),
@@ -8396,12 +8396,14 @@ XLogFileNameP(TimeLineID tli, XLogSegNo segno)
  * do_pg_stop_backup() or do_pg_abort_backup().
  */
 XLogRecPtr
-do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
+do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
+				   char **labelfile)
 {
 	bool		exclusive = (labelfile == NULL);
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
+	TimeLineID	starttli;
 	pg_time_t	stamp_time;
 	char		strfbuf[128];
 	char		xlogfilename[MAXFNAMELEN];
@@ -8543,6 +8545,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 			LWLockAcquire(ControlFileLock, LW_SHARED);
 			checkpointloc = ControlFile->checkPoint;
 			startpoint = ControlFile->checkPointCopy.redo;
+			starttli = ControlFile->checkPointCopy.ThisTimeLineID;
 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
 			LWLockRelease(ControlFileLock);
 
@@ -8676,6 +8679,8 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
 	 */
+	if (starttli_p)
+		*starttli_p = starttli;
 	return startpoint;
 }
 
@@ -8714,12 +8719,13 @@ pg_start_backup_callback(int code, Datum arg)
  * the non-exclusive backup specified by 'labelfile'.
  */
 XLogRecPtr
-do_pg_stop_backup(char *labelfile, bool waitforarchive)
+do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 {
 	bool		exclusive = (labelfile == NULL);
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	startpoint;
 	XLogRecPtr	stoppoint;
+	TimeLineID	stoptli;
 	XLogRecData rdata;
 	pg_time_t	stamp_time;
 	char		strfbuf[128];
@@ -8923,8 +8929,11 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 
 		LWLockAcquire(ControlFileLock, LW_SHARED);
 		stoppoint = ControlFile->minRecoveryPoint;
+		stoptli = ControlFile->minRecoveryPointTLI;
 		LWLockRelease(ControlFileLock);
 
+		if (stoptli_p)
+			*stoptli_p = stoptli;
 		return stoppoint;
 	}
 
@@ -8936,6 +8945,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	rdata.buffer = InvalidBuffer;
 	rdata.next = NULL;
 	stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END, &rdata);
+	stoptli = ThisTimeLineID;
 
 	/*
 	 * Force a switch to a new xlog segment file, so that the backup is valid
@@ -9051,6 +9061,8 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	/*
 	 * We're done.  As a convenience, return the ending WAL location.
 	 */
+	if (stoptli_p)
+		*stoptli_p = stoptli;
 	return stoppoint;
 }
 
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 96db5db..b6bb677 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -56,7 +56,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
 
 	backupidstr = text_to_cstring(backupid);
 
-	startpoint = do_pg_start_backup(backupidstr, fast, NULL);
+	startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
 
 	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
 			 (uint32) (startpoint >> 32), (uint32) startpoint);
@@ -82,7 +82,7 @@ pg_stop_backup(PG_FUNCTION_ARGS)
 	XLogRecPtr	stoppoint;
 	char		stopxlogstr[MAXFNAMELEN];
 
-	stoppoint = do_pg_stop_backup(NULL, true);
+	stoppoint = do_pg_stop_backup(NULL, true, NULL);
 
 	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
 			 (uint32) (stoppoint >> 32), (uint32) stoppoint);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 2330fcc..810ff4e 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -55,7 +55,7 @@ static void SendBackupHeader(List *tablespaces);
 static void base_backup_cleanup(int code, Datum arg);
 static void perform_base_backup(basebackup_options *opt, DIR *tblspcdir);
 static void parse_basebackup_options(List *options, basebackup_options *opt);
-static void SendXlogRecPtrResult(XLogRecPtr ptr);
+static void SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli);
 static int compareWalFileNames(const void *a, const void *b);
 
 /* Was the backup currently in-progress initiated in recovery mode? */
@@ -94,13 +94,16 @@ static void
 perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 {
 	XLogRecPtr	startptr;
+	TimeLineID	starttli;
 	XLogRecPtr	endptr;
+	TimeLineID	endtli;
 	char	   *labelfile;
 
 	backup_started_in_recovery = RecoveryInProgress();
 
-	startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &labelfile);
-	SendXlogRecPtrResult(startptr);
+	startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &starttli,
+								  &labelfile);
+	SendXlogRecPtrResult(startptr, starttli);
 
 	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 	{
@@ -218,7 +221,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
-	endptr = do_pg_stop_backup(labelfile, !opt->nowait);
+	endptr = do_pg_stop_backup(labelfile, !opt->nowait, &endtli);
 
 	if (opt->includewal)
 	{
@@ -426,7 +429,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 		/* Send CopyDone message for the last tar file */
 		pq_putemptymessage('c');
 	}
-	SendXlogRecPtrResult(endptr);
+	SendXlogRecPtrResult(endptr, endtli);
 }
 
 /*
@@ -635,17 +638,15 @@ SendBackupHeader(List *tablespaces)
  * XlogRecPtr record (in text format)
  */
 static void
-SendXlogRecPtrResult(XLogRecPtr ptr)
+SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli)
 {
 	StringInfoData buf;
 	char		str[MAXFNAMELEN];
 
-	snprintf(str, sizeof(str), "%X/%X", (uint32) (ptr >> 32), (uint32) ptr);
-
 	pq_beginmessage(&buf, 'T'); /* RowDescription */
-	pq_sendint(&buf, 1, 2);		/* 1 field */
+	pq_sendint(&buf, 2, 2);		/* 2 fields */
 
-	/* Field header */
+	/* Field headers */
 	pq_sendstring(&buf, "recptr");
 	pq_sendint(&buf, 0, 4);		/* table oid */
 	pq_sendint(&buf, 0, 2);		/* attnum */
@@ -653,11 +654,25 @@ SendXlogRecPtrResult(XLogRecPtr ptr)
 	pq_sendint(&buf, -1, 2);
 	pq_sendint(&buf, 0, 4);
 	pq_sendint(&buf, 0, 2);
+
+	pq_sendstring(&buf, "tli");
+	pq_sendint(&buf, 0, 4);		/* table oid */
+	pq_sendint(&buf, 0, 2);		/* attnum */
+	pq_sendint(&buf, INT8OID, 4);	/* type oid */
+	pq_sendint(&buf, -1, 2);
+	pq_sendint(&buf, 0, 4);
+	pq_sendint(&buf, 0, 2);
 	pq_endmessage(&buf);
 
 	/* Data row */
 	pq_beginmessage(&buf, 'D');
-	pq_sendint(&buf, 1, 2);		/* number of columns */
+	pq_sendint(&buf, 2, 2);		/* number of columns */
+
+	snprintf(str, sizeof(str), "%X/%X", (uint32) (ptr >> 32), (uint32) ptr);
+	pq_sendint(&buf, strlen(str), 4);	/* length */
+	pq_sendbytes(&buf, str, strlen(str));
+
+	snprintf(str, sizeof(str), "%u", tli);
 	pq_sendint(&buf, strlen(str), 4);	/* length */
 	pq_sendbytes(&buf, str, strlen(str));
 	pq_endmessage(&buf);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index ad7d1c9..ba138e7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -117,6 +117,7 @@ static uint32 sendOff = 0;
  * history forked off from that timeline at sendTimeLineValidUpto.
  */
 static TimeLineID	sendTimeLine = 0;
+static TimeLineID	sendTimeLineNextTLI = 0;
 static bool			sendTimeLineIsHistoric = false;
 static XLogRecPtr	sendTimeLineValidUpto = InvalidXLogRecPtr;
 
@@ -449,7 +450,8 @@ StartReplication(StartReplicationCmd *cmd)
 			 * requested start location is on that timeline.
 			 */
 			timeLineHistory = readTimeLineHistory(ThisTimeLineID);
-			switchpoint = tliSwitchPoint(cmd->timeline, timeLineHistory);
+			switchpoint = tliSwitchPoint(cmd->timeline, timeLineHistory,
+										 &sendTimeLineNextTLI);
 			list_free_deep(timeLineHistory);
 
 			/*
@@ -496,8 +498,7 @@ StartReplication(StartReplicationCmd *cmd)
 	streamingDoneSending = streamingDoneReceiving = false;
 
 	/* If there is nothing to stream, don't even enter COPY mode */
-	if (!sendTimeLineIsHistoric ||
-		cmd->startpoint < sendTimeLineValidUpto)
+	if (!sendTimeLineIsHistoric || cmd->startpoint < sendTimeLineValidUpto)
 	{
 		/*
 		 * When we first start replication the standby will be behind the primary.
@@ -554,10 +555,46 @@ StartReplication(StartReplicationCmd *cmd)
 		if (walsender_ready_to_stop)
 			proc_exit(0);
 		WalSndSetState(WALSNDSTATE_STARTUP);
+
+		Assert(streamingDoneSending && streamingDoneReceiving);
+	}
+
+	/*
+	 * Copy is finished now. Send a single-row result set indicating the next
+	 * timeline.
+	 */
+	if (sendTimeLineIsHistoric)
+	{
+		char		str[11];
+		snprintf(str, sizeof(str), "%u", sendTimeLineNextTLI);
+
+		pq_beginmessage(&buf, 'T'); /* RowDescription */
+		pq_sendint(&buf, 1, 2);		/* 1 field */
+
+		/* Field header */
+		pq_sendstring(&buf, "next_tli");
+		pq_sendint(&buf, 0, 4);		/* table oid */
+		pq_sendint(&buf, 0, 2);		/* attnum */
+		/*
+		 * int8 may seem like a surprising data type for this, but in theory
+		 * int4 would not be wide enough for this, as TimeLineID is unsigned.
+		 */
+		pq_sendint(&buf, INT8OID, 4);	/* type oid */
+		pq_sendint(&buf, -1, 2);
+		pq_sendint(&buf, 0, 4);
+		pq_sendint(&buf, 0, 2);
+		pq_endmessage(&buf);
+
+		/* Data row */
+		pq_beginmessage(&buf, 'D');
+		pq_sendint(&buf, 1, 2);		/* number of columns */
+		pq_sendint(&buf, strlen(str), 4);	/* length */
+		pq_sendbytes(&buf, str, strlen(str));
+		pq_endmessage(&buf);
 	}
 
-	/* Get out of COPY mode (CommandComplete). */
-	EndCommand("COPY 0", DestRemote);
+	/* Send CommandComplete message */
+	pq_puttextmessage('C', "START_STREAMING");
 }
 
 /*
@@ -1377,8 +1414,9 @@ XLogSend(bool *caughtup)
 			List	   *history;
 
 			history = readTimeLineHistory(ThisTimeLineID);
-			sendTimeLineValidUpto = tliSwitchPoint(sendTimeLine, history);
+			sendTimeLineValidUpto = tliSwitchPoint(sendTimeLine, history, &sendTimeLineNextTLI);
 			Assert(sentPtr <= sendTimeLineValidUpto);
+			Assert(sendTimeLine < sendTimeLineNextTLI);
 			list_free_deep(history);
 
 			/* the current send pointer should be <= the switchpoint */
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index ffc8826..fabd423 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -243,7 +243,7 @@ LogStreamerMain(logstreamer_param *param)
 	if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline,
 						   param->sysidentifier, param->xlogdir,
 						   reached_end_position, standby_message_timeout,
-						   true))
+						   NULL))
 
 		/*
 		 * Any errors will already have been reported in the function process,
@@ -1205,7 +1205,7 @@ BaseBackup(void)
 {
 	PGresult   *res;
 	char	   *sysidentifier;
-	uint32		timeline;
+	uint32		starttli;
 	char		current_path[MAXPGPATH];
 	char		escaped_label[MAXPGPATH];
 	int			i;
@@ -1244,7 +1244,6 @@ BaseBackup(void)
 		disconnect_and_exit(1);
 	}
 	sysidentifier = pg_strdup(PQgetvalue(res, 0, 0));
-	timeline = atoi(PQgetvalue(res, 0, 1));
 	PQclear(res);
 
 	/*
@@ -1276,18 +1275,30 @@ BaseBackup(void)
 				progname, PQerrorMessage(conn));
 		disconnect_and_exit(1);
 	}
-	if (PQntuples(res) != 1)
+	if (PQntuples(res) < 1)
 	{
 		fprintf(stderr, _("%s: no start point returned from server\n"),
 				progname);
 		disconnect_and_exit(1);
 	}
+	if (PQntuples(res) != 1 || PQnfields(res) < 2)
+	{
+		fprintf(stderr,
+				_("%s: server returned unexpected response to BASE_BACKUP command; got %d rows and %d fields, expected %d rows and %d fields\n"),
+				progname, PQntuples(res), PQnfields(res), 1, 2);
+		disconnect_and_exit(1);
+	}
+
 	strcpy(xlogstart, PQgetvalue(res, 0, 0));
-	if (verbose && includewal)
-		fprintf(stderr, "transaction log start point: %s\n", xlogstart);
+	starttli = atoi(PQgetvalue(res, 0, 1));
+
 	PQclear(res);
 	MemSet(xlogend, 0, sizeof(xlogend));
 
+	if (verbose && includewal)
+		fprintf(stderr, _("transaction log start point: %s on timeline %u\n"),
+				xlogstart, starttli);
+
 	/*
 	 * Get the header
 	 */
@@ -1343,7 +1354,7 @@ BaseBackup(void)
 		if (verbose)
 			fprintf(stderr, _("%s: starting background WAL receiver\n"),
 					progname);
-		StartLogStreamer(xlogstart, timeline, sysidentifier);
+		StartLogStreamer(xlogstart, starttli, sysidentifier);
 	}
 
 	/*
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 7f2db19..392c31c 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -39,8 +39,7 @@ volatile bool time_to_abort = false;
 
 
 static void usage(void);
-static XLogRecPtr FindStreamingStart(XLogRecPtr currentpos,
-				   uint32 currenttimeline);
+static XLogRecPtr FindStreamingStart(uint32 *tli);
 static void StreamLog();
 static bool stop_streaming(XLogRecPtr segendpos, uint32 timeline,
 			   bool segment_finished);
@@ -70,14 +69,33 @@ usage(void)
 }
 
 static bool
-stop_streaming(XLogRecPtr segendpos, uint32 timeline, bool segment_finished)
+stop_streaming(XLogRecPtr xlogpos, uint32 timeline, bool segment_finished)
 {
+	static uint32 prevtimeline = 0;
+	static XLogRecPtr prevpos = InvalidXLogRecPtr;
+
+	/* we assume that we get called once at the end of each segment */
 	if (verbose && segment_finished)
 		fprintf(stderr, _("%s: finished segment at %X/%X (timeline %u)\n"),
 				progname,
-				(uint32) (segendpos >> 32), (uint32) segendpos,
+				(uint32) (xlogpos >> 32), (uint32) xlogpos,
 				timeline);
 
+	/*
+	 * Note that we report the previous, not current, position here. That's
+	 * the exact location where the timeline switch happend. After the switch,
+	 * we restart streaming from the beginning of the segment, so xlogpos can
+	 * smaller than prevpos if we just switched to new timeline.
+	 */
+	if (verbose && (prevtimeline != 0 && prevtimeline != timeline))
+		fprintf(stderr, _("%s: switched to timeline %u at %X/%X\n"),
+				progname,
+				timeline,
+				(uint32) (prevpos >> 32), (uint32) prevpos);
+
+	prevtimeline = timeline;
+	prevpos = xlogpos;
+
 	if (time_to_abort)
 	{
 		fprintf(stderr, _("%s: received interrupt signal, exiting\n"),
@@ -88,20 +106,19 @@ stop_streaming(XLogRecPtr segendpos, uint32 timeline, bool segment_finished)
 }
 
 /*
- * Determine starting location for streaming, based on:
- * 1. If there are existing xlog segments, start at the end of the last one
- *	  that is complete (size matches XLogSegSize)
- * 2. If no valid xlog exists, start from the beginning of the current
- *	  WAL segment.
+ * Determine starting location for streaming, based on any existing xlog
+ * segments in the directory. We start at the end of the last one that is
+ * complete (size matches XLogSegSize), on the timeline with highest ID.
+ *
+ * If there are no WAL files in the directory, returns InvalidXLogRecPtr.
  */
 static XLogRecPtr
-FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
+FindStreamingStart(uint32 *tli)
 {
 	DIR		   *dir;
 	struct dirent *dirent;
-	int			i;
-	bool		b;
 	XLogSegNo	high_segno = 0;
+	uint32		high_tli = 0;
 
 	dir = opendir(basedir);
 	if (dir == NULL)
@@ -120,26 +137,13 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 					seg;
 		XLogSegNo	segno;
 
-		if (strcmp(dirent->d_name, ".") == 0 ||
-			strcmp(dirent->d_name, "..") == 0)
-			continue;
-
-		/* xlog files are always 24 characters */
-		if (strlen(dirent->d_name) != 24)
-			continue;
-
-		/* Filenames are always made out of 0-9 and A-F */
-		b = false;
-		for (i = 0; i < 24; i++)
-		{
-			if (!(dirent->d_name[i] >= '0' && dirent->d_name[i] <= '9') &&
-				!(dirent->d_name[i] >= 'A' && dirent->d_name[i] <= 'F'))
-			{
-				b = true;
-				break;
-			}
-		}
-		if (b)
+		/*
+		 * Check if the filename looks like an xlog file, or a .partial file.
+		 * Xlog files are always 24 characters, and .partial files are 32
+		 * characters.
+		 */
+		if (strlen(dirent->d_name) != 24 ||
+			!strspn(dirent->d_name, "0123456789ABCDEF") == 24)
 			continue;
 
 		/*
@@ -154,10 +158,6 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 		}
 		segno = ((uint64) log) << 32 | seg;
 
-		/* Ignore any files that are for another timeline */
-		if (tli != currenttimeline)
-			continue;
-
 		/* Check if this is a completed segment or not */
 		snprintf(fullpath, sizeof(fullpath), "%s/%s", basedir, dirent->d_name);
 		if (stat(fullpath, &statbuf) != 0)
@@ -170,9 +170,10 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 		if (statbuf.st_size == XLOG_SEG_SIZE)
 		{
 			/* Completed segment */
-			if (segno > high_segno)
+			if (segno > high_segno || (segno == high_segno && tli > high_tli))
 			{
 				high_segno = segno;
+				high_tli = tli;
 				continue;
 			}
 		}
@@ -199,10 +200,11 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 
 		XLogSegNoOffsetToRecPtr(high_segno, 0, high_ptr);
 
+		*tli = high_tli;
 		return high_ptr;
 	}
 	else
-		return currentpos;
+		return InvalidXLogRecPtr;
 }
 
 /*
@@ -212,8 +214,10 @@ static void
 StreamLog(void)
 {
 	PGresult   *res;
-	uint32		timeline;
 	XLogRecPtr	startpos;
+	uint32		starttli;
+	XLogRecPtr	serverpos;
+	uint32		servertli;
 	uint32		hi,
 				lo;
 
@@ -243,7 +247,7 @@ StreamLog(void)
 				progname, PQntuples(res), PQnfields(res), 1, 3);
 		disconnect_and_exit(1);
 	}
-	timeline = atoi(PQgetvalue(res, 0, 1));
+	servertli = atoi(PQgetvalue(res, 0, 1));
 	if (sscanf(PQgetvalue(res, 0, 2), "%X/%X", &hi, &lo) != 2)
 	{
 		fprintf(stderr,
@@ -251,13 +255,19 @@ StreamLog(void)
 				progname, PQgetvalue(res, 0, 2));
 		disconnect_and_exit(1);
 	}
-	startpos = ((uint64) hi) << 32 | lo;
+	serverpos = ((uint64) hi) << 32 | lo;
 	PQclear(res);
 
 	/*
 	 * Figure out where to start streaming.
 	 */
-	startpos = FindStreamingStart(startpos, timeline);
+	;
+	startpos = FindStreamingStart(&starttli);
+	if (startpos == InvalidXLogRecPtr)
+	{
+		startpos = serverpos;
+		starttli = servertli;
+	}
 
 	/*
 	 * Always start streaming at the beginning of a segment
@@ -271,10 +281,10 @@ StreamLog(void)
 		fprintf(stderr,
 				_("%s: starting log streaming at %X/%X (timeline %u)\n"),
 				progname, (uint32) (startpos >> 32), (uint32) startpos,
-				timeline);
+				starttli);
 
-	ReceiveXlogStream(conn, startpos, timeline, NULL, basedir,
-					  stop_streaming, standby_message_timeout, false);
+	ReceiveXlogStream(conn, startpos, starttli, NULL, basedir,
+					  stop_streaming, standby_message_timeout, ".partial");
 
 	PQfinish(conn);
 }
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 88d0c13..0216f0b 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -28,19 +28,24 @@
 #include "streamutil.h"
 
 
-/* fd for currently open WAL file */
+/* fd and filename for currently open WAL file */
 static int	walfile = -1;
+static char	current_walfile_name[MAXPGPATH] = "";
+
+static bool HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
+				 char *basedir, stream_stop_callback stream_stop,
+				 int standby_message_timeout, char *partial_suffix,
+				 XLogRecPtr *stoppos);
 
 /*
- * Open a new WAL file in the specified directory. Store the name
- * (not including the full directory) in namebuf. Assumes there is
- * enough room in this buffer...
+ * Open a new WAL file in the specified directory.
  *
- * The file will be padded to 16Mb with zeroes.
+ * The file will be padded to 16Mb with zeroes. The base filename (without
+ * partial_suffix) is stored in current_walfile_name.
  */
-static int
+static bool
 open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
-			 char *namebuf)
+			 char *partial_suffix)
 {
 	int			f;
 	char		fn[MAXPGPATH];
@@ -50,16 +55,17 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 	XLogSegNo	segno;
 
 	XLByteToSeg(startpoint, segno);
-	XLogFileName(namebuf, timeline, segno);
+	XLogFileName(current_walfile_name, timeline, segno);
 
-	snprintf(fn, sizeof(fn), "%s/%s.partial", basedir, namebuf);
+	snprintf(fn, sizeof(fn), "%s/%s%s", basedir, current_walfile_name,
+			 partial_suffix ? partial_suffix : "");
 	f = open(fn, O_WRONLY | O_CREAT | PG_BINARY, S_IRUSR | S_IWUSR);
 	if (f == -1)
 	{
 		fprintf(stderr,
 				_("%s: could not open transaction log file \"%s\": %s\n"),
 				progname, fn, strerror(errno));
-		return -1;
+		return false;
 	}
 
 	/*
@@ -72,17 +78,21 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 				_("%s: could not stat transaction log file \"%s\": %s\n"),
 				progname, fn, strerror(errno));
 		close(f);
-		return -1;
+		return false;
 	}
 	if (statbuf.st_size == XLogSegSize)
-		return f;				/* File is open and ready to use */
+	{
+		/* File is open and ready to use */
+		walfile = f;
+		return true;
+	}
 	if (statbuf.st_size != 0)
 	{
 		fprintf(stderr,
 				_("%s: transaction log file \"%s\" has %d bytes, should be 0 or %d\n"),
 				progname, fn, (int) statbuf.st_size, XLogSegSize);
 		close(f);
-		return -1;
+		return false;
 	}
 
 	/* New, empty, file. So pad it to 16Mb with zeroes */
@@ -97,7 +107,7 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 			free(zerobuf);
 			close(f);
 			unlink(fn);
-			return -1;
+			return false;
 		}
 	}
 	free(zerobuf);
@@ -108,42 +118,45 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir,
 				_("%s: could not seek to beginning of transaction log file \"%s\": %s\n"),
 				progname, fn, strerror(errno));
 		close(f);
-		return -1;
+		return false;
 	}
-	return f;
+	walfile = f;
+	return true;
 }
 
 /*
- * Close the current WAL file, and rename it to the correct filename if it's
- * complete.
- *
- * If segment_complete is true, rename the current WAL file even if we've not
- * completed writing the whole segment.
+ * Close the current WAL file (if open), and rename it to the correct
+ * filename if it's complete. On failure, prints an error message to stderr
+ * and returns false, otherwise returns true.
  */
 static bool
-close_walfile(char *basedir, char *walname, bool segment_complete)
+close_walfile(char *basedir, char *partial_suffix)
 {
-	off_t		currpos = lseek(walfile, 0, SEEK_CUR);
+	off_t		currpos;
 
+	if (walfile == -1)
+		return true;
+
+	currpos = lseek(walfile, 0, SEEK_CUR);
 	if (currpos == -1)
 	{
 		fprintf(stderr,
 			 _("%s: could not determine seek position in file \"%s\": %s\n"),
-				progname, walname, strerror(errno));
+				progname, current_walfile_name, strerror(errno));
 		return false;
 	}
 
 	if (fsync(walfile) != 0)
 	{
 		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
-				progname, walname, strerror(errno));
+				progname, current_walfile_name, strerror(errno));
 		return false;
 	}
 
 	if (close(walfile) != 0)
 	{
 		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
-				progname, walname, strerror(errno));
+				progname, current_walfile_name, strerror(errno));
 		walfile = -1;
 		return false;
 	}
@@ -153,24 +166,24 @@ close_walfile(char *basedir, char *walname, bool segment_complete)
 	 * Rename the .partial file only if we've completed writing the whole
 	 * segment or segment_complete is true.
 	 */
-	if (currpos == XLOG_SEG_SIZE || segment_complete)
+	if (currpos == XLOG_SEG_SIZE && partial_suffix)
 	{
 		char		oldfn[MAXPGPATH];
 		char		newfn[MAXPGPATH];
 
-		snprintf(oldfn, sizeof(oldfn), "%s/%s.partial", basedir, walname);
-		snprintf(newfn, sizeof(newfn), "%s/%s", basedir, walname);
+		snprintf(oldfn, sizeof(oldfn), "%s/%s%s", basedir, current_walfile_name, partial_suffix);
+		snprintf(newfn, sizeof(newfn), "%s/%s", basedir, current_walfile_name);
 		if (rename(oldfn, newfn) != 0)
 		{
 			fprintf(stderr, _("%s: could not rename file \"%s\": %s\n"),
-					progname, walname, strerror(errno));
+					progname, current_walfile_name, strerror(errno));
 			return false;
 		}
 	}
-	else
+	else if (partial_suffix)
 		fprintf(stderr,
 				_("%s: not renaming \"%s\", segment is not complete\n"),
-				progname, walname);
+				progname, current_walfile_name);
 
 	return true;
 }
@@ -234,6 +247,123 @@ localTimestampDifferenceExceeds(int64 start_time,
 }
 
 /*
+ * Check if a timeline history file exists.
+ */
+static bool
+existsTimeLineHistoryFile(char *basedir, TimeLineID tli)
+{
+	char		path[MAXPGPATH];
+	char		histfname[MAXFNAMELEN];
+	int			fd;
+
+	/*
+	 * Timeline 1 never has a history file. We treat that as if it existed,
+	 * since we never need to stream it.
+	 */
+	if (tli == 1)
+		return true;
+
+	TLHistoryFileName(histfname, tli);
+
+	snprintf(path, sizeof(path), "%s/%s", basedir, histfname);
+
+	fd = open(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+	{
+		if (errno != ENOENT)
+			fprintf(stderr, _("%s: could not open timeline history file \"%s\": %s"),
+					progname, path, strerror(errno));
+		return false;
+	}
+	else
+	{
+		close(fd);
+		return true;
+	}
+}
+
+static bool
+writeTimeLineHistoryFile(char *basedir, TimeLineID tli, char *filename, char *content)
+{
+	int			size = strlen(content);
+	char		path[MAXPGPATH];
+	char		tmppath[MAXPGPATH];
+	char		histfname[MAXFNAMELEN];
+	int			fd;
+
+	/*
+	 * Check that the server's idea of how timeline history files should be
+	 * named matches ours.
+	 */
+	TLHistoryFileName(histfname, tli);
+	if (strcmp(histfname, filename) != 0)
+	{
+		fprintf(stderr, _("%s: server reported unexpected history file name for timeline %u: %s"),
+				progname, tli, filename);
+		return false;
+	}
+
+	/*
+	 * Write into a temp file name.
+	 */
+	snprintf(tmppath, MAXPGPATH,  "%s.tmp", path);
+
+	unlink(tmppath);
+
+	fd = open(tmppath, O_WRONLY | O_CREAT | PG_BINARY, S_IRUSR | S_IWUSR);
+	if (fd < 0)
+	{
+		fprintf(stderr, _("%s: could not create timeline history file \"%s\": %s"),
+				progname, tmppath, strerror(errno));
+		return false;
+	}
+
+	errno = 0;
+	if ((int) write(fd, content, size) != size)
+	{
+		int			save_errno = errno;
+
+		/*
+		 * If we fail to make the file, delete it to release disk space
+		 */
+		unlink(tmppath);
+		errno = save_errno;
+
+		fprintf(stderr, _("%s: could not write timeline history file \"%s\": %s"),
+				progname, tmppath, strerror(errno));
+		return false;
+	}
+
+	if (fsync(fd) != 0)
+	{
+		fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+				progname, tmppath, strerror(errno));
+		return false;
+	}
+
+	if (close(fd) != 0)
+	{
+		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
+				progname, tmppath, strerror(errno));
+		return false;
+	}
+
+	/*
+	 * Now move the completed history file into place with its final name.
+	 */
+
+	snprintf(path, sizeof(path), "%s/%s", basedir, histfname);
+	if (rename(tmppath, path) < 0)
+	{
+		fprintf(stderr, _("%s: could not rename file \"%s\" to \"%s\": %s\n"),
+				progname, tmppath, path, strerror(errno));
+		return false;
+	}
+
+	return true;
+}
+
+/*
  * Converts an int64 to network byte order.
  */
 static void
@@ -314,7 +444,8 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
  * (by sending an extra IDENTIFY_SYSTEM command)
  *
  * All received segments will be written to the directory
- * specified by basedir.
+ * specified by basedir. This will also fetch any missing timeline history
+ * files.
  *
  * The stream_stop callback will be called every time data
  * is received, and whenever a segment is completed. If it returns
@@ -327,20 +458,22 @@ sendFeedback(PGconn *conn, XLogRecPtr blockpos, int64 now, bool replyRequested)
  * This message will only contain the write location, and never
  * flush or replay.
  *
+ * If 'partial_suffix' is not NULL, files are initially created with the
+ * given suffix, and the suffix is removed once the file is finished. That
+ * allows you to tell the difference between partial and completed files,
+ * so that you can continue later where you left.
+ *
  * Note: The log position *must* be at a log segment start!
  */
 bool
 ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 				  char *sysidentifier, char *basedir,
 				  stream_stop_callback stream_stop,
-				  int standby_message_timeout, bool rename_partial)
+				  int standby_message_timeout, char *partial_suffix)
 {
 	char		query[128];
-	char		current_walfile_name[MAXPGPATH];
 	PGresult   *res;
-	char	   *copybuf = NULL;
-	int64		last_status = -1;
-	XLogRecPtr	blockpos = InvalidXLogRecPtr;
+	XLogRecPtr	stoppos;
 
 	/*
 	 * The message format used in streaming replication changed in 9.3, so we
@@ -359,7 +492,7 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 
 	if (sysidentifier != NULL)
 	{
-		/* Validate system identifier and timeline hasn't changed */
+		/* Validate system identifier hasn't changed */
 		res = PQexec(conn, "IDENTIFY_SYSTEM");
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 		{
@@ -385,33 +518,176 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			PQclear(res);
 			return false;
 		}
-		if (timeline != atoi(PQgetvalue(res, 0, 1)))
+		if (timeline > atoi(PQgetvalue(res, 0, 1)))
 		{
 			fprintf(stderr,
-					_("%s: timeline does not match between base backup and streaming connection\n"),
-					progname);
+					_("%s: starting timeline %u is not present in the server\n"),
+					progname, timeline);
 			PQclear(res);
 			return false;
 		}
 		PQclear(res);
 	}
 
-	/* Initiate the replication stream at specified location */
-	snprintf(query, sizeof(query), "START_REPLICATION %X/%X",
-			 (uint32) (startpos >> 32), (uint32) startpos);
-	res = PQexec(conn, query);
-	if (PQresultStatus(res) != PGRES_COPY_BOTH)
+	while (1)
 	{
-		fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
-				progname, "START_REPLICATION", PQresultErrorMessage(res));
+		/*
+		 * Fetch the timeline history file for this timeline, if we don't
+		 * have it already.
+		 */
+		if (!existsTimeLineHistoryFile(basedir, timeline))
+		{
+			snprintf(query, sizeof(query), "TIMELINE_HISTORY %u", timeline);
+			res = PQexec(conn, query);
+			if (PQresultStatus(res) != PGRES_TUPLES_OK)
+			{
+				/* FIXME: we might send it ok, but get an error */
+				fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
+						progname, "TIMELINE_HISTORY", PQresultErrorMessage(res));
+				PQclear(res);
+				return false;
+			}
+
+			/*
+			 * The response to TIMELINE_HISTORY is a single row result set
+			 * with two fields: filename and content
+			 */
+			if (PQnfields(res) != 2 || PQntuples(res) != 1)
+			{
+				fprintf(stderr,
+						_("%s: unexpected response to TIMELINE_HISTORY command: got %d rows and %d fields, expected %d rows and %d fields\n"),
+					progname, PQntuples(res), PQnfields(res), 1, 2);
+			}
+
+			/* Write the history file to disk */
+			writeTimeLineHistoryFile(basedir, timeline,
+									 PQgetvalue(res, 0, 0),
+									 PQgetvalue(res, 0, 1));
+
+			PQclear(res);
+		}
+
+		/*
+		 * Before we start streaming from the requested location, check
+		 * if the callback tells us to stop here.
+		 */
+		if (stream_stop(startpos, timeline, false))
+			return true;
+
+		/* Initiate the replication stream at specified location */
+		snprintf(query, sizeof(query), "START_REPLICATION %X/%X TIMELINE %u",
+				 (uint32) (startpos >> 32), (uint32) startpos,
+				 timeline);
+		res = PQexec(conn, query);
+		if (PQresultStatus(res) != PGRES_COPY_BOTH)
+		{
+			fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
+					progname, "START_REPLICATION", PQresultErrorMessage(res));
+			PQclear(res);
+			return false;
+		}
 		PQclear(res);
-		return false;
+
+		/* Stream the WAL */
+		if (!HandleCopyStream(conn, startpos, timeline, basedir, stream_stop,
+							  standby_message_timeout, partial_suffix,
+							  &stoppos))
+			goto error;
+
+		/*
+		 * Streaming finished.
+		 *
+		 * There are two possible reasons for that: a controlled shutdown,
+		 * or we reached the end of the current timeline. In case of
+		 * end-of-timeline, the server sends a result set after Copy has
+		 * finished, containing the next timeline's ID. Read that, and
+		 * restart streaming from the next timeline.
+		 */
+
+		res = PQgetResult(conn);
+		if (PQresultStatus(res) == PGRES_TUPLES_OK)
+		{
+			/*
+			 * End-of-timeline. Read the next timeline's ID.
+			 */
+			uint32		newtimeline;
+
+			newtimeline = atoi(PQgetvalue(res, 0, 0));
+			PQclear(res);
+
+			if (newtimeline <= timeline)
+			{
+				/* shouldn't happen */
+				fprintf(stderr,
+						"server reported unexpected next timeline %u, following timeline %u\n",
+						newtimeline, timeline);
+				goto error;
+			}
+
+			/* Read the final result, which should be CommandComplete. */
+			res = PQgetResult(conn);
+			if (PQresultStatus(res) != PGRES_COMMAND_OK)
+			{
+				fprintf(stderr,
+						_("%s: unexpected termination of replication stream: %s"),
+						progname, PQresultErrorMessage(res));
+				goto error;
+			}
+			PQclear(res);
+
+			/*
+			 * Loop back to start streaming from the new timeline.
+			 * Always start streaming at the beginning of a segment.
+			 */
+			timeline = newtimeline;
+			startpos = stoppos - (stoppos % XLOG_SEG_SIZE);
+			continue;
+		}
+		else if (PQresultStatus(res) == PGRES_COMMAND_OK)
+		{
+			/*
+			 * End of replication (ie. controlled shut down of the server).
+			 *
+			 * Check if the callback thinks it's OK to stop here. If not,
+			 * complain.
+			 */
+			if (stream_stop(stoppos, timeline, false))
+				return true;
+			else
+			{
+				fprintf(stderr, _("%s: replication stream was terminated before stop point\n"),
+						progname);
+				goto error;
+			}
+		}
 	}
-	PQclear(res);
 
-	/*
-	 * Receive the actual xlog data
-	 */
+error:
+	if (walfile != -1 && close(walfile) != 0)
+		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
+				progname, current_walfile_name, strerror(errno));
+	walfile = -1;
+	return false;
+}
+
+/*
+ * The main loop of ReceiveXLogStream. Handles the COPY stream after
+ * initiating streaming with the START_STREAMING command.
+ *
+ * If the COPY ends normally, returns true and sets *stoppos to the last
+ * byte written. On error, returns false.
+ */
+static bool
+HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
+				 char *basedir, stream_stop_callback stream_stop,
+				 int standby_message_timeout, char *partial_suffix,
+				 XLogRecPtr *stoppos)
+{
+	char	   *copybuf = NULL;
+	int64		last_status = -1;
+	XLogRecPtr	blockpos = startpos;
+	bool		still_sending = true;
+
 	while (1)
 	{
 		int			r;
@@ -430,20 +706,27 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 		/*
 		 * Check if we should continue streaming, or abort at this point.
 		 */
-		if (stream_stop && stream_stop(blockpos, timeline, false))
+		if (still_sending && stream_stop(blockpos, timeline, false))
 		{
-			if (walfile != -1 && !close_walfile(basedir, current_walfile_name,
-												rename_partial))
+			if (!close_walfile(basedir, partial_suffix))
+			{
 				/* Potential error message is written by close_walfile */
 				goto error;
-			return true;
+			}
+			if (PQputCopyEnd(conn, NULL) <= 0 || PQflush(conn))
+			{
+				fprintf(stderr, _("%s: could not send copy-end packet: %s"),
+						progname, PQerrorMessage(conn));
+				goto error;
+			}
+			still_sending = false;
 		}
 
 		/*
 		 * Potentially send a status message to the master
 		 */
 		now = localGetCurrentTimestamp();
-		if (standby_message_timeout > 0 &&
+		if (still_sending && standby_message_timeout > 0 &&
 			localTimestampDifferenceExceeds(last_status, now,
 											standby_message_timeout))
 		{
@@ -457,9 +740,8 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 		if (r == 0)
 		{
 			/*
-			 * In async mode, and no data available. We block on reading but
-			 * not more than the specified timeout, so that we can send a
-			 * response back to the client.
+			 * No data available. Wait for some to appear, but not longer
+			 * than the specified timeout, so that we can ping the server.
 			 */
 			fd_set		input_mask;
 			struct timeval timeout;
@@ -467,7 +749,7 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 
 			FD_ZERO(&input_mask);
 			FD_SET(PQsocket(conn), &input_mask);
-			if (standby_message_timeout)
+			if (standby_message_timeout && still_sending)
 			{
 				int64		targettime;
 				long		secs;
@@ -493,8 +775,8 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			{
 				/*
 				 * Got a timeout or signal. Continue the loop and either
-				 * deliver a status packet to the server or just go back into
-				 * blocking.
+				 * deliver a status packet to the server or just go back
+				 * into blocking.
 				 */
 				continue;
 			}
@@ -515,8 +797,31 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			continue;
 		}
 		if (r == -1)
-			/* End of copy stream */
-			break;
+		{
+			/*
+			 * The server closed its end of the copy stream. Close ours
+			 * if we haven't done so already, and exit.
+			 */
+			if (still_sending)
+			{
+				if (!close_walfile(basedir, partial_suffix))
+				{
+					/* Error message written in close_walfile() */
+					goto error;
+				}
+				if (PQputCopyEnd(conn, NULL) <= 0 || PQflush(conn))
+				{
+					fprintf(stderr, _("%s: could not send copy-end packet: %s"),
+							progname, PQerrorMessage(conn));
+					goto error;
+				}
+				still_sending = false;
+			}
+			if (copybuf != NULL)
+				PQfreemem(copybuf);
+			*stoppos = blockpos;
+			return true;
+		}
 		if (r == -2)
 		{
 			fprintf(stderr, _("%s: could not read COPY data: %s"),
@@ -548,174 +853,148 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			replyRequested = copybuf[pos];
 
 			/* If the server requested an immediate reply, send one. */
-			if (replyRequested)
+			if (replyRequested && still_sending)
 			{
 				now = localGetCurrentTimestamp();
 				if (!sendFeedback(conn, blockpos, now, false))
 					goto error;
 				last_status = now;
 			}
-			continue;
 		}
-		else if (copybuf[0] != 'w')
+		else if (copybuf[0] == 'w')
 		{
-			fprintf(stderr, _("%s: unrecognized streaming header: \"%c\"\n"),
-					progname, copybuf[0]);
-			goto error;
-		}
-
-		/*
-		 * Read the header of the XLogData message, enclosed in the CopyData
-		 * message. We only need the WAL location field (dataStart), the rest
-		 * of the header is ignored.
-		 */
-		hdr_len = 1;	/* msgtype 'w' */
-		hdr_len += 8;	/* dataStart */
-		hdr_len += 8;	/* walEnd */
-		hdr_len += 8;	/* sendTime */
-		if (r < hdr_len + 1)
-		{
-			fprintf(stderr, _("%s: streaming header too small: %d\n"),
-					progname, r);
-			goto error;
-		}
-		blockpos = recvint64(&copybuf[1]);
-
-		/* Extract WAL location for this block */
-		xlogoff = blockpos % XLOG_SEG_SIZE;
+			/*
+			 * Once we've decided we don't want to receive any more, just
+			 * ignore any subsequent XLogData messages.
+			 */
+			if (!still_sending)
+				continue;
 
-		/*
-		 * Verify that the initial location in the stream matches where we
-		 * think we are.
-		 */
-		if (walfile == -1)
-		{
-			/* No file open yet */
-			if (xlogoff != 0)
-			{
-				fprintf(stderr,
-						_("%s: received transaction log record for offset %u with no file open\n"),
-						progname, xlogoff);
-				goto error;
-			}
-		}
-		else
-		{
-			/* More data in existing segment */
-			/* XXX: store seek value don't reseek all the time */
-			if (lseek(walfile, 0, SEEK_CUR) != xlogoff)
+			/*
+			 * Read the header of the XLogData message, enclosed in the
+			 * CopyData message. We only need the WAL location field
+			 * (dataStart), the rest of the header is ignored.
+			 */
+			hdr_len = 1;	/* msgtype 'w' */
+			hdr_len += 8;	/* dataStart */
+			hdr_len += 8;	/* walEnd */
+			hdr_len += 8;	/* sendTime */
+			if (r < hdr_len + 1)
 			{
-				fprintf(stderr,
-						_("%s: got WAL data offset %08x, expected %08x\n"),
-						progname, xlogoff, (int) lseek(walfile, 0, SEEK_CUR));
+				fprintf(stderr, _("%s: streaming header too small: %d\n"),
+						progname, r);
 				goto error;
 			}
-		}
+			blockpos = recvint64(&copybuf[1]);
 
-		bytes_left = r - hdr_len;
-		bytes_written = 0;
-
-		while (bytes_left)
-		{
-			int			bytes_to_write;
+			/* Extract WAL location for this block */
+			xlogoff = blockpos % XLOG_SEG_SIZE;
 
 			/*
-			 * If crossing a WAL boundary, only write up until we reach
-			 * XLOG_SEG_SIZE.
+			 * Verify that the initial location in the stream matches where
+			 * we think we are.
 			 */
-			if (xlogoff + bytes_left > XLOG_SEG_SIZE)
-				bytes_to_write = XLOG_SEG_SIZE - xlogoff;
-			else
-				bytes_to_write = bytes_left;
-
 			if (walfile == -1)
 			{
-				walfile = open_walfile(blockpos, timeline,
-									   basedir, current_walfile_name);
-				if (walfile == -1)
-					/* Error logged by open_walfile */
+				/* No file open yet */
+				if (xlogoff != 0)
+				{
+					fprintf(stderr,
+							_("%s: received transaction log record for offset %u with no file open\n"),
+							progname, xlogoff);
 					goto error;
+				}
 			}
-
-			if (write(walfile,
-					  copybuf + hdr_len + bytes_written,
-					  bytes_to_write) != bytes_to_write)
+			else
 			{
-				fprintf(stderr,
-				  _("%s: could not write %u bytes to WAL file \"%s\": %s\n"),
-						progname, bytes_to_write, current_walfile_name,
-						strerror(errno));
-				goto error;
+				/* More data in existing segment */
+				/* XXX: store seek value don't reseek all the time */
+				if (lseek(walfile, 0, SEEK_CUR) != xlogoff)
+				{
+					fprintf(stderr,
+							_("%s: got WAL data offset %08x, expected %08x\n"),
+							progname, xlogoff, (int) lseek(walfile, 0, SEEK_CUR));
+					goto error;
+				}
 			}
 
-			/* Write was successful, advance our position */
-			bytes_written += bytes_to_write;
-			bytes_left -= bytes_to_write;
-			blockpos += bytes_to_write;
-			xlogoff += bytes_to_write;
+			bytes_left = r - hdr_len;
+			bytes_written = 0;
 
-			/* Did we reach the end of a WAL segment? */
-			if (blockpos % XLOG_SEG_SIZE == 0)
+			while (bytes_left)
 			{
-				if (!close_walfile(basedir, current_walfile_name, false))
-					/* Error message written in close_walfile() */
-					goto error;
+				int			bytes_to_write;
 
-				xlogoff = 0;
+				/*
+				 * If crossing a WAL boundary, only write up until we reach
+				 * XLOG_SEG_SIZE.
+				 */
+				if (xlogoff + bytes_left > XLOG_SEG_SIZE)
+					bytes_to_write = XLOG_SEG_SIZE - xlogoff;
+				else
+					bytes_to_write = bytes_left;
 
-				if (stream_stop != NULL)
+				if (walfile == -1)
 				{
-					/*
-					 * Callback when the segment finished, and return if it
-					 * told us to.
-					 */
-					if (stream_stop(blockpos, timeline, true))
-						return true;
+					if (!open_walfile(blockpos, timeline,
+									  basedir, partial_suffix))
+					{
+						/* Error logged by open_walfile */
+						goto error;
+					}
 				}
-			}
-		}
-		/* No more data left to write, start receiving next copy packet */
-	}
 
-	/*
-	 * The only way to get out of the loop is if the server shut down the
-	 * replication stream. If it's a controlled shutdown, the server will send
-	 * a shutdown message, and we'll return the latest xlog location that has
-	 * been streamed.
-	 */
+				if (write(walfile,
+						  copybuf + hdr_len + bytes_written,
+						  bytes_to_write) != bytes_to_write)
+				{
+					fprintf(stderr,
+							_("%s: could not write %u bytes to WAL file \"%s\": %s\n"),
+							progname, bytes_to_write, current_walfile_name,
+							strerror(errno));
+					goto error;
+				}
 
-	res = PQgetResult(conn);
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-	{
-		fprintf(stderr,
-				_("%s: unexpected termination of replication stream: %s"),
-				progname, PQresultErrorMessage(res));
-		goto error;
-	}
-	PQclear(res);
+				/* Write was successful, advance our position */
+				bytes_written += bytes_to_write;
+				bytes_left -= bytes_to_write;
+				blockpos += bytes_to_write;
+				xlogoff += bytes_to_write;
 
-	/* Complain if we've not reached stop point yet */
-	if (stream_stop != NULL && !stream_stop(blockpos, timeline, false))
-	{
-		fprintf(stderr, _("%s: replication stream was terminated before stop point\n"),
-				progname);
-		goto error;
+				/* Did we reach the end of a WAL segment? */
+				if (blockpos % XLOG_SEG_SIZE == 0)
+				{
+					if (!close_walfile(basedir, partial_suffix))
+						/* Error message written in close_walfile() */
+						goto error;
+
+					xlogoff = 0;
+
+					if (still_sending && stream_stop(blockpos, timeline, false))
+					{
+						if (PQputCopyEnd(conn, NULL) <= 0 || PQflush(conn))
+						{
+							fprintf(stderr, _("%s: could not send copy-end packet: %s"),
+									progname, PQerrorMessage(conn));
+							goto error;
+						}
+						still_sending = false;
+						break; /* ignore the rest of this XLogData packet */
+					}
+				}
+			}
+			/* No more data left to write, receive next copy packet */
+		}
+		else
+		{
+			fprintf(stderr, _("%s: unrecognized streaming header: \"%c\"\n"),
+					progname, copybuf[0]);
+			goto error;
+		}
 	}
 
-	if (copybuf != NULL)
-		PQfreemem(copybuf);
-	if (walfile != -1 && close(walfile) != 0)
-		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
-				progname, current_walfile_name, strerror(errno));
-	walfile = -1;
-	return true;
-
 error:
 	if (copybuf != NULL)
 		PQfreemem(copybuf);
-	if (walfile != -1 && close(walfile) != 0)
-		fprintf(stderr, _("%s: could not close file \"%s\": %s\n"),
-				progname, current_walfile_name, strerror(errno));
-	walfile = -1;
 	return false;
 }
diff --git a/src/bin/pg_basebackup/receivelog.h b/src/bin/pg_basebackup/receivelog.h
index 7176a68..6d2a0fb 100644
--- a/src/bin/pg_basebackup/receivelog.h
+++ b/src/bin/pg_basebackup/receivelog.h
@@ -4,7 +4,9 @@
  * Called before trying to read more data or when a segment is
  * finished. Return true to stop streaming.
  */
-typedef bool (*stream_stop_callback) (XLogRecPtr segendpos, uint32 timeline, bool segment_finished);
+typedef bool (*stream_stop_callback) (XLogRecPtr segendpos, uint32 timeline,
+									  bool segment_switch);
+
 
 extern bool ReceiveXlogStream(PGconn *conn,
 				  XLogRecPtr startpos,
@@ -13,4 +15,4 @@ extern bool ReceiveXlogStream(PGconn *conn,
 				  char *basedir,
 				  stream_stop_callback stream_stop,
 				  int standby_message_timeout,
-				  bool rename_partial);
+				  char *partial_suffix);
diff --git a/src/include/access/timeline.h b/src/include/access/timeline.h
index dd16f97..a161ebe 100644
--- a/src/include/access/timeline.h
+++ b/src/include/access/timeline.h
@@ -37,6 +37,6 @@ extern void writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 extern void writeTimeLineHistoryFile(TimeLineID tli, char *content, int size);
 extern bool tliInHistory(TimeLineID tli, List *expectedTLIs);
 extern TimeLineID tliOfPointInHistory(XLogRecPtr ptr, List *history);
-extern XLogRecPtr tliSwitchPoint(TimeLineID tli, List *history);
+extern XLogRecPtr tliSwitchPoint(TimeLineID tli, List *history, TimeLineID *nextTLI);
 
 #endif   /* TIMELINE_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 885b5fc..fbedbf9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -317,8 +317,8 @@ extern void SetWalWriterSleeping(bool sleeping);
 /*
  * Starting/stopping a base backup
  */
-extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile);
-extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive);
+extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p, char **labelfile);
+extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p);
 extern void do_pg_abort_backup(void);
 
 /* File path names (all relative to $PGDATA) */
#2Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Teaching pg_receivexlog to follow timeline switches

On Tue, Jan 15, 2013 at 11:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Now that a standby server can follow timeline switches through streaming
replication, we should do teach pg_receivexlog to do the same. Patch
attached.

I made one change to the way START_STREAMING command works, to better
support this. When a standby server reaches the timeline it's streaming from
the master, it stops streaming, fetches any missing timeline history files,
and parses the history file of the latest timeline to figure out where to
continue. However, I don't want to parse timeline history files in
pg_receivexlog. Better to keep it simple. So instead, I modified the
server-side code for START_STREAMING to return the next timeline's ID at the
end, and used that in pg_receivexlog. I also modifed BASE_BACKUP to return
not only the start XLogRecPtr, but also the corresponding timeline ID.
Otherwise we might try to start streaming from wrong timeline if you issue a
BASE_BACKUP at the same moment the server switches to a new timeline.

When pg_receivexlog switches timeline, what to do with the partial file on
the old timeline? When the timeline changes in the middle of a WAL segment,
the segment old the old timeline is only half-filled. For example, when
timeline changes from 1 to 2, you'll have this in pg_xlog:

000000010000000000000006
000000010000000000000007
000000010000000000000008
000000020000000000000008
00000002.history

The segment 000000010000000000000008 is only half-filled, as the timeline
changed in the middle of that segment. The beginning portion of that file is
duplicated in 000000020000000000000008, with the timeline-changing
checkpoint record right after the duplicated portion.

When we stream that with pg_receivexlog, and hit the timeline switch, we'll
have this situation in the client:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial

What to do with the partial file? One option is to rename it to
000000010000000000000008. However, if you then kill pg_receivexlog before it
has finished streaming a full segment from the new timeline, on restart it
will try to begin streaming WAL segment 000000010000000000000009, because it
sees that segment 000000010000000000000008 is already completed. That'd be
wrong.

Can't we rename .partial file safely after we receive a full segment
of the WAL file
with new timeline and the same logid/segmentid?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#2)
Re: Teaching pg_receivexlog to follow timeline switches

On 15.01.2013 20:22, Fujii Masao wrote:

On Tue, Jan 15, 2013 at 11:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Now that a standby server can follow timeline switches through streaming
replication, we should do teach pg_receivexlog to do the same. Patch
attached.

I made one change to the way START_STREAMING command works, to better
support this. When a standby server reaches the timeline it's streaming from
the master, it stops streaming, fetches any missing timeline history files,
and parses the history file of the latest timeline to figure out where to
continue. However, I don't want to parse timeline history files in
pg_receivexlog. Better to keep it simple. So instead, I modified the
server-side code for START_STREAMING to return the next timeline's ID at the
end, and used that in pg_receivexlog. I also modifed BASE_BACKUP to return
not only the start XLogRecPtr, but also the corresponding timeline ID.
Otherwise we might try to start streaming from wrong timeline if you issue a
BASE_BACKUP at the same moment the server switches to a new timeline.

When pg_receivexlog switches timeline, what to do with the partial file on
the old timeline? When the timeline changes in the middle of a WAL segment,
the segment old the old timeline is only half-filled. For example, when
timeline changes from 1 to 2, you'll have this in pg_xlog:

000000010000000000000006
000000010000000000000007
000000010000000000000008
000000020000000000000008
00000002.history

The segment 000000010000000000000008 is only half-filled, as the timeline
changed in the middle of that segment. The beginning portion of that file is
duplicated in 000000020000000000000008, with the timeline-changing
checkpoint record right after the duplicated portion.

When we stream that with pg_receivexlog, and hit the timeline switch, we'll
have this situation in the client:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial

What to do with the partial file? One option is to rename it to
000000010000000000000008. However, if you then kill pg_receivexlog before it
has finished streaming a full segment from the new timeline, on restart it
will try to begin streaming WAL segment 000000010000000000000009, because it
sees that segment 000000010000000000000008 is already completed. That'd be
wrong.

Can't we rename .partial file safely after we receive a full segment
of the WAL file
with new timeline and the same logid/segmentid?

I'd prefer to leave the .partial suffix in place, as the segment really
isn't complete. It doesn't make a difference when you recover to the
latest timeline, but if you have a more complicated scenario with
multiple timelines that are still "alive", ie. there's a server still
actively generating WAL on that timeline, you'll easily get confused.

As an example, imagine that you have a master server, and one standby.
You maintain a WAL archive for backup purposes with pg_receivexlog,
connected to the standby. Now, for some reason, you get a split-brain
situation and the standby server is promoted with new timeline 2, while
the real master is still running. The DBA notices the problem, and kills
the standby and pg_receivexlog. He deletes the XLOG files belonging to
timeline 2 in pg_receivexlog's target directory, and re-points
pg_recevexlog to the master while he re-builds the standby server from
backup. At that point, pg_receivexlog will start streaming from the end
of the zero-padded segment, not knowing that it was partial, and you
have a hole in the archived WAL stream. Oops.

The DBA could avoid that by also removing the last WAL segment on
timeline 1, the one that was partial. But it's really not obvious that
there's anything wrong with that segment. Keeping the .partial suffix
makes it clear.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Teaching pg_receivexlog to follow timeline switches

On Thu, Jan 17, 2013 at 1:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 15.01.2013 20:22, Fujii Masao wrote:

On Tue, Jan 15, 2013 at 11:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Now that a standby server can follow timeline switches through streaming
replication, we should do teach pg_receivexlog to do the same. Patch
attached.

I made one change to the way START_STREAMING command works, to better
support this. When a standby server reaches the timeline it's streaming
from
the master, it stops streaming, fetches any missing timeline history
files,
and parses the history file of the latest timeline to figure out where to
continue. However, I don't want to parse timeline history files in
pg_receivexlog. Better to keep it simple. So instead, I modified the
server-side code for START_STREAMING to return the next timeline's ID at
the
end, and used that in pg_receivexlog. I also modifed BASE_BACKUP to
return
not only the start XLogRecPtr, but also the corresponding timeline ID.
Otherwise we might try to start streaming from wrong timeline if you
issue a
BASE_BACKUP at the same moment the server switches to a new timeline.

When pg_receivexlog switches timeline, what to do with the partial file
on
the old timeline? When the timeline changes in the middle of a WAL
segment,
the segment old the old timeline is only half-filled. For example, when
timeline changes from 1 to 2, you'll have this in pg_xlog:

000000010000000000000006
000000010000000000000007
000000010000000000000008
000000020000000000000008
00000002.history

The segment 000000010000000000000008 is only half-filled, as the timeline
changed in the middle of that segment. The beginning portion of that file
is
duplicated in 000000020000000000000008, with the timeline-changing
checkpoint record right after the duplicated portion.

When we stream that with pg_receivexlog, and hit the timeline switch,
we'll
have this situation in the client:

000000010000000000000006
000000010000000000000007
000000010000000000000008.partial

What to do with the partial file? One option is to rename it to
000000010000000000000008. However, if you then kill pg_receivexlog before
it
has finished streaming a full segment from the new timeline, on restart
it
will try to begin streaming WAL segment 000000010000000000000009, because
it
sees that segment 000000010000000000000008 is already completed. That'd
be
wrong.

Can't we rename .partial file safely after we receive a full segment
of the WAL file
with new timeline and the same logid/segmentid?

I'd prefer to leave the .partial suffix in place, as the segment really
isn't complete. It doesn't make a difference when you recover to the latest
timeline, but if you have a more complicated scenario with multiple
timelines that are still "alive", ie. there's a server still actively
generating WAL on that timeline, you'll easily get confused.

As an example, imagine that you have a master server, and one standby. You
maintain a WAL archive for backup purposes with pg_receivexlog, connected to
the standby. Now, for some reason, you get a split-brain situation and the
standby server is promoted with new timeline 2, while the real master is
still running. The DBA notices the problem, and kills the standby and
pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
pg_receivexlog's target directory, and re-points pg_recevexlog to the master
while he re-builds the standby server from backup. At that point,
pg_receivexlog will start streaming from the end of the zero-padded segment,
not knowing that it was partial, and you have a hole in the archived WAL
stream. Oops.

The DBA could avoid that by also removing the last WAL segment on timeline
1, the one that was partial. But it's really not obvious that there's
anything wrong with that segment. Keeping the .partial suffix makes it
clear.

Thanks for elaborating the reason why .partial suffix should be kept.
I agree that keeping the .partial suffix would be safer.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Fujii Masao (#4)
Re: Teaching pg_receivexlog to follow timeline switches

Fujii Masao <masao.fujii@gmail.com> writes:

Thanks for elaborating the reason why .partial suffix should be kept.
I agree that keeping the .partial suffix would be safer.

+1 to both points. So +2 I guess :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Teaching pg_receivexlog to follow timeline switches

On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I'd prefer to leave the .partial suffix in place, as the segment really
isn't complete. It doesn't make a difference when you recover to the latest
timeline, but if you have a more complicated scenario with multiple
timelines that are still "alive", ie. there's a server still actively
generating WAL on that timeline, you'll easily get confused.

As an example, imagine that you have a master server, and one standby. You
maintain a WAL archive for backup purposes with pg_receivexlog, connected to
the standby. Now, for some reason, you get a split-brain situation and the
standby server is promoted with new timeline 2, while the real master is
still running. The DBA notices the problem, and kills the standby and
pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
pg_receivexlog's target directory, and re-points pg_recevexlog to the master
while he re-builds the standby server from backup. At that point,
pg_receivexlog will start streaming from the end of the zero-padded segment,
not knowing that it was partial, and you have a hole in the archived WAL
stream. Oops.

The DBA could avoid that by also removing the last WAL segment on timeline
1, the one that was partial. But it's really not obvious that there's
anything wrong with that segment. Keeping the .partial suffix makes it
clear.

I shudder at the idea that the DBA is manually involved in any of this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Robert Haas (#6)
Re: Teaching pg_receivexlog to follow timeline switches

On 17.01.2013 16:56, Robert Haas wrote:

On Wed, Jan 16, 2013 at 11:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I'd prefer to leave the .partial suffix in place, as the segment really
isn't complete. It doesn't make a difference when you recover to the latest
timeline, but if you have a more complicated scenario with multiple
timelines that are still "alive", ie. there's a server still actively
generating WAL on that timeline, you'll easily get confused.

As an example, imagine that you have a master server, and one standby. You
maintain a WAL archive for backup purposes with pg_receivexlog, connected to
the standby. Now, for some reason, you get a split-brain situation and the
standby server is promoted with new timeline 2, while the real master is
still running. The DBA notices the problem, and kills the standby and
pg_receivexlog. He deletes the XLOG files belonging to timeline 2 in
pg_receivexlog's target directory, and re-points pg_recevexlog to the master
while he re-builds the standby server from backup. At that point,
pg_receivexlog will start streaming from the end of the zero-padded segment,
not knowing that it was partial, and you have a hole in the archived WAL
stream. Oops.

The DBA could avoid that by also removing the last WAL segment on timeline
1, the one that was partial. But it's really not obvious that there's
anything wrong with that segment. Keeping the .partial suffix makes it
clear.

I shudder at the idea that the DBA is manually involved in any of this.

The scenario I described is that you screwed up your failover
environment, and end up with a split-brain situation by accident. The
DBA certainly needs to be involved to recover from that.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#7)
Re: Teaching pg_receivexlog to follow timeline switches

On Thu, Jan 17, 2013 at 9:59 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The scenario I described is that you screwed up your failover environment,
and end up with a split-brain situation by accident. The DBA certainly needs
to be involved to recover from that.

OK, I agree, but I still think a lot of DBAs would have no idea how to
handle that situation. I agree with your proposal, don't get me wrong
- I just think there's still an awful lot of room for operator error
in these more complex replication scenarios. I don't have a clue how
to fix that, and it's certainly not the purpose of this thread to fix
that; I'm just venting.

Actually, I'm really glad to see all the work you've done to improve
the way that some of these scenarios work and eliminate various bugs
and other surprising failure modes over the last couple of months.
It's great stuff. Alas, I think we still some distance from being
able to provide an "easy button".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#8)
Re: Teaching pg_receivexlog to follow timeline switches

Robert Haas escribió:

Actually, I'm really glad to see all the work you've done to improve
the way that some of these scenarios work and eliminate various bugs
and other surprising failure modes over the last couple of months.
It's great stuff.

+1

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Phil Sorber
phil@omniti.com
In reply to: Heikki Linnakangas (#1)
Re: Teaching pg_receivexlog to follow timeline switches

On Tue, Jan 15, 2013 at 9:05 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Now that a standby server can follow timeline switches through streaming
replication, we should do teach pg_receivexlog to do the same. Patch
attached.

Is it possible to re-use walreceiver code from the backend?

I was thinking that it would actually be very useful to have the whole
replication functionality modularized and in a standalone binary that
could act as a replication proxy and WAL archiver that could run
without all the overhead of an entire PG instance.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Phil Sorber (#10)
Re: Teaching pg_receivexlog to follow timeline switches

On 18.01.2013 06:38, Phil Sorber wrote:

On Tue, Jan 15, 2013 at 9:05 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Now that a standby server can follow timeline switches through streaming
replication, we should do teach pg_receivexlog to do the same. Patch
attached.

Is it possible to re-use walreceiver code from the backend?

I was thinking that it would actually be very useful to have the whole
replication functionality modularized and in a standalone binary that
could act as a replication proxy and WAL archiver that could run
without all the overhead of an entire PG instance

There's much sense in trying to extract that into a stand-along module.
src/bin/pg_basebackup/receivelog.c is about 1000 lines of code at the
moment, and it looks quite different from the corresponding code in the
backend, because it doesn't have all the backend infrastructure available.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Phil Sorber
phil@omniti.com
In reply to: Heikki Linnakangas (#11)
Re: Teaching pg_receivexlog to follow timeline switches

On Fri, Jan 18, 2013 at 7:55 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 18.01.2013 06:38, Phil Sorber wrote:

Is it possible to re-use walreceiver code from the backend?

I was thinking that it would actually be very useful to have the whole
replication functionality modularized and in a standalone binary that
could act as a replication proxy and WAL archiver that could run
without all the overhead of an entire PG instance

There's much sense in trying to extract that into a stand-along module.
src/bin/pg_basebackup/receivelog.c is about 1000 lines of code at the
moment, and it looks quite different from the corresponding code in the
backend, because it doesn't have all the backend infrastructure available.

- Heikki

That's fair.

What do you think about the idea of a full WAL proxy? Probably not for
9.3 at this point though.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Noah Misch
noah@leadboat.com
In reply to: Heikki Linnakangas (#1)
Re: Teaching pg_receivexlog to follow timeline switches

This patch was in Needs Review status, but you committed it on 2013-01-17. I
have marked it as such in the CF app.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Phil Sorber (#12)
Re: Teaching pg_receivexlog to follow timeline switches

Phil Sorber <phil@omniti.com> writes:

What do you think about the idea of a full WAL proxy? Probably not for
9.3 at this point though.

I was thinking that a WAL proxy nowadays is called a cascading standby
with local archiving enabled. I'm not sure why you would want to trust
your archiving and WAL relaying to another piece of software…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Dimitri Fontaine (#14)
Re: Teaching pg_receivexlog to follow timeline switches

On 22.01.2013 15:02, Dimitri Fontaine wrote:

Phil Sorber<phil@omniti.com> writes:

What do you think about the idea of a full WAL proxy? Probably not for
9.3 at this point though.

I was thinking that a WAL proxy nowadays is called a cascading standby
with local archiving enabled. I'm not sure why you would want to trust
your archiving and WAL relaying to another piece of software…

You might not want to keep a copy of the whole data directory around, as
you have to in a cascading standby. I can see value in a separate WAL
proxy software, especially if it's integrated into a larger backup
manager program like barman or wal-e.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Heikki Linnakangas (#15)
Re: Teaching pg_receivexlog to follow timeline switches

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

You might not want to keep a copy of the whole data directory around, as you
have to in a cascading standby. I can see value in a separate WAL proxy
software, especially if it's integrated into a larger backup manager program
like barman or wal-e.

+1

I somehow forgot about $PGDATA here. Time for a little break I guess :)

Another idea is to have a daemon mode pg_receivexlog where not only it
can maintain a local archive but also feed it using the replication
protocol to standbies, keeping track of their position.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Phil Sorber
phil@omniti.com
In reply to: Dimitri Fontaine (#16)
Re: Teaching pg_receivexlog to follow timeline switches

On Tue, Jan 22, 2013 at 8:33 AM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

You might not want to keep a copy of the whole data directory around, as you
have to in a cascading standby. I can see value in a separate WAL proxy
software, especially if it's integrated into a larger backup manager program
like barman or wal-e.

+1

I somehow forgot about $PGDATA here. Time for a little break I guess :)

Another idea is to have a daemon mode pg_receivexlog where not only it
can maintain a local archive but also feed it using the replication
protocol to standbies, keeping track of their position.

I'm not sure if i described it well, but that's essentially what I was
asking about. It would have both wal receiving and and wal sending
capability. Along with it's own local WAL storage perhaps governed in
size by a keep_wal_segments and also a longer term archive that you
could have compressed but also pull from with a archive and restore
command. And also be able to act as a synchronous replication peer. I
think it has already been discussed to have pg_receivexlog do that
last one.

So yeah, a cascading standby without $PGDATA or hot_standby or large
shared_buffers resources. It seems like maybe we could add through
subtraction. Add a parameter that disables wal replay? I'm sure
there'd be more things it would have to disable, but then it's not two
separate binaries.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Craig Ringer
craig@2ndQuadrant.com
In reply to: Noah Misch (#13)
Re: Teaching pg_receivexlog to follow timeline switches

On 01/22/2013 06:43 AM, Noah Misch wrote:

This patch was in Needs Review status, but you committed it on 2013-01-17. I
have marked it as such in the CF app.

Thankyou. There's a lot to keep up with :S

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers