WAL format changes

Started by Heikki Linnakangasover 13 years ago31 messages

heikki.linnakangas@enterprisedb.com

over 13 years ago

4 attachment(s)

As I threatened earlier
(http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.com),
here are three patches that change the WAL format. The goal is to change
the format so that when you're inserting a WAL record of a given size,
you know exactly how much space it requires in the WAL.

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

2. Always include the xl_rem_len field, used for continuation records,
in the xlog page header. A continuation log record only contained that
one field, it's now included straight in the page header, so the concept
of a continuation record doesn't exist anymore. Because of alignment,
this wastes 4 bytes on every page that contains continued data from a
previous record, and 8 bytes on pages that don't. That's not very much,
and the next step will buy that back:

3. Allow WAL record header to be split across pages. Per Tom's
suggestion, move xl_tot_len to be the first field in XLogRecord, so that
even if the header is split, xl_tot_len is always on the first page.
xl_crc is moved to be the last field, and xl_prev is the second to last.
This has the advantage that you can calculate the CRC for all the other
fields before acquiring WALInsertLock. For xl_prev, you need to know
where exactly the record is inserted, so it's handy that it's the last
field before CRC. This patch doesn't try to take advantage of that,
however, and I'm not sure if that makes any difference once I finish the
patch to make XLogInsert scale better, which is the ultimate goal of all
this.

Those are the three patches I'd like to get committed in this
commitfest. To see where all this is leading to, I've included a rough
WIP version of the XLogInsert scaling patch. This version is quite
different from the one I posted in spring, it takes advantage of the WAL
format changes, and I'm also experimenting with a different method of
tracking how far each WAL insertion has progressed. But more on that later.

(Note to self: remember to bump XLOG_PAGE_MAGIC)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

1-use-uint64-got-segno.patchtext/x-diff; name=1-use-uint64-got-segno.patchDownload

commit ac8e32d5db90556d1beba00fd251113e17e4b8ce
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Fri Jun 1 13:57:51 2012 +0300

    Use a 64-bit segment number instead of the log/seg combination.
    
    The last segment of each 4GB logical log file is no longer wasted.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d3650bd..fddfbc4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -385,8 +385,7 @@ typedef struct XLogCtlData
 	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
 	TransactionId ckptXid;
 	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
-	uint32		lastRemovedLog; /* latest removed/recycled XLOG segment */
-	uint32		lastRemovedSeg;
+	XLogSegNo	lastRemovedSegNo; /* latest removed/recycled XLOG segment */
 
 	/* Protected by WALWriteLock: */
 	XLogCtlWrite Write;
@@ -491,11 +490,13 @@ static ControlFileData *ControlFile = NULL;
 
 /* Construct XLogRecPtr value for current insertion point */
 #define INSERT_RECPTR(recptr,Insert,curridx)  \
-	( \
-	  (recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid, \
-	  (recptr).xrecoff = \
-		XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert) \
-	)
+	do {																\
+		(recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid;			\
+		(recptr).xrecoff =												\
+			XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert); \
+		if (XLogCtl->xlblocks[curridx].xrecoff == 0)					\
+			(recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid - 1;	\
+	} while(0)
 
 #define PrevBufIdx(idx)		\
 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
@@ -521,12 +522,11 @@ static XLogwrtResult LogwrtResult = {{0, 0}, {0, 0}};
 /*
  * openLogFile is -1 or a kernel FD for an open log file segment.
  * When it's open, openLogOff is the current seek offset in the file.
- * openLogId/openLogSeg identify the segment.  These variables are only
+ * openLogSegNo identifies the segment.  These variables are only
  * used to write the XLOG, and so will normally refer to the active segment.
  */
 static int	openLogFile = -1;
-static uint32 openLogId = 0;
-static uint32 openLogSeg = 0;
+static XLogSegNo openLogSegNo = 0;
 static uint32 openLogOff = 0;
 
 /*
@@ -538,8 +538,7 @@ static uint32 openLogOff = 0;
  * the currently open file from.
  */
 static int	readFile = -1;
-static uint32 readId = 0;
-static uint32 readSeg = 0;
+static XLogSegNo readSegNo = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
 static int	readSource = 0;		/* XLOG_FROM_* code */
@@ -608,13 +607,12 @@ typedef struct xl_restore_point
 
 
 static void XLogArchiveNotify(const char *xlog);
-static void XLogArchiveNotifySeg(uint32 log, uint32 seg);
+static void XLogArchiveNotifySeg(XLogSegNo segno);
 static bool XLogArchiveCheckDone(const char *xlog);
 static bool XLogArchiveIsBusy(const char *xlog);
 static void XLogArchiveCleanup(const char *xlog);
 static void readRecoveryCommandFile(void);
-static void exitArchiveRecovery(TimeLineID endTLI,
-					uint32 endLogId, uint32 endLogSeg);
+static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
 static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
 static void recoveryPausesHere(void);
 static void SetLatestXTime(TimestampTz xtime);
@@ -623,20 +621,19 @@ static void CheckRequiredParameterValues(void);
 static void XLogReportParameters(void);
 static void LocalSetXLogInsertAllowed(void);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
-static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
+static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 
 static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
 				XLogRecPtr *lsn, BkpBlock *bkpb);
 static bool AdvanceXLInsertBuffer(bool new_segment);
-static bool XLogCheckpointNeeded(uint32 logid, uint32 logseg);
+static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
-static bool InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
+static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, int *max_advance,
 					   bool use_lock);
-static int XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
+static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notexistOk);
-static int XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode,
-				   int sources);
+static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources);
 static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 			 bool randAccess);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
@@ -646,7 +643,7 @@ static bool RestoreArchivedFile(char *path, const char *xlogfname,
 static void ExecuteRecoveryCommand(char *command, char *commandName,
 					   bool failOnerror);
 static void PreallocXlogFiles(XLogRecPtr endptr);
-static void RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr);
+static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
@@ -660,8 +657,7 @@ static bool existsTimeLineHistory(TimeLineID probeTLI);
 static bool rescanLatestTimeLine(void);
 static TimeLineID findNewestTimeLine(TimeLineID startTLI);
 static void writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
-					 TimeLineID endTLI,
-					 uint32 endLogId, uint32 endLogSeg);
+					 TimeLineID endTLI, XLogSegNo endLogSegNo);
 static void WriteControlFile(void);
 static void ReadControlFile(void);
 static char *str_time(pg_time_t tnow);
@@ -993,12 +989,6 @@ begin:;
 		LWLockRelease(WALInsertLock);
 
 		RecPtr.xrecoff -= SizeOfXLogLongPHD;
-		if (RecPtr.xrecoff == 0)
-		{
-			/* crossing a logid boundary */
-			RecPtr.xlogid -= 1;
-			RecPtr.xrecoff = XLogFileSize;
-		}
 
 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 		LogwrtResult = XLogCtl->LogwrtResult;
@@ -1145,13 +1135,12 @@ begin:;
 
 		/* Compute end address of old segment */
 		OldSegEnd = XLogCtl->xlblocks[curridx];
-		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
 		if (OldSegEnd.xrecoff == 0)
 		{
 			/* crossing a logid boundary */
 			OldSegEnd.xlogid -= 1;
-			OldSegEnd.xrecoff = XLogFileSize;
 		}
+		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
 
 		/* Make it look like we've written and synced all of old segment */
 		LogwrtResult.Write = OldSegEnd;
@@ -1321,14 +1310,14 @@ XLogArchiveNotify(const char *xlog)
 }
 
 /*
- * Convenience routine to notify using log/seg representation of filename
+ * Convenience routine to notify using segment number representation of filename
  */
 static void
-XLogArchiveNotifySeg(uint32 log, uint32 seg)
+XLogArchiveNotifySeg(XLogSegNo segno)
 {
 	char		xlog[MAXFNAMELEN];
 
-	XLogFileName(xlog, ThisTimeLineID, log, seg);
+	XLogFileName(xlog, ThisTimeLineID, segno);
 	XLogArchiveNotify(xlog);
 }
 
@@ -1465,6 +1454,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 	XLogRecPtr	OldPageRqstPtr;
 	XLogwrtRqst WriteRqst;
 	XLogRecPtr	NewPageEndPtr;
+	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
 
 	/*
@@ -1529,23 +1519,18 @@ AdvanceXLInsertBuffer(bool new_segment)
 	 * Now the next buffer slot is free and we can set it up to be the next
 	 * output page.
 	 */
-	NewPageEndPtr = XLogCtl->xlblocks[Insert->curridx];
+	NewPageBeginPtr = XLogCtl->xlblocks[Insert->curridx];
 
 	if (new_segment)
 	{
 		/* force it to a segment start point */
-		NewPageEndPtr.xrecoff += XLogSegSize - 1;
-		NewPageEndPtr.xrecoff -= NewPageEndPtr.xrecoff % XLogSegSize;
+		if (NewPageBeginPtr.xrecoff % XLogSegSize != 0)
+			XLByteAdvance(NewPageBeginPtr,
+						  XLogSegSize - NewPageBeginPtr.xrecoff % XLogSegSize);
 	}
 
-	if (NewPageEndPtr.xrecoff >= XLogFileSize)
-	{
-		/* crossing a logid boundary */
-		NewPageEndPtr.xlogid += 1;
-		NewPageEndPtr.xrecoff = XLOG_BLCKSZ;
-	}
-	else
-		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
+	NewPageEndPtr = NewPageBeginPtr;
+	XLByteAdvance(NewPageEndPtr, XLOG_BLCKSZ);
 	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
@@ -1567,8 +1552,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 
 	/* NewPage->xlp_info = 0; */	/* done by memset */
 	NewPage   ->xlp_tli = ThisTimeLineID;
-	NewPage   ->xlp_pageaddr.xlogid = NewPageEndPtr.xlogid;
-	NewPage   ->xlp_pageaddr.xrecoff = NewPageEndPtr.xrecoff - XLOG_BLCKSZ;
+	NewPage   ->xlp_pageaddr = NewPageBeginPtr;
 
 	/*
 	 * If online backup is not in progress, mark the header to indicate that
@@ -1606,33 +1590,20 @@ AdvanceXLInsertBuffer(bool new_segment)
 /*
  * Check whether we've consumed enough xlog space that a checkpoint is needed.
  *
- * logid/logseg indicate a log file that has just been filled up (or read
- * during recovery). We measure the distance from RedoRecPtr to logid/logseg
+ * new_segno indicates a log file that has just been filled up (or read
+ * during recovery). We measure the distance from RedoRecPtr to new_segno
  * and see if that exceeds CheckPointSegments.
  *
  * Note: it is caller's responsibility that RedoRecPtr is up-to-date.
  */
 static bool
-XLogCheckpointNeeded(uint32 logid, uint32 logseg)
+XLogCheckpointNeeded(XLogSegNo new_segno)
 {
-	/*
-	 * A straight computation of segment number could overflow 32 bits. Rather
-	 * than assuming we have working 64-bit arithmetic, we compare the
-	 * highest-order bits separately, and force a checkpoint immediately when
-	 * they change.
-	 */
-	uint32		old_segno,
-				new_segno;
-	uint32		old_highbits,
-				new_highbits;
-
-	old_segno = (RedoRecPtr.xlogid % XLogSegSize) * XLogSegsPerFile +
-		(RedoRecPtr.xrecoff / XLogSegSize);
-	old_highbits = RedoRecPtr.xlogid / XLogSegSize;
-	new_segno = (logid % XLogSegSize) * XLogSegsPerFile + logseg;
-	new_highbits = logid / XLogSegSize;
-	if (new_highbits != old_highbits ||
-		new_segno >= old_segno + (uint32) (CheckPointSegments - 1))
+	XLogSegNo	old_segno;
+
+	XLByteToSeg(RedoRecPtr, old_segno);
+
+	if (new_segno >= old_segno + (uint64) (CheckPointSegments - 1))
 		return true;
 	return false;
 }
@@ -1713,7 +1684,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
 		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
 
-		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
+		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
 		{
 			/*
 			 * Switch to new logfile segment.  We cannot have any pending
@@ -1722,20 +1693,19 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			Assert(npages == 0);
 			if (openLogFile >= 0)
 				XLogFileClose();
-			XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
+			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogId, openLogSeg,
-									   &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
 			openLogOff = 0;
 		}
 
 		/* Make sure we have the current logfile open */
 		if (openLogFile < 0)
 		{
-			XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
-			openLogFile = XLogFileOpen(openLogId, openLogSeg);
+			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
+			openLogFile = XLogFileOpen(openLogSegNo);
 			openLogOff = 0;
 		}
 
@@ -1772,9 +1742,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 				if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
-							 errmsg("could not seek in log file %u, "
-									"segment %u to offset %u: %m",
-									openLogId, openLogSeg, startoffset)));
+							 errmsg("could not seek in log file %s to offset %u: %m",
+									XLogFileNameP(ThisTimeLineID, openLogSegNo),
+									startoffset)));
 				openLogOff = startoffset;
 			}
 
@@ -1789,9 +1759,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 					errno = ENOSPC;
 				ereport(PANIC,
 						(errcode_for_file_access(),
-						 errmsg("could not write to log file %u, segment %u "
+						 errmsg("could not write to log file %s "
 								"at offset %u, length %lu: %m",
-								openLogId, openLogSeg,
+								XLogFileNameP(ThisTimeLineID, openLogSegNo),
 								openLogOff, (unsigned long) nbytes)));
 			}
 
@@ -1818,11 +1788,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			 */
 			if (finishing_seg || (xlog_switch && last_iteration))
 			{
-				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
+				issue_xlog_fsync(openLogFile, openLogSegNo);
 				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
 
 				if (XLogArchivingActive())
-					XLogArchiveNotifySeg(openLogId, openLogSeg);
+					XLogArchiveNotifySeg(openLogSegNo);
 
 				Write->lastSegSwitchTime = (pg_time_t) time(NULL);
 
@@ -1833,11 +1803,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 				 * date; if it looks like a checkpoint is needed, forcibly
 				 * update RedoRecPtr and recheck.
 				 */
-				if (IsUnderPostmaster &&
-					XLogCheckpointNeeded(openLogId, openLogSeg))
+				if (IsUnderPostmaster && XLogCheckpointNeeded(openLogSegNo))
 				{
 					(void) GetRedoRecPtr();
-					if (XLogCheckpointNeeded(openLogId, openLogSeg))
+					if (XLogCheckpointNeeded(openLogSegNo))
 						RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
 				}
 			}
@@ -1874,15 +1843,15 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
 			if (openLogFile >= 0 &&
-				!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
+				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
 				XLogFileClose();
 			if (openLogFile < 0)
 			{
-				XLByteToPrevSeg(LogwrtResult.Write, openLogId, openLogSeg);
-				openLogFile = XLogFileOpen(openLogId, openLogSeg);
+				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
+				openLogFile = XLogFileOpen(openLogSegNo);
 				openLogOff = 0;
 			}
-			issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
+			issue_xlog_fsync(openLogFile, openLogSegNo);
 		}
 		LogwrtResult.Flush = LogwrtResult.Write;
 	}
@@ -2126,6 +2095,8 @@ XLogFlush(XLogRecPtr record)
 				else
 				{
 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
+					if (WriteRqstPtr.xrecoff == 0)
+						WriteRqstPtr.xlogid--;
 					WriteRqstPtr.xrecoff -= freespace;
 				}
 				LWLockRelease(WALInsertLock);
@@ -2237,7 +2208,7 @@ XLogBackgroundFlush(void)
 	{
 		if (openLogFile >= 0)
 		{
-			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
+			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
 			{
 				XLogFileClose();
 			}
@@ -2361,19 +2332,17 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(uint32 log, uint32 seg,
-			 bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	char	   *zbuffer;
-	uint32		installed_log;
-	uint32		installed_seg;
+	XLogSegNo	installed_segno;
 	int			max_advance;
 	int			fd;
 	int			nbytes;
 
-	XLogFilePath(path, ThisTimeLineID, log, seg);
+	XLogFilePath(path, ThisTimeLineID, logsegno);
 
 	/*
 	 * Try to use existent file (checkpoint maker may have created it already)
@@ -2387,8 +2356,7 @@ XLogFileInit(uint32 log, uint32 seg,
 			if (errno != ENOENT)
 				ereport(ERROR,
 						(errcode_for_file_access(),
-						 errmsg("could not open file \"%s\" (log file %u, segment %u): %m",
-								path, log, seg)));
+						 errmsg("could not open file \"%s\": %m", path)));
 		}
 		else
 			return fd;
@@ -2467,10 +2435,9 @@ XLogFileInit(uint32 log, uint32 seg,
 	 * has created the file while we were filling ours: if so, use ours to
 	 * pre-create a future log segment.
 	 */
-	installed_log = log;
-	installed_seg = seg;
+	installed_segno = logsegno;
 	max_advance = XLOGfileslop;
-	if (!InstallXLogFileSegment(&installed_log, &installed_seg, tmppath,
+	if (!InstallXLogFileSegment(&installed_segno, tmppath,
 								*use_existent, &max_advance,
 								use_lock))
 	{
@@ -2491,8 +2458,7 @@ XLogFileInit(uint32 log, uint32 seg,
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-		   errmsg("could not open file \"%s\" (log file %u, segment %u): %m",
-				  path, log, seg)));
+		   errmsg("could not open file \"%s\": %m", path)));
 
 	elog(DEBUG2, "done creating and filling new WAL file");
 
@@ -2512,8 +2478,7 @@ XLogFileInit(uint32 log, uint32 seg,
  * emplacing a bogus file.
  */
 static void
-XLogFileCopy(uint32 log, uint32 seg,
-			 TimeLineID srcTLI, uint32 srclog, uint32 srcseg)
+XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -2525,7 +2490,7 @@ XLogFileCopy(uint32 log, uint32 seg,
 	/*
 	 * Open the source file
 	 */
-	XLogFilePath(path, srcTLI, srclog, srcseg);
+	XLogFilePath(path, srcTLI, srcsegno);
 	srcfd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
 	if (srcfd < 0)
 		ereport(ERROR,
@@ -2596,7 +2561,7 @@ XLogFileCopy(uint32 log, uint32 seg,
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&log, &seg, tmppath, false, NULL, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, NULL, false))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -2630,14 +2595,14 @@ XLogFileCopy(uint32 log, uint32 seg,
  * file into place.
  */
 static bool
-InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
+InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, int *max_advance,
 					   bool use_lock)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
 
-	XLogFilePath(path, ThisTimeLineID, *log, *seg);
+	XLogFilePath(path, ThisTimeLineID, *segno);
 
 	/*
 	 * We want to be sure that only one process does this at a time.
@@ -2662,9 +2627,9 @@ InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
 					LWLockRelease(ControlFileLock);
 				return false;
 			}
-			NextLogSeg(*log, *seg);
+			(*segno)++;
 			(*max_advance)--;
-			XLogFilePath(path, ThisTimeLineID, *log, *seg);
+			XLogFilePath(path, ThisTimeLineID, *segno);
 		}
 	}
 
@@ -2680,8 +2645,8 @@ InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
 			LWLockRelease(ControlFileLock);
 		ereport(LOG,
 				(errcode_for_file_access(),
-				 errmsg("could not link file \"%s\" to \"%s\" (initialization of log file %u, segment %u): %m",
-						tmppath, path, *log, *seg)));
+				 errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
+						tmppath, path)));
 		return false;
 	}
 	unlink(tmppath);
@@ -2692,8 +2657,8 @@ InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
 			LWLockRelease(ControlFileLock);
 		ereport(LOG,
 				(errcode_for_file_access(),
-				 errmsg("could not rename file \"%s\" to \"%s\" (initialization of log file %u, segment %u): %m",
-						tmppath, path, *log, *seg)));
+				 errmsg("could not rename file \"%s\" to \"%s\" (initialization of log file): %m",
+						tmppath, path)));
 		return false;
 	}
 #endif
@@ -2708,20 +2673,19 @@ InstallXLogFileSegment(uint32 *log, uint32 *seg, char *tmppath,
  * Open a pre-existing logfile segment for writing.
  */
 int
-XLogFileOpen(uint32 log, uint32 seg)
+XLogFileOpen(XLogSegNo segno)
 {
 	char		path[MAXPGPATH];
 	int			fd;
 
-	XLogFilePath(path, ThisTimeLineID, log, seg);
+	XLogFilePath(path, ThisTimeLineID, segno);
 
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
 					   S_IRUSR | S_IWUSR);
 	if (fd < 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
-		   errmsg("could not open file \"%s\" (log file %u, segment %u): %m",
-				  path, log, seg)));
+				 errmsg("could not open xlog file \"%s\": %m", path)));
 
 	return fd;
 }
@@ -2733,7 +2697,7 @@ XLogFileOpen(uint32 log, uint32 seg)
  * Otherwise, it's assumed to be already available in pg_xlog.
  */
 static int
-XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
+XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notfoundOk)
 {
 	char		xlogfname[MAXFNAMELEN];
@@ -2741,7 +2705,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 	char		path[MAXPGPATH];
 	int			fd;
 
-	XLogFileName(xlogfname, tli, log, seg);
+	XLogFileName(xlogfname, tli, segno);
 
 	switch (source)
 	{
@@ -2760,7 +2724,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 
 		case XLOG_FROM_PG_XLOG:
 		case XLOG_FROM_STREAM:
-			XLogFilePath(path, tli, log, seg);
+			XLogFilePath(path, tli, segno);
 			restoredFromArchive = false;
 			break;
 
@@ -2781,7 +2745,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 		bool			reload = false;
 		struct stat		statbuf;
 
-		XLogFilePath(xlogfpath, tli, log, seg);
+		XLogFilePath(xlogfpath, tli, segno);
 		if (stat(xlogfpath, &statbuf) == 0)
 		{
 			if (unlink(xlogfpath) != 0)
@@ -2810,8 +2774,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 		 * shmem. It's used as current standby flush position, and cascading
 		 * walsenders try to send WAL records up to this location.
 		 */
-		endptr.xlogid = log;
-		endptr.xrecoff = seg * XLogSegSize;
+		XLogSegNoOffsetToRecPtr(segno, 0, endptr);
 		XLByteAdvance(endptr, XLogSegSize);
 
 		SpinLockAcquire(&xlogctl->info_lck);
@@ -2846,8 +2809,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 	if (errno != ENOENT || !notfoundOk) /* unexpected failure? */
 		ereport(PANIC,
 				(errcode_for_file_access(),
-		   errmsg("could not open file \"%s\" (log file %u, segment %u): %m",
-				  path, log, seg)));
+				 errmsg("could not open file \"%s\": %m", path)));
 	return -1;
 }
 
@@ -2857,7 +2819,7 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
  * This version searches for the segment with any TLI listed in expectedTLIs.
  */
 static int
-XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
+XLogFileReadAnyTLI(XLogSegNo segno, int emode, int sources)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -2882,7 +2844,7 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
 
 		if (sources & XLOG_FROM_ARCHIVE)
 		{
-			fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
+			fd = XLogFileRead(segno, emode, tli, XLOG_FROM_ARCHIVE, true);
 			if (fd != -1)
 			{
 				elog(DEBUG1, "got WAL segment from archive");
@@ -2892,19 +2854,18 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
 
 		if (sources & XLOG_FROM_PG_XLOG)
 		{
-			fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
+			fd = XLogFileRead(segno, emode, tli, XLOG_FROM_PG_XLOG, true);
 			if (fd != -1)
 				return fd;
 		}
 	}
 
 	/* Couldn't find it.  For simplicity, complain about front timeline */
-	XLogFilePath(path, recoveryTargetTLI, log, seg);
+	XLogFilePath(path, recoveryTargetTLI, segno);
 	errno = ENOENT;
 	ereport(emode,
 			(errcode_for_file_access(),
-		   errmsg("could not open file \"%s\" (log file %u, segment %u): %m",
-				  path, log, seg)));
+			 errmsg("could not open file \"%s\": %m", path)));
 	return -1;
 }
 
@@ -2930,8 +2891,8 @@ XLogFileClose(void)
 	if (close(openLogFile))
 		ereport(PANIC,
 				(errcode_for_file_access(),
-				 errmsg("could not close log file %u, segment %u: %m",
-						openLogId, openLogSeg)));
+				 errmsg("could not close log file %s: %m",
+						XLogFileNameP(ThisTimeLineID, openLogSegNo))));
 	openLogFile = -1;
 }
 
@@ -2962,8 +2923,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	int			rc;
 	bool		signaled;
 	struct stat stat_buf;
-	uint32		restartLog;
-	uint32		restartSeg;
+	XLogSegNo	restartSegNo;
 
 	/* In standby mode, restore_command might not be supplied */
 	if (recoveryRestoreCommand == NULL)
@@ -3032,16 +2992,15 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	 */
 	if (InRedo)
 	{
-		XLByteToSeg(ControlFile->checkPointCopy.redo,
-					restartLog, restartSeg);
+		XLByteToSeg(ControlFile->checkPointCopy.redo, restartSegNo);
 		XLogFileName(lastRestartPointFname,
 					 ControlFile->checkPointCopy.ThisTimeLineID,
-					 restartLog, restartSeg);
+					 restartSegNo);
 		/* we shouldn't need anything earlier than last restart point */
 		Assert(strcmp(lastRestartPointFname, xlogfname) <= 0);
 	}
 	else
-		XLogFileName(lastRestartPointFname, 0, 0, 0);
+		XLogFileName(lastRestartPointFname, 0, 0L);
 
 	/*
 	 * construct the command to be executed
@@ -3236,8 +3195,7 @@ ExecuteRecoveryCommand(char *command, char *commandName, bool failOnSignal)
 	const char *sp;
 	int			rc;
 	bool		signaled;
-	uint32		restartLog;
-	uint32		restartSeg;
+	XLogSegNo	restartSegNo;
 
 	Assert(command && commandName);
 
@@ -3247,11 +3205,10 @@ ExecuteRecoveryCommand(char *command, char *commandName, bool failOnSignal)
 	 * archive, though there is no requirement to do so.
 	 */
 	LWLockAcquire(ControlFileLock, LW_SHARED);
-	XLByteToSeg(ControlFile->checkPointCopy.redo,
-				restartLog, restartSeg);
+	XLByteToSeg(ControlFile->checkPointCopy.redo, restartSegNo);
 	XLogFileName(lastRestartPointFname,
 				 ControlFile->checkPointCopy.ThisTimeLineID,
-				 restartLog, restartSeg);
+				 restartSegNo);
 	LWLockRelease(ControlFileLock);
 
 	/*
@@ -3332,18 +3289,17 @@ ExecuteRecoveryCommand(char *command, char *commandName, bool failOnSignal)
 static void
 PreallocXlogFiles(XLogRecPtr endptr)
 {
-	uint32		_logId;
-	uint32		_logSeg;
+	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		use_existent;
 
-	XLByteToPrevSeg(endptr, _logId, _logSeg);
+	XLByteToPrevSeg(endptr, _logSegNo);
 	if ((endptr.xrecoff - 1) % XLogSegSize >=
 		(uint32) (0.75 * XLogSegSize))
 	{
-		NextLogSeg(_logId, _logSeg);
+		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logId, _logSeg, &use_existent, true);
+		lf = XLogFileInit(_logSegNo, &use_existent, true);
 		close(lf);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
@@ -3355,14 +3311,13 @@ PreallocXlogFiles(XLogRecPtr endptr)
  * Returns 0/0 if no WAL segments have been removed since startup.
  */
 void
-XLogGetLastRemoved(uint32 *log, uint32 *seg)
+XLogGetLastRemoved(XLogSegNo *segno)
 {
 	/* use volatile pointer to prevent code rearrangement */
 	volatile XLogCtlData *xlogctl = XLogCtl;
 
 	SpinLockAcquire(&xlogctl->info_lck);
-	*log = xlogctl->lastRemovedLog;
-	*seg = xlogctl->lastRemovedSeg;
+	*segno = xlogctl->lastRemovedSegNo;
 	SpinLockRelease(&xlogctl->info_lck);
 }
 
@@ -3375,19 +3330,14 @@ UpdateLastRemovedPtr(char *filename)
 {
 	/* use volatile pointer to prevent code rearrangement */
 	volatile XLogCtlData *xlogctl = XLogCtl;
-	uint32		tli,
-				log,
-				seg;
+	uint32		tli;
+	XLogSegNo	segno;
 
-	XLogFromFileName(filename, &tli, &log, &seg);
+	XLogFromFileName(filename, &tli, &segno);
 
 	SpinLockAcquire(&xlogctl->info_lck);
-	if (log > xlogctl->lastRemovedLog ||
-		(log == xlogctl->lastRemovedLog && seg > xlogctl->lastRemovedSeg))
-	{
-		xlogctl->lastRemovedLog = log;
-		xlogctl->lastRemovedSeg = seg;
-	}
+	if (segno > xlogctl->lastRemovedSegNo)
+		xlogctl->lastRemovedSegNo = segno;
 	SpinLockRelease(&xlogctl->info_lck);
 }
 
@@ -3398,10 +3348,9 @@ UpdateLastRemovedPtr(char *filename)
  * whether we want to recycle rather than delete no-longer-wanted log files.
  */
 static void
-RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr)
+RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
 {
-	uint32		endlogId;
-	uint32		endlogSeg;
+	XLogSegNo	endlogSegNo;
 	int			max_advance;
 	DIR		   *xldir;
 	struct dirent *xlde;
@@ -3417,7 +3366,7 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr)
 	 * Initialize info about where to try to recycle to.  We allow recycling
 	 * segments up to XLOGfileslop segments beyond the current XLOG location.
 	 */
-	XLByteToPrevSeg(endptr, endlogId, endlogSeg);
+	XLByteToPrevSeg(endptr, endlogSegNo);
 	max_advance = XLOGfileslop;
 
 	xldir = AllocateDir(XLOGDIR);
@@ -3427,7 +3376,7 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr)
 				 errmsg("could not open transaction log directory \"%s\": %m",
 						XLOGDIR)));
 
-	XLogFileName(lastoff, ThisTimeLineID, log, seg);
+	XLogFileName(lastoff, ThisTimeLineID, segno);
 
 	elog(DEBUG2, "attempting to remove WAL segments older than log file %s",
 		 lastoff);
@@ -3463,7 +3412,7 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr)
 				 * separate archive directory.
 				 */
 				if (lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
-					InstallXLogFileSegment(&endlogId, &endlogSeg, path,
+					InstallXLogFileSegment(&endlogSegNo, path,
 										   true, &max_advance, true))
 				{
 					ereport(DEBUG2,
@@ -3473,7 +3422,7 @@ RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr)
 					/* Needn't recheck that slot on future iterations */
 					if (max_advance > 0)
 					{
-						NextLogSeg(endlogId, endlogSeg);
+						endlogSegNo++;
 						max_advance--;
 					}
 				}
@@ -3812,13 +3761,6 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 		if (XLOG_BLCKSZ - (RecPtr->xrecoff % XLOG_BLCKSZ) < SizeOfXLogRecord)
 			NextLogPage(*RecPtr);
 
-		/* Check for crossing of xlog segment boundary */
-		if (RecPtr->xrecoff >= XLogFileSize)
-		{
-			(RecPtr->xlogid)++;
-			RecPtr->xrecoff = 0;
-		}
-
 		/*
 		 * If at page start, we must skip over the page header.  But we can't
 		 * do that until we've read in the page, since the header size is
@@ -4002,12 +3944,7 @@ retry:
 		for (;;)
 		{
 			/* Calculate pointer to beginning of next page */
-			pagelsn.xrecoff += XLOG_BLCKSZ;
-			if (pagelsn.xrecoff >= XLogFileSize)
-			{
-				(pagelsn.xlogid)++;
-				pagelsn.xrecoff = 0;
-			}
+			XLByteAdvance(pagelsn, XLOG_BLCKSZ);
 			/* Wait for the next page to become available */
 			if (!XLogPageRead(&pagelsn, emode, false, false))
 				return NULL;
@@ -4016,8 +3953,9 @@ retry:
 			if (!(((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				ereport(emode_for_corrupt_record(emode, *RecPtr),
-						(errmsg("there is no contrecord flag in log file %u, segment %u, offset %u",
-								readId, readSeg, readOff)));
+						(errmsg("there is no contrecord flag in log segment %s, offset %u",
+								XLogFileNameP(curFileTLI, readSegNo),
+								readOff)));
 				goto next_record_is_invalid;
 			}
 			pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
@@ -4025,10 +3963,13 @@ retry:
 			if (contrecord->xl_rem_len == 0 ||
 				total_len != (contrecord->xl_rem_len + gotlen))
 			{
+				char fname[MAXFNAMELEN];
+				XLogFileName(fname, curFileTLI, readSegNo);
 				ereport(emode_for_corrupt_record(emode, *RecPtr),
-						(errmsg("invalid contrecord length %u in log file %u, segment %u, offset %u",
+						(errmsg("invalid contrecord length %u in log segment %s, offset %u",
 								contrecord->xl_rem_len,
-								readId, readSeg, readOff)));
+								XLogFileNameP(curFileTLI, readSegNo),
+								readOff)));
 				goto next_record_is_invalid;
 			}
 			len = XLOG_BLCKSZ - pageHeaderSize - SizeOfXLogContRecord;
@@ -4046,11 +3987,11 @@ retry:
 		if (!RecordIsValid(record, *RecPtr, emode))
 			goto next_record_is_invalid;
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
-		EndRecPtr.xlogid = readId;
-		EndRecPtr.xrecoff = readSeg * XLogSegSize + readOff +
-			pageHeaderSize +
-			MAXALIGN(SizeOfXLogContRecord + contrecord->xl_rem_len);
-
+		XLogSegNoOffsetToRecPtr(
+			readSegNo,
+			readOff + pageHeaderSize +
+				MAXALIGN(SizeOfXLogContRecord + contrecord->xl_rem_len),
+			EndRecPtr);
 		ReadRecPtr = *RecPtr;
 		/* needn't worry about XLOG SWITCH, it can't cross page boundaries */
 		return record;
@@ -4110,21 +4051,24 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 {
 	XLogRecPtr	recaddr;
 
-	recaddr.xlogid = readId;
-	recaddr.xrecoff = readSeg * XLogSegSize + readOff;
+	XLogSegNoOffsetToRecPtr(readSegNo, readOff, recaddr);
 
 	if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
-				(errmsg("invalid magic number %04X in log file %u, segment %u, offset %u",
-						hdr->xlp_magic, readId, readSeg, readOff)));
+				(errmsg("invalid magic number %04X in log segment %s, offset %u",
+						hdr->xlp_magic,
+						XLogFileNameP(curFileTLI, readSegNo),
+						readOff)));
 		return false;
 	}
 	if ((hdr->xlp_info & ~XLP_ALL_FLAGS) != 0)
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
-				(errmsg("invalid info bits %04X in log file %u, segment %u, offset %u",
-						hdr->xlp_info, readId, readSeg, readOff)));
+				(errmsg("invalid info bits %04X in log segment %s, offset %u",
+						hdr->xlp_info,
+						XLogFileNameP(curFileTLI, readSegNo),
+						readOff)));
 		return false;
 	}
 	if (hdr->xlp_info & XLP_LONG_HEADER)
@@ -4169,17 +4113,20 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	{
 		/* hmm, first page of file doesn't have a long header? */
 		ereport(emode_for_corrupt_record(emode, recaddr),
-				(errmsg("invalid info bits %04X in log file %u, segment %u, offset %u",
-						hdr->xlp_info, readId, readSeg, readOff)));
+				(errmsg("invalid info bits %04X in log segment %s, offset %u",
+						hdr->xlp_info,
+						XLogFileNameP(curFileTLI, readSegNo),
+						readOff)));
 		return false;
 	}
 
 	if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
-				(errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
+				(errmsg("unexpected pageaddr %X/%X in log segment %s, offset %u",
 						hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
-						readId, readSeg, readOff)));
+						XLogFileNameP(curFileTLI, readSegNo),
+						readOff)));
 		return false;
 	}
 
@@ -4189,9 +4136,10 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	if (!list_member_int(expectedTLIs, (int) hdr->xlp_tli))
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
-				(errmsg("unexpected timeline ID %u in log file %u, segment %u, offset %u",
+				(errmsg("unexpected timeline ID %u in log segment %s, offset %u",
 						hdr->xlp_tli,
-						readId, readSeg, readOff)));
+						XLogFileNameP(curFileTLI, readSegNo),
+						readOff)));
 		return false;
 	}
 
@@ -4207,9 +4155,10 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 	if (hdr->xlp_tli < lastPageTLI)
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
-				(errmsg("out-of-sequence timeline ID %u (after %u) in log file %u, segment %u, offset %u",
+				(errmsg("out-of-sequence timeline ID %u (after %u) in log segment %s, offset %u",
 						hdr->xlp_tli, lastPageTLI,
-						readId, readSeg, readOff)));
+						XLogFileNameP(curFileTLI, readSegNo),
+						readOff)));
 		return false;
 	}
 	lastPageTLI = hdr->xlp_tli;
@@ -4456,7 +4405,7 @@ findNewestTimeLine(TimeLineID startTLI)
  */
 static void
 writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
-					 TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
+					 TimeLineID endTLI, XLogSegNo endLogSegNo)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -4546,7 +4495,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * If we did have a parent file, insert an extra newline just in case the
 	 * parent file failed to end with one.
 	 */
-	XLogFileName(xlogfname, endTLI, endLogId, endLogSeg);
+	XLogFileName(xlogfname, endTLI, endLogSegNo);
 
 	/*
 	 * Write comment to history file to explain why and where timeline
@@ -5232,7 +5181,7 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(0, 1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent, false);
 
 	/* Write the first page with the initial record */
 	errno = 0;
@@ -5543,7 +5492,7 @@ readRecoveryCommandFile(void)
  * Exit archive-recovery state
  */
 static void
-exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
+exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
 {
 	char		recoveryPath[MAXPGPATH];
 	char		xlogpath[MAXPGPATH];
@@ -5579,12 +5528,11 @@ exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
 	 */
 	if (endTLI != ThisTimeLineID)
 	{
-		XLogFileCopy(endLogId, endLogSeg,
-					 endTLI, endLogId, endLogSeg);
+		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
 
 		if (XLogArchivingActive())
 		{
-			XLogFileName(xlogpath, endTLI, endLogId, endLogSeg);
+			XLogFileName(xlogpath, endTLI, endLogSegNo);
 			XLogArchiveNotify(xlogpath);
 		}
 	}
@@ -5593,7 +5541,7 @@ exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
 	 * Let's just make real sure there are not .ready or .done flags posted
 	 * for the new segment.
 	 */
-	XLogFileName(xlogpath, ThisTimeLineID, endLogId, endLogSeg);
+	XLogFileName(xlogpath, ThisTimeLineID, endLogSegNo);
 	XLogArchiveCleanup(xlogpath);
 
 	/*
@@ -5993,8 +5941,7 @@ StartupXLOG(void)
 	XLogRecPtr	RecPtr,
 				checkPointLoc,
 				EndOfLog;
-	uint32		endLogId;
-	uint32		endLogSeg;
+	XLogSegNo	endLogSegNo;
 	XLogRecord *record;
 	uint32		freespace;
 	TransactionId oldestActiveXID;
@@ -6720,7 +6667,7 @@ StartupXLOG(void)
 	 */
 	record = ReadRecord(&LastRec, PANIC, false);
 	EndOfLog = EndRecPtr;
-	XLByteToPrevSeg(EndOfLog, endLogId, endLogSeg);
+	XLByteToPrevSeg(EndOfLog, endLogSegNo);
 
 	/*
 	 * Complain if we did not roll forward far enough to render the backup
@@ -6785,7 +6732,7 @@ StartupXLOG(void)
 		ereport(LOG,
 				(errmsg("selected new timeline ID: %u", ThisTimeLineID)));
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
-							 curFileTLI, endLogId, endLogSeg);
+							 curFileTLI, endLogSegNo);
 	}
 
 	/* Save the selected TimeLineID in shared memory, too */
@@ -6798,20 +6745,19 @@ StartupXLOG(void)
 	 * we will use that below.)
 	 */
 	if (InArchiveRecovery)
-		exitArchiveRecovery(curFileTLI, endLogId, endLogSeg);
+		exitArchiveRecovery(curFileTLI, endLogSegNo);
 
 	/*
 	 * Prepare to write WAL starting at EndOfLog position, and init xlog
 	 * buffer cache using the block containing the last record from the
 	 * previous incarnation.
 	 */
-	openLogId = endLogId;
-	openLogSeg = endLogSeg;
-	openLogFile = XLogFileOpen(openLogId, openLogSeg);
+	openLogSegNo = endLogSegNo;
+	openLogFile = XLogFileOpen(openLogSegNo);
 	openLogOff = 0;
 	Insert = &XLogCtl->Insert;
 	Insert->PrevRecord = LastRec;
-	XLogCtl->xlblocks[0].xlogid = openLogId;
+	XLogCtl->xlblocks[0].xlogid = (openLogSegNo * XLOG_SEG_SIZE) >> 32;
 	XLogCtl->xlblocks[0].xrecoff =
 		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
 
@@ -7632,12 +7578,9 @@ CreateCheckPoint(int flags)
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	XLogRecData rdata;
 	uint32		freespace;
-	uint32		_logId;
-	uint32		_logSeg;
-	uint32		redo_logId;
-	uint32		redo_logSeg;
-	uint32		insert_logId;
-	uint32		insert_logSeg;
+	XLogSegNo	_logSegNo;
+	XLogSegNo	redo_logSegNo;
+	XLogSegNo	insert_logSegNo;
 	TransactionId *inCommitXids;
 	int			nInCommit;
 
@@ -7735,10 +7678,9 @@ CreateCheckPoint(int flags)
 		XLogRecPtr	curInsert;
 
 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
-		XLByteToSeg(curInsert, insert_logId, insert_logSeg);
-		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logId, redo_logSeg);
-		if (insert_logId == redo_logId &&
-			insert_logSeg == redo_logSeg)
+		XLByteToSeg(curInsert, insert_logSegNo);
+		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logSegNo);
+		if (insert_logSegNo == redo_logSegNo)
 		{
 			LWLockRelease(WALInsertLock);
 			LWLockRelease(CheckpointLock);
@@ -7938,7 +7880,7 @@ CreateCheckPoint(int flags)
 	 * Select point at which we can truncate the log, which we base on the
 	 * prior checkpoint's earliest info.
 	 */
-	XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
+	XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
 
 	/*
 	 * Update the control file.
@@ -7981,11 +7923,11 @@ CreateCheckPoint(int flags)
 	 * Delete old log files (those no longer needed even for previous
 	 * checkpoint or the standbys in XLOG streaming).
 	 */
-	if (_logId || _logSeg)
+	if (_logSegNo)
 	{
-		KeepLogSeg(recptr, &_logId, &_logSeg);
-		PrevLogSeg(_logId, _logSeg);
-		RemoveOldXlogFiles(_logId, _logSeg, recptr);
+		KeepLogSeg(recptr, &_logSegNo);
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, recptr);
 	}
 
 	/*
@@ -8116,8 +8058,7 @@ CreateRestartPoint(int flags)
 {
 	XLogRecPtr	lastCheckPointRecPtr;
 	CheckPoint	lastCheckPoint;
-	uint32		_logId;
-	uint32		_logSeg;
+	XLogSegNo	_logSegNo;
 	TimestampTz xtime;
 
 	/* use volatile pointer to prevent code rearrangement */
@@ -8215,7 +8156,7 @@ CreateRestartPoint(int flags)
 	 * Select point at which we can truncate the xlog, which we base on the
 	 * prior checkpoint's earliest info.
 	 */
-	XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
+	XLByteToSeg(ControlFile->checkPointCopy.redo, _logSegNo);
 
 	/*
 	 * Update pg_control, using current time.  Check that it still shows
@@ -8242,16 +8183,16 @@ CreateRestartPoint(int flags)
 	 * checkpoint/restartpoint) to prevent the disk holding the xlog from
 	 * growing full.
 	 */
-	if (_logId || _logSeg)
+	if (_logSegNo)
 	{
 		XLogRecPtr	endptr;
 
 		/* Get the current (or recent) end of xlog */
 		endptr = GetStandbyFlushRecPtr();
 
-		KeepLogSeg(endptr, &_logId, &_logSeg);
-		PrevLogSeg(_logId, _logSeg);
-		RemoveOldXlogFiles(_logId, _logSeg, endptr);
+		KeepLogSeg(endptr, &_logSegNo);
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, endptr);
 
 		/*
 		 * Make more log segments if needed.  (Do this after recycling old log
@@ -8299,42 +8240,24 @@ CreateRestartPoint(int flags)
  * the given xlog location, recptr.
  */
 static void
-KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg)
+KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 {
-	uint32		log;
-	uint32		seg;
-	int			d_log;
-	int			d_seg;
+	XLogSegNo	segno;
 
 	if (wal_keep_segments == 0)
 		return;
 
-	XLByteToSeg(recptr, log, seg);
+	XLByteToSeg(recptr, segno);
 
-	d_seg = wal_keep_segments % XLogSegsPerFile;
-	d_log = wal_keep_segments / XLogSegsPerFile;
-	if (seg < d_seg)
-	{
-		d_log += 1;
-		seg = seg - d_seg + XLogSegsPerFile;
-	}
-	else
-		seg = seg - d_seg;
-	/* avoid underflow, don't go below (0,1) */
-	if (log < d_log || (log == d_log && seg == 0))
-	{
-		log = 0;
-		seg = 1;
-	}
+	/* avoid underflow, don't go below 1 */
+	if (segno <= wal_keep_segments)
+		segno = 1;
 	else
-		log = log - d_log;
+		segno = *logSegNo - wal_keep_segments;
 
 	/* don't delete WAL segments newer than the calculated segment */
-	if (log < *logId || (log == *logId && seg < *logSeg))
-	{
-		*logId = log;
-		*logSeg = seg;
-	}
+	if (segno < *logSegNo)
+		*logSegNo = segno;
 }
 
 /*
@@ -8999,8 +8922,8 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 			if (pg_fsync(openLogFile) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
-						 errmsg("could not fsync log file %u, segment %u: %m",
-								openLogId, openLogSeg)));
+						 errmsg("could not fsync log segment %s: %m",
+								XLogFileNameP(curFileTLI, readSegNo))));
 			if (get_sync_bit(sync_method) != get_sync_bit(new_sync_method))
 				XLogFileClose();
 		}
@@ -9015,7 +8938,7 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
  * 'log' and 'seg' are for error reporting purposes.
  */
 void
-issue_xlog_fsync(int fd, uint32 log, uint32 seg)
+issue_xlog_fsync(int fd, XLogSegNo segno)
 {
 	switch (sync_method)
 	{
@@ -9023,16 +8946,16 @@ issue_xlog_fsync(int fd, uint32 log, uint32 seg)
 			if (pg_fsync_no_writethrough(fd) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
-						 errmsg("could not fsync log file %u, segment %u: %m",
-								log, seg)));
+						 errmsg("could not fsync log file %s: %m",
+								XLogFileNameP(ThisTimeLineID, openLogSegNo))));
 			break;
 #ifdef HAVE_FSYNC_WRITETHROUGH
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 			if (pg_fsync_writethrough(fd) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
-						 errmsg("could not fsync write-through log file %u, segment %u: %m",
-								log, seg)));
+						 errmsg("could not fsync write-through log file %s: %m",
+								XLogFileNameP(ThisTimeLineID, openLogSegNo)))));
 			break;
 #endif
 #ifdef HAVE_FDATASYNC
@@ -9040,8 +8963,8 @@ issue_xlog_fsync(int fd, uint32 log, uint32 seg)
 			if (pg_fdatasync(fd) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
-					errmsg("could not fdatasync log file %u, segment %u: %m",
-						   log, seg)));
+						 errmsg("could not fdatasync log file %s: %m",
+								XLogFileNameP(ThisTimeLineID, openLogSegNo))));
 			break;
 #endif
 		case SYNC_METHOD_OPEN:
@@ -9055,6 +8978,17 @@ issue_xlog_fsync(int fd, uint32 log, uint32 seg)
 }
 
 /*
+ * Return the filename of given log segment, as a palloc'd string.
+ */
+char *
+XLogFileNameP(TimeLineID tli, XLogSegNo segno)
+{
+	char	   *result = palloc(MAXFNAMELEN);
+	XLogFileName(result, tli, segno);
+	return result;
+}
+
+/*
  * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
  * function. It creates the necessary starting checkpoint and constructs the
  * backup label file.
@@ -9085,8 +9019,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 	pg_time_t	stamp_time;
 	char		strfbuf[128];
 	char		xlogfilename[MAXFNAMELEN];
-	uint32		_logId;
-	uint32		_logSeg;
+	XLogSegNo	_logSegNo;
 	struct stat stat_buf;
 	FILE	   *fp;
 	StringInfoData labelfbuf;
@@ -9281,8 +9214,8 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 			LWLockRelease(WALInsertLock);
 		} while (!gotUniqueStartpoint);
 
-		XLByteToSeg(startpoint, _logId, _logSeg);
-		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
+		XLByteToSeg(startpoint, _logSegNo);
+		XLogFileName(xlogfilename, ThisTimeLineID, _logSegNo);
 
 		/*
 		 * Construct backup label file
@@ -9408,8 +9341,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	char		lastxlogfilename[MAXFNAMELEN];
 	char		histfilename[MAXFNAMELEN];
 	char		backupfrom[20];
-	uint32		_logId;
-	uint32		_logSeg;
+	XLogSegNo	_logSegNo;
 	FILE	   *lfp;
 	FILE	   *fp;
 	char		ch;
@@ -9620,8 +9552,8 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	 */
 	RequestXLogSwitch();
 
-	XLByteToPrevSeg(stoppoint, _logId, _logSeg);
-	XLogFileName(stopxlogfilename, ThisTimeLineID, _logId, _logSeg);
+	XLByteToPrevSeg(stoppoint, _logSegNo);
+	XLogFileName(stopxlogfilename, ThisTimeLineID, _logSegNo);
 
 	/* Use the log timezone here, not the session timezone */
 	stamp_time = (pg_time_t) time(NULL);
@@ -9632,8 +9564,8 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	/*
 	 * Write the backup history file
 	 */
-	XLByteToSeg(startpoint, _logId, _logSeg);
-	BackupHistoryFilePath(histfilepath, ThisTimeLineID, _logId, _logSeg,
+	XLByteToSeg(startpoint, _logSegNo);
+	BackupHistoryFilePath(histfilepath, ThisTimeLineID, _logSegNo,
 						  startpoint.xrecoff % XLogSegSize);
 	fp = AllocateFile(histfilepath, "w");
 	if (!fp)
@@ -9682,11 +9614,11 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	 */
 	if (waitforarchive && XLogArchivingActive())
 	{
-		XLByteToPrevSeg(stoppoint, _logId, _logSeg);
-		XLogFileName(lastxlogfilename, ThisTimeLineID, _logId, _logSeg);
+		XLByteToPrevSeg(stoppoint, _logSegNo);
+		XLogFileName(lastxlogfilename, ThisTimeLineID, _logSegNo);
 
-		XLByteToSeg(startpoint, _logId, _logSeg);
-		BackupHistoryFileName(histfilename, ThisTimeLineID, _logId, _logSeg,
+		XLByteToSeg(startpoint, _logSegNo);
+		BackupHistoryFileName(histfilename, ThisTimeLineID, _logSegNo,
 							  startpoint.xrecoff % XLogSegSize);
 
 		seconds_before_warning = 60;
@@ -10023,16 +9955,15 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 	bool		switched_segment = false;
 	uint32		targetPageOff;
 	uint32		targetRecOff;
-	uint32		targetId;
-	uint32		targetSeg;
+	XLogSegNo	targetSegNo;
 	static pg_time_t last_fail_time = 0;
 
-	XLByteToSeg(*RecPtr, targetId, targetSeg);
+	XLByteToSeg(*RecPtr, targetSegNo);
 	targetPageOff = ((RecPtr->xrecoff % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
 	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
 
 	/* Fast exit if we have read the record in the current buffer already */
-	if (failedSources == 0 && targetId == readId && targetSeg == readSeg &&
+	if (failedSources == 0 && targetSegNo == readSegNo &&
 		targetPageOff == readOff && targetRecOff < readLen)
 		return true;
 
@@ -10040,7 +9971,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 && !XLByteInSeg(*RecPtr, readId, readSeg))
+	if (readFile >= 0 && !XLByteInSeg(*RecPtr, readSegNo))
 	{
 		/*
 		 * Request a restartpoint if we've replayed too much
@@ -10048,10 +9979,10 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 		 */
 		if (StandbyMode && bgwriterLaunched)
 		{
-			if (XLogCheckpointNeeded(readId, readSeg))
+			if (XLogCheckpointNeeded(readSegNo))
 			{
 				(void) GetRedoRecPtr();
-				if (XLogCheckpointNeeded(readId, readSeg))
+				if (XLogCheckpointNeeded(readSegNo))
 					RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
 			}
 		}
@@ -10061,7 +9992,7 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 		readSource = 0;
 	}
 
-	XLByteToSeg(*RecPtr, readId, readSeg);
+	XLByteToSeg(*RecPtr, readSegNo);
 
 retry:
 	/* See if we need to retrieve more data */
@@ -10139,7 +10070,7 @@ retry:
 						if (readFile < 0)
 						{
 							readFile =
-								XLogFileRead(readId, readSeg, PANIC,
+								XLogFileRead(readSegNo, PANIC,
 											 recoveryTargetTLI,
 											 XLOG_FROM_STREAM, false);
 							Assert(readFile >= 0);
@@ -10245,7 +10176,7 @@ retry:
 					}
 					/* Don't try to read from a source that just failed */
 					sources &= ~failedSources;
-					readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2,
+					readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
 												  sources);
 					switched_segment = true;
 					if (readFile >= 0)
@@ -10288,8 +10219,7 @@ retry:
 				if (InArchiveRecovery)
 					sources |= XLOG_FROM_ARCHIVE;
 
-				readFile = XLogFileReadAnyTLI(readId, readSeg, emode,
-											  sources);
+				readFile = XLogFileReadAnyTLI(readSegNo, emode, sources);
 				switched_segment = true;
 				if (readFile < 0)
 					return false;
@@ -10334,10 +10264,12 @@ retry:
 		readOff = 0;
 		if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 		{
+			char fname[MAXFNAMELEN];
+			XLogFileName(fname, curFileTLI, readSegNo);
 			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errcode_for_file_access(),
-					 errmsg("could not read from log file %u, segment %u, offset %u: %m",
-							readId, readSeg, readOff)));
+					 errmsg("could not read from log segment %s, offset %u: %m",
+							fname, readOff)));
 			goto next_record_is_invalid;
 		}
 		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
@@ -10348,25 +10280,28 @@ retry:
 	readOff = targetPageOff;
 	if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
 	{
+		char fname[MAXFNAMELEN];
+		XLogFileName(fname, curFileTLI, readSegNo);
 		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errcode_for_file_access(),
-		 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
-				readId, readSeg, readOff)));
+		 errmsg("could not seek in log segment %s to offset %u: %m",
+				fname, readOff)));
 		goto next_record_is_invalid;
 	}
 	if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 	{
+		char fname[MAXFNAMELEN];
+		XLogFileName(fname, curFileTLI, readSegNo);
 		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errcode_for_file_access(),
-		 errmsg("could not read from log file %u, segment %u, offset %u: %m",
-				readId, readSeg, readOff)));
+		 errmsg("could not read from log segment %s, offset %u: %m",
+				fname, readOff)));
 		goto next_record_is_invalid;
 	}
 	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
 		goto next_record_is_invalid;
 
-	Assert(targetId == readId);
-	Assert(targetSeg == readSeg);
+	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(targetRecOff < readLen);
 
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index f3c8a09..a289baa 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -271,8 +271,7 @@ pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
 	char	   *locationstr;
 	unsigned int uxlogid;
 	unsigned int uxrecoff;
-	uint32		xlogid;
-	uint32		xlogseg;
+	XLogSegNo	xlogsegno;
 	uint32		xrecoff;
 	XLogRecPtr	locationpoint;
 	char		xlogfilename[MAXFNAMELEN];
@@ -319,8 +318,8 @@ pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
 	/*
 	 * xlogfilename
 	 */
-	XLByteToPrevSeg(locationpoint, xlogid, xlogseg);
-	XLogFileName(xlogfilename, ThisTimeLineID, xlogid, xlogseg);
+	XLByteToPrevSeg(locationpoint, xlogsegno);
+	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno);
 
 	values[0] = CStringGetTextDatum(xlogfilename);
 	isnull[0] = false;
@@ -328,7 +327,7 @@ pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
 	/*
 	 * offset
 	 */
-	xrecoff = locationpoint.xrecoff - xlogseg * XLogSegSize;
+	xrecoff = locationpoint.xrecoff % XLogSegSize;
 
 	values[1] = UInt32GetDatum(xrecoff);
 	isnull[1] = false;
@@ -354,8 +353,7 @@ pg_xlogfile_name(PG_FUNCTION_ARGS)
 	char	   *locationstr;
 	unsigned int uxlogid;
 	unsigned int uxrecoff;
-	uint32		xlogid;
-	uint32		xlogseg;
+	XLogSegNo	xlogsegno;
 	XLogRecPtr	locationpoint;
 	char		xlogfilename[MAXFNAMELEN];
 
@@ -378,8 +376,8 @@ pg_xlogfile_name(PG_FUNCTION_ARGS)
 	locationpoint.xlogid = uxlogid;
 	locationpoint.xrecoff = uxrecoff;
 
-	XLByteToPrevSeg(locationpoint, xlogid, xlogseg);
-	XLogFileName(xlogfilename, ThisTimeLineID, xlogid, xlogseg);
+	XLByteToPrevSeg(locationpoint, xlogsegno);
+	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno);
 
 	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
 }
@@ -514,6 +512,8 @@ pg_xlog_location_diff(PG_FUNCTION_ARGS)
 	XLogRecPtr	loc1,
 				loc2;
 	Numeric		result;
+	uint64		bytes1,
+				bytes2;
 
 	/*
 	 * Read and parse input
@@ -533,33 +533,17 @@ pg_xlog_location_diff(PG_FUNCTION_ARGS)
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 		   errmsg("could not parse transaction log location \"%s\"", str2)));
 
-	/*
-	 * Sanity check
-	 */
-	if (loc1.xrecoff > XLogFileSize)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("xrecoff \"%X\" is out of valid range, 0..%X", loc1.xrecoff, XLogFileSize)));
-	if (loc2.xrecoff > XLogFileSize)
-		ereport(ERROR,
-				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-				 errmsg("xrecoff \"%X\" is out of valid range, 0..%X", loc2.xrecoff, XLogFileSize)));
+	bytes1 = (((uint64)loc1.xlogid) << 32L) + loc1.xrecoff;
+	bytes2 = (((uint64)loc2.xlogid) << 32L) + loc2.xrecoff;
 
 	/*
-	 * result = XLogFileSize * (xlogid1 - xlogid2) + xrecoff1 - xrecoff2
+	 * result = bytes1 - bytes2.
+	 *
+	 * XXX: this won't handle values higher than 2^63 correctly.
 	 */
 	result = DatumGetNumeric(DirectFunctionCall2(numeric_sub,
-	   DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) loc1.xlogid)),
-	 DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) loc2.xlogid))));
-	result = DatumGetNumeric(DirectFunctionCall2(numeric_mul,
-	  DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) XLogFileSize)),
-												 NumericGetDatum(result)));
-	result = DatumGetNumeric(DirectFunctionCall2(numeric_add,
-												 NumericGetDatum(result),
-	DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) loc1.xrecoff))));
-	result = DatumGetNumeric(DirectFunctionCall2(numeric_sub,
-												 NumericGetDatum(result),
-	DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) loc2.xrecoff))));
+	   DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) bytes1)),
+	   DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) bytes2))));
 
 	PG_RETURN_NUMERIC(result);
 }
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 6aeade9..39229eb 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -776,7 +776,7 @@ IsCheckpointOnSchedule(double progress)
 	{
 		recptr = GetInsertRecPtr();
 		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+			(((double) ((uint64) (recptr.xlogid - ckpt_start_recptr.xlogid) << 32L)) +
 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
 			CheckPointSegments;
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 72e79ce..f5b8e32 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -220,10 +220,8 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 		 * We've left the last tar file "open", so we can now append the
 		 * required WAL files to it.
 		 */
-		uint32		logid,
-					logseg;
-		uint32		endlogid,
-					endlogseg;
+		XLogSegNo	logsegno;
+		XLogSegNo	endlogsegno;
 		struct stat statbuf;
 
 		MemSet(&statbuf, 0, sizeof(statbuf));
@@ -235,8 +233,8 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 		statbuf.st_size = XLogSegSize;
 		statbuf.st_mtime = time(NULL);
 
-		XLByteToSeg(startptr, logid, logseg);
-		XLByteToPrevSeg(endptr, endlogid, endlogseg);
+		XLByteToSeg(startptr, logsegno);
+		XLByteToPrevSeg(endptr, endlogsegno);
 
 		while (true)
 		{
@@ -244,7 +242,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 			char		fn[MAXPGPATH];
 			int			i;
 
-			XLogFilePath(fn, ThisTimeLineID, logid, logseg);
+			XLogFilePath(fn, ThisTimeLineID, logsegno);
 			_tarWriteHeader(fn, NULL, &statbuf);
 
 			/* Send the actual WAL file contents, block-by-block */
@@ -253,8 +251,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 				char		buf[TAR_SEND_SIZE];
 				XLogRecPtr	ptr;
 
-				ptr.xlogid = logid;
-				ptr.xrecoff = logseg * XLogSegSize + TAR_SEND_SIZE * i;
+				XLogSegNoOffsetToRecPtr(logsegno, TAR_SEND_SIZE * i, ptr);
 
 				/*
 				 * Some old compilers, e.g. gcc 2.95.3/x86, think that passing
@@ -276,11 +273,10 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 
 
 			/* Advance to the next WAL file */
-			NextLogSeg(logid, logseg);
+			logsegno++;
 
 			/* Have we reached our stop position yet? */
-			if (logid > endlogid ||
-				(logid == endlogid && logseg > endlogseg))
+			if (logsegno > endlogsegno)
 				break;
 		}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d63ff29..8cbfd7b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -69,11 +69,12 @@ walrcv_disconnect_type walrcv_disconnect = NULL;
 
 /*
  * These variables are used similarly to openLogFile/Id/Seg/Off,
- * but for walreceiver to write the XLOG.
+ * but for walreceiver to write the XLOG. recvFileTLI is the TimeLineID
+ * corresponding the filename of recvFile, used for error messages.
  */
 static int	recvFile = -1;
-static uint32 recvId = 0;
-static uint32 recvSeg = 0;
+static TimeLineID	recvFileTLI = -1;
+static XLogSegNo recvSegNo = 0;
 static uint32 recvOff = 0;
 
 /*
@@ -481,7 +482,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvId, recvSeg))
+		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo))
 		{
 			bool		use_existent;
 
@@ -501,15 +502,16 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				if (close(recvFile) != 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
-						errmsg("could not close log file %u, segment %u: %m",
-							   recvId, recvSeg)));
+							 errmsg("could not close log segment %s: %m",
+									XLogFileNameP(recvFileTLI, recvSegNo))));
 			}
 			recvFile = -1;
 
 			/* Create/use new log file */
-			XLByteToSeg(recptr, recvId, recvSeg);
+			XLByteToSeg(recptr, recvSegNo);
 			use_existent = true;
-			recvFile = XLogFileInit(recvId, recvSeg, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFileTLI = ThisTimeLineID;
 			recvOff = 0;
 		}
 
@@ -527,9 +529,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
-						 errmsg("could not seek in log file %u, "
-								"segment %u to offset %u: %m",
-								recvId, recvSeg, startoff)));
+						 errmsg("could not seek in log segment %s, to offset %u: %m",
+								XLogFileNameP(recvFileTLI, recvSegNo),
+								startoff)));
 			recvOff = startoff;
 		}
 
@@ -544,9 +546,9 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				errno = ENOSPC;
 			ereport(PANIC,
 					(errcode_for_file_access(),
-					 errmsg("could not write to log file %u, segment %u "
+					 errmsg("could not write to log segment %s "
 							"at offset %u, length %lu: %m",
-							recvId, recvSeg,
+							XLogFileNameP(recvFileTLI, recvSegNo),
 							recvOff, (unsigned long) segbytes)));
 		}
 
@@ -575,7 +577,7 @@ XLogWalRcvFlush(bool dying)
 		/* use volatile pointer to prevent code rearrangement */
 		volatile WalRcvData *walrcv = WalRcv;
 
-		issue_xlog_fsync(recvFile, recvId, recvSeg);
+		issue_xlog_fsync(recvFile, recvSegNo);
 
 		LogstreamResult.Flush = LogstreamResult.Write;
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5f93812..3b26eff 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -86,8 +86,7 @@ int			replication_timeout = 60 * 1000;	/* maximum time to send one
  * but for walsender to read the XLOG.
  */
 static int	sendFile = -1;
-static uint32 sendId = 0;
-static uint32 sendSeg = 0;
+static XLogSegNo sendSegNo = 0;
 static uint32 sendOff = 0;
 
 /*
@@ -976,10 +975,8 @@ XLogRead(char *buf, XLogRecPtr startptr, Size count)
 	char		   *p;
 	XLogRecPtr	recptr;
 	Size			nbytes;
-	uint32		lastRemovedLog;
-	uint32		lastRemovedSeg;
-	uint32		log;
-	uint32		seg;
+	XLogSegNo		lastRemovedSegNo;
+	XLogSegNo		segno;
 
 retry:
 	p = buf;
@@ -994,7 +991,7 @@ retry:
 
 		startoff = recptr.xrecoff % XLogSegSize;
 
-		if (sendFile < 0 || !XLByteInSeg(recptr, sendId, sendSeg))
+		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
 		{
 			char		path[MAXPGPATH];
 
@@ -1002,8 +999,8 @@ retry:
 			if (sendFile >= 0)
 				close(sendFile);
 
-			XLByteToSeg(recptr, sendId, sendSeg);
-			XLogFilePath(path, ThisTimeLineID, sendId, sendSeg);
+			XLByteToSeg(recptr, sendSegNo);
+			XLogFilePath(path, ThisTimeLineID, sendSegNo);
 
 			sendFile = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0);
 			if (sendFile < 0)
@@ -1014,20 +1011,15 @@ retry:
 				 * removed or recycled.
 				 */
 				if (errno == ENOENT)
-				{
-					char		filename[MAXFNAMELEN];
-
-					XLogFileName(filename, ThisTimeLineID, sendId, sendSeg);
 					ereport(ERROR,
 							(errcode_for_file_access(),
 							 errmsg("requested WAL segment %s has already been removed",
-									filename)));
-				}
+									XLogFileNameP(ThisTimeLineID, sendSegNo))));
 				else
 					ereport(ERROR,
 							(errcode_for_file_access(),
-							 errmsg("could not open file \"%s\" (log file %u, segment %u): %m",
-									path, sendId, sendSeg)));
+							 errmsg("could not open file \"%s\": %m",
+									path)));
 			}
 			sendOff = 0;
 		}
@@ -1038,8 +1030,9 @@ retry:
 			if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
 				ereport(ERROR,
 						(errcode_for_file_access(),
-						 errmsg("could not seek in log file %u, segment %u to offset %u: %m",
-								sendId, sendSeg, startoff)));
+						 errmsg("could not seek in log segment %s to offset %u: %m",
+								XLogFileNameP(ThisTimeLineID, sendSegNo),
+								startoff)));
 			sendOff = startoff;
 		}
 
@@ -1051,11 +1044,13 @@ retry:
 
 		readbytes = read(sendFile, p, segbytes);
 		if (readbytes <= 0)
+		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
-			errmsg("could not read from log file %u, segment %u, offset %u, "
-				   "length %lu: %m",
-				   sendId, sendSeg, sendOff, (unsigned long) segbytes)));
+			errmsg("could not read from log segment %s, offset %u, length %lu: %m",
+				   XLogFileNameP(ThisTimeLineID, sendSegNo),
+				   sendOff, (unsigned long) segbytes)));
+		}
 
 		/* Update state for read */
 		XLByteAdvance(recptr, readbytes);
@@ -1072,19 +1067,13 @@ retry:
 	 * read() succeeds in that case, but the data we tried to read might
 	 * already have been overwritten with new WAL records.
 	 */
-	XLogGetLastRemoved(&lastRemovedLog, &lastRemovedSeg);
-	XLByteToSeg(startptr, log, seg);
-	if (log < lastRemovedLog ||
-		(log == lastRemovedLog && seg <= lastRemovedSeg))
-	{
-		char		filename[MAXFNAMELEN];
-
-		XLogFileName(filename, ThisTimeLineID, log, seg);
+	XLogGetLastRemoved(&lastRemovedSegNo);
+	XLByteToSeg(startptr, segno);
+	if (segno <= lastRemovedSegNo)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("requested WAL segment %s has already been removed",
-						filename)));
-	}
+						XLogFileNameP(ThisTimeLineID, segno))));
 
 	/*
 	 * During recovery, the currently-open WAL file might be replaced with
@@ -1164,24 +1153,13 @@ XLogSend(char *msgbuf, bool *caughtup)
 	 * SendRqstPtr never points to the middle of a WAL record.
 	 */
 	startptr = sentPtr;
-	if (startptr.xrecoff >= XLogFileSize)
-	{
-		/*
-		 * crossing a logid boundary, skip the non-existent last log segment
-		 * in previous logical log file.
-		 */
-		startptr.xlogid += 1;
-		startptr.xrecoff = 0;
-	}
-
 	endptr = startptr;
 	XLByteAdvance(endptr, MAX_SEND_SIZE);
 	if (endptr.xlogid != startptr.xlogid)
 	{
 		/* Don't cross a logfile boundary within one message */
 		Assert(endptr.xlogid == startptr.xlogid + 1);
-		endptr.xlogid = startptr.xlogid;
-		endptr.xrecoff = XLogFileSize;
+		endptr.xrecoff = 0;
 	}
 
 	/* if we went beyond SendRqstPtr, back off */
@@ -1197,7 +1175,10 @@ XLogSend(char *msgbuf, bool *caughtup)
 		*caughtup = false;
 	}
 
-	nbytes = endptr.xrecoff - startptr.xrecoff;
+	if (endptr.xrecoff == 0)
+		nbytes = 0x100000000L - (uint64) startptr.xrecoff;
+	else
+		nbytes = endptr.xrecoff - startptr.xrecoff;
 	Assert(nbytes <= MAX_SEND_SIZE);
 
 	/*
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 084ddc4..5a7ad81 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -103,8 +103,7 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 	struct dirent *dirent;
 	int			i;
 	bool		b;
-	uint32		high_log = 0;
-	uint32		high_seg = 0;
+	XLogSegNo	high_segno = 0;
 
 	dir = opendir(basedir);
 	if (dir == NULL)
@@ -118,9 +117,10 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 	{
 		char		fullpath[MAXPGPATH];
 		struct stat statbuf;
-		uint32		tli,
-					log,
+		uint32		tli;
+		unsigned int log,
 					seg;
+		XLogSegNo	segno;
 
 		if (strcmp(dirent->d_name, ".") == 0 || strcmp(dirent->d_name, "..") == 0)
 			continue;
@@ -152,6 +152,7 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 					progname, dirent->d_name);
 			disconnect_and_exit(1);
 		}
+		segno = ((uint64) log) << 32 | seg;
 
 		/* Ignore any files that are for another timeline */
 		if (tli != currenttimeline)
@@ -169,11 +170,9 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 		if (statbuf.st_size == XLOG_SEG_SIZE)
 		{
 			/* Completed segment */
-			if (log > high_log ||
-				(log == high_log && seg > high_seg))
+			if (segno > high_segno)
 			{
-				high_log = log;
-				high_seg = seg;
+				high_segno = segno;
 				continue;
 			}
 		}
@@ -187,17 +186,16 @@ FindStreamingStart(XLogRecPtr currentpos, uint32 currenttimeline)
 
 	closedir(dir);
 
-	if (high_log > 0 || high_seg > 0)
+	if (high_segno > 0)
 	{
 		XLogRecPtr	high_ptr;
 		/*
 		 * Move the starting pointer to the start of the next segment,
 		 * since the highest one we've seen was completed.
 		 */
-		NextLogSeg(high_log, high_seg);
+		high_segno++;
 
-		high_ptr.xlogid = high_log;
-		high_ptr.xrecoff = high_seg * XLOG_SEG_SIZE;
+		XLogSegNoOffsetToRecPtr(high_segno, 0, high_ptr);
 
 		return high_ptr;
 	}
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index efbc4ca..6cb209b 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -54,9 +54,10 @@ open_walfile(XLogRecPtr startpoint, uint32 timeline, char *basedir, char *namebu
 	struct stat	statbuf;
 	char	   *zerobuf;
 	int			bytes;
+	XLogSegNo	segno;
 
-	XLogFileName(namebuf, timeline, startpoint.xlogid,
-				 startpoint.xrecoff / XLOG_SEG_SIZE);
+	XLByteToSeg(startpoint, segno);
+	XLogFileName(namebuf, timeline, segno);
 
 	snprintf(fn, sizeof(fn), "%s/%s.partial", basedir, namebuf);
 	f = open(fn, O_WRONLY | O_CREAT | PG_BINARY, S_IRUSR | S_IWUSR);
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 8e2a253..0012cff 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -60,8 +60,7 @@ extern char *optarg;
 
 
 static ControlFileData ControlFile;		/* pg_control values */
-static uint32 newXlogId,
-			newXlogSeg;			/* ID/Segment of new XLOG segment */
+static XLogSegNo newXlogSegNo;	/* new XLOG segment # */
 static bool guessed = false;	/* T if we had to guess at any values */
 static const char *progname;
 
@@ -87,12 +86,9 @@ main(int argc, char *argv[])
 	Oid			set_oid = 0;
 	MultiXactId set_mxid = 0;
 	MultiXactOffset set_mxoff = (MultiXactOffset) -1;
-	uint32		minXlogTli = 0,
-				minXlogId = 0,
-				minXlogSeg = 0;
+	uint32		minXlogTli = 0;
+	XLogSegNo	minXlogSegNo = 0;
 	char	   *endptr;
-	char	   *endptr2;
-	char	   *endptr3;
 	char	   *DataDir;
 	int			fd;
 	char		path[MAXPGPATH];
@@ -204,27 +200,13 @@ main(int argc, char *argv[])
 				break;
 
 			case 'l':
-				minXlogTli = strtoul(optarg, &endptr, 0);
-				if (endptr == optarg || *endptr != ',')
-				{
-					fprintf(stderr, _("%s: invalid argument for option -l\n"), progname);
-					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
-					exit(1);
-				}
-				minXlogId = strtoul(endptr + 1, &endptr2, 0);
-				if (endptr2 == endptr + 1 || *endptr2 != ',')
-				{
-					fprintf(stderr, _("%s: invalid argument for option -l\n"), progname);
-					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
-					exit(1);
-				}
-				minXlogSeg = strtoul(endptr2 + 1, &endptr3, 0);
-				if (endptr3 == endptr2 + 1 || *endptr3 != '\0')
+				if (strspn(optarg, "01234567890ABCDEFabcdef") != 24)
 				{
 					fprintf(stderr, _("%s: invalid argument for option -l\n"), progname);
 					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 					exit(1);
 				}
+				XLogFromFileName(optarg, &minXlogTli, &minXlogSegNo);
 				break;
 
 			default:
@@ -295,7 +277,7 @@ main(int argc, char *argv[])
 		GuessControlValues();
 
 	/*
-	 * Also look at existing segment files to set up newXlogId/newXlogSeg
+	 * Also look at existing segment files to set up newXlogSegNo
 	 */
 	FindEndOfXLOG();
 
@@ -335,13 +317,8 @@ main(int argc, char *argv[])
 	if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID)
 		ControlFile.checkPointCopy.ThisTimeLineID = minXlogTli;
 
-	if (minXlogId > newXlogId ||
-		(minXlogId == newXlogId &&
-		 minXlogSeg > newXlogSeg))
-	{
-		newXlogId = minXlogId;
-		newXlogSeg = minXlogSeg;
-	}
+	if (minXlogSegNo > newXlogSegNo)
+		newXlogSegNo = minXlogSegNo;
 
 	/*
 	 * If we had to guess anything, and -f was not given, just print the
@@ -545,6 +522,7 @@ static void
 PrintControlValues(bool guessed)
 {
 	char		sysident_str[32];
+	char		fname[MAXFNAMELEN];
 
 	if (guessed)
 		printf(_("Guessed pg_control values:\n\n"));
@@ -558,10 +536,10 @@ PrintControlValues(bool guessed)
 	snprintf(sysident_str, sizeof(sysident_str), UINT64_FORMAT,
 			 ControlFile.system_identifier);
 
-	printf(_("First log file ID after reset:        %u\n"),
-		   newXlogId);
-	printf(_("First log file segment after reset:   %u\n"),
-		   newXlogSeg);
+	XLogFileName(fname, ControlFile.checkPointCopy.ThisTimeLineID, newXlogSegNo);
+
+	printf(_("First log segment after reset:        %s\n"),
+		   fname);
 	printf(_("pg_control version number:            %u\n"),
 		   ControlFile.pg_control_version);
 	printf(_("Catalog version number:               %u\n"),
@@ -624,11 +602,10 @@ RewriteControlFile(void)
 
 	/*
 	 * Adjust fields as needed to force an empty XLOG starting at
-	 * newXlogId/newXlogSeg.
+	 * newXlogSegNo.
 	 */
-	ControlFile.checkPointCopy.redo.xlogid = newXlogId;
-	ControlFile.checkPointCopy.redo.xrecoff =
-		newXlogSeg * XLogSegSize + SizeOfXLogLongPHD;
+	XLogSegNoOffsetToRecPtr(newXlogSegNo, SizeOfXLogLongPHD,
+							ControlFile.checkPointCopy.redo);
 	ControlFile.checkPointCopy.time = (pg_time_t) time(NULL);
 
 	ControlFile.state = DB_SHUTDOWNED;
@@ -728,14 +705,17 @@ FindEndOfXLOG(void)
 {
 	DIR		   *xldir;
 	struct dirent *xlde;
+	uint64		segs_per_xlogid;
+	uint64		xlogbytepos;
 
 	/*
 	 * Initialize the max() computation using the last checkpoint address from
 	 * old pg_control.	Note that for the moment we are working with segment
 	 * numbering according to the old xlog seg size.
 	 */
-	newXlogId = ControlFile.checkPointCopy.redo.xlogid;
-	newXlogSeg = ControlFile.checkPointCopy.redo.xrecoff / ControlFile.xlog_seg_size;
+	segs_per_xlogid = (0x100000000L / ControlFile.xlog_seg_size);
+	newXlogSegNo = ((uint64) ControlFile.checkPointCopy.redo.xlogid) * segs_per_xlogid
+		+ (ControlFile.checkPointCopy.redo.xrecoff / ControlFile.xlog_seg_size);
 
 	/*
 	 * Scan the pg_xlog directory to find existing WAL segment files. We
@@ -759,8 +739,10 @@ FindEndOfXLOG(void)
 			unsigned int tli,
 						log,
 						seg;
+			XLogSegNo	segno;
 
 			sscanf(xlde->d_name, "%08X%08X%08X", &tli, &log, &seg);
+			segno = ((uint64) log) * segs_per_xlogid + seg;
 
 			/*
 			 * Note: we take the max of all files found, regardless of their
@@ -768,12 +750,8 @@ FindEndOfXLOG(void)
 			 * timelines other than the target TLI, but this seems safer.
 			 * Better too large a result than too small...
 			 */
-			if (log > newXlogId ||
-				(log == newXlogId && seg > newXlogSeg))
-			{
-				newXlogId = log;
-				newXlogSeg = seg;
-			}
+			if (segno > newXlogSegNo)
+				newXlogSegNo = segno;
 		}
 		errno = 0;
 	}
@@ -799,11 +777,9 @@ FindEndOfXLOG(void)
 	 * Finally, convert to new xlog seg size, and advance by one to ensure we
 	 * are in virgin territory.
 	 */
-	newXlogSeg *= ControlFile.xlog_seg_size;
-	newXlogSeg = (newXlogSeg + XLogSegSize - 1) / XLogSegSize;
-
-	/* be sure we wrap around correctly at end of a logfile */
-	NextLogSeg(newXlogId, newXlogSeg);
+	xlogbytepos = newXlogSegNo * ControlFile.xlog_seg_size;
+	newXlogSegNo = (xlogbytepos + XLogSegSize - 1) / XLogSegSize;
+	newXlogSegNo++;
 }
 
 
@@ -972,8 +948,7 @@ WriteEmptyXLOG(void)
 	record->xl_crc = crc;
 
 	/* Write the first page */
-	XLogFilePath(path, ControlFile.checkPointCopy.ThisTimeLineID,
-				 newXlogId, newXlogSeg);
+	XLogFilePath(path, ControlFile.checkPointCopy.ThisTimeLineID, newXlogSegNo);
 
 	unlink(path);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index df5f232..b581910 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -267,12 +267,10 @@ extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int XLogFileInit(uint32 log, uint32 seg,
-			 bool *use_existent, bool use_lock);
-extern int	XLogFileOpen(uint32 log, uint32 seg);
+extern int XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
+extern int	XLogFileOpen(XLogSegNo segno);
 
-
-extern void XLogGetLastRemoved(uint32 *log, uint32 *seg);
+extern void XLogGetLastRemoved(XLogSegNo *segno);
 extern void XLogSetAsyncXactLSN(XLogRecPtr record);
 
 extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
@@ -280,7 +278,7 @@ extern void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup);
 extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
 
-extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
+extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
 extern bool HotStandbyActive(void);
@@ -294,6 +292,7 @@ extern bool RecoveryIsPaused(void);
 extern void SetRecoveryPause(bool recoveryPause);
 extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
+extern char *XLogFileNameP(TimeLineID tli, XLogSegNo segno);
 
 extern void UpdateControlFile(void);
 extern uint64 GetSystemIdentifier(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 2020a3b..e1d4bc8 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -115,55 +115,27 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 	(((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
 
 /*
- * We break each logical log file (xlogid value) into segment files of the
- * size indicated by XLOG_SEG_SIZE.  One possible segment at the end of each
- * log file is wasted, to ensure that we don't have problems representing
- * last-byte-position-plus-1.
+ * The XLOG is split into WAL segments (physical files) of the size indicated
+ * by XLOG_SEG_SIZE.
  */
 #define XLogSegSize		((uint32) XLOG_SEG_SIZE)
-#define XLogSegsPerFile (((uint32) 0xffffffff) / XLogSegSize)
-#define XLogFileSize	(XLogSegsPerFile * XLogSegSize)
+#define XLogSegmentsPerXLogId	(0x100000000L / XLOG_SEG_SIZE)
 
+#define XLogSegNoOffsetToRecPtr(segno, offset, dest) \
+	do {	\
+		(dest).xlogid = (segno) / XLogSegmentsPerXLogId;				\
+		(dest).xrecoff = ((segno) % XLogSegmentsPerXLogId) * XLOG_SEG_SIZE + (offset); \
+	} while (0)
 
 /*
  * Macros for manipulating XLOG pointers
  */
 
-/* Increment an xlogid/segment pair */
-#define NextLogSeg(logId, logSeg)	\
-	do { \
-		if ((logSeg) >= XLogSegsPerFile-1) \
-		{ \
-			(logId)++; \
-			(logSeg) = 0; \
-		} \
-		else \
-			(logSeg)++; \
-	} while (0)
-
-/* Decrement an xlogid/segment pair (assume it's not 0,0) */
-#define PrevLogSeg(logId, logSeg)	\
-	do { \
-		if (logSeg) \
-			(logSeg)--; \
-		else \
-		{ \
-			(logId)--; \
-			(logSeg) = XLogSegsPerFile-1; \
-		} \
-	} while (0)
-
 /* Align a record pointer to next page */
 #define NextLogPage(recptr) \
 	do {	\
 		if ((recptr).xrecoff % XLOG_BLCKSZ != 0)	\
-			(recptr).xrecoff +=	\
-				(XLOG_BLCKSZ - (recptr).xrecoff % XLOG_BLCKSZ);	\
-		if ((recptr).xrecoff >= XLogFileSize) \
-		{	\
-			((recptr).xlogid)++;	\
-			(recptr).xrecoff = 0; \
-		}	\
+			XLByteAdvance(recptr, (XLOG_BLCKSZ - (recptr).xrecoff % XLOG_BLCKSZ)); \
 	} while (0)
 
 /*
@@ -175,14 +147,11 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
  * for example.  (We can assume xrecoff is not zero, since no valid recptr
  * can have that.)
  */
-#define XLByteToSeg(xlrp, logId, logSeg)	\
-	( logId = (xlrp).xlogid, \
-	  logSeg = (xlrp).xrecoff / XLogSegSize \
-	)
-#define XLByteToPrevSeg(xlrp, logId, logSeg)	\
-	( logId = (xlrp).xlogid, \
-	  logSeg = ((xlrp).xrecoff - 1) / XLogSegSize \
-	)
+#define XLByteToSeg(xlrp, logSegNo)	\
+	logSegNo = ((uint64) (xlrp).xlogid * XLogSegmentsPerXLogId) + (xlrp).xrecoff / XLogSegSize
+
+#define XLByteToPrevSeg(xlrp, logSegNo)	\
+	logSegNo = ((uint64) (xlrp).xlogid * XLogSegmentsPerXLogId) + ((xlrp).xrecoff - 1) / XLogSegSize
 
 /*
  * Is an XLogRecPtr within a particular XLOG segment?
@@ -190,13 +159,16 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
  * For XLByteInSeg, do the computation at face value.  For XLByteInPrevSeg,
  * a boundary byte is taken to be in the previous segment.
  */
-#define XLByteInSeg(xlrp, logId, logSeg)	\
-	((xlrp).xlogid == (logId) && \
-	 (xlrp).xrecoff / XLogSegSize == (logSeg))
+#define XLByteInSeg(xlrp, logSegNo)	\
+	(((xlrp).xlogid) == (logSegNo) / XLogSegmentsPerXLogId &&			\
+	 ((xlrp).xrecoff / XLogSegSize) == (logSegNo) % XLogSegmentsPerXLogId)
 
-#define XLByteInPrevSeg(xlrp, logId, logSeg)	\
-	((xlrp).xlogid == (logId) && \
-	 ((xlrp).xrecoff - 1) / XLogSegSize == (logSeg))
+#define XLByteInPrevSeg(xlrp, logSegNo)	\
+	(((xlrp).xrecoff == 0) ?											\
+		(((xlrp).xlogid - 1) == (logSegNo) / XLogSegmentsPerXLogId && \
+		 ((uint32) 0xffffffff) / XLogSegSize == (logSegNo) % XLogSegmentsPerXLogId) : \
+		((xlrp).xlogid) == (logSegNo) / XLogSegmentsPerXLogId &&	\
+		 (((xlrp).xrecoff - 1) / XLogSegSize) == (logSegNo) % XLogSegmentsPerXLogId)
 
 /* Check if an xrecoff value is in a plausible range */
 #define XRecOffIsValid(xrecoff) \
@@ -215,14 +187,23 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
  */
 #define MAXFNAMELEN		64
 
-#define XLogFileName(fname, tli, log, seg)	\
-	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, log, seg)
-
-#define XLogFromFileName(fname, tli, log, seg)	\
-	sscanf(fname, "%08X%08X%08X", tli, log, seg)
+#define XLogFileName(fname, tli, logSegNo)	\
+	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,		\
+			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
+			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
+
+#define XLogFromFileName(fname, tli, logSegNo)	\
+	do {												\
+		uint32 log;										\
+		uint32 seg;										\
+		sscanf(fname, "%08X%08X%08X", tli, &log, &seg);	\
+		*logSegNo = (uint64) log * XLogSegmentsPerXLogId + seg;	\
+	} while (0)
 
-#define XLogFilePath(path, tli, log, seg)	\
-	snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X", tli, log, seg)
+#define XLogFilePath(path, tli, logSegNo)	\
+	snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X", tli,				\
+			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId),				\
+			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
 
 #define TLHistoryFileName(fname, tli)	\
 	snprintf(fname, MAXFNAMELEN, "%08X.history", tli)
@@ -233,11 +214,15 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define StatusFilePath(path, xlog, suffix)	\
 	snprintf(path, MAXPGPATH, XLOGDIR "/archive_status/%s%s", xlog, suffix)
 
-#define BackupHistoryFileName(fname, tli, log, seg, offset) \
-	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X.%08X.backup", tli, log, seg, offset)
+#define BackupHistoryFileName(fname, tli, logSegNo, offset) \
+	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X.%08X.backup", tli, \
+			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId),		  \
+			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId), offset)
 
-#define BackupHistoryFilePath(path, tli, log, seg, offset)	\
-	snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X.%08X.backup", tli, log, seg, offset)
+#define BackupHistoryFilePath(path, tli, logSegNo, offset)	\
+	snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X.%08X.backup", tli, \
+			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
+			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId), offset)
 
 
 /*
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 5e6d7e6..6038548 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -61,16 +61,16 @@ typedef struct XLogRecPtr
  */
 #define XLByteAdvance(recptr, nbytes)						\
 	do {													\
-		if (recptr.xrecoff + nbytes >= XLogFileSize)		\
-		{													\
-			recptr.xlogid += 1;								\
-			recptr.xrecoff									\
-				= recptr.xrecoff + nbytes - XLogFileSize;	\
-		}													\
-		else												\
-			recptr.xrecoff += nbytes;						\
+		uint32 oldxrecoff = (recptr).xrecoff;				\
+		(recptr).xrecoff += nbytes;							\
+		if ((recptr).xrecoff < oldxrecoff)					\
+			(recptr).xlogid += 1;		/* xrecoff wrapped around */	\
 	} while (0)
 
+/*
+ * XLogSegNo - physical log file sequence number.
+ */
+typedef uint64 XLogSegNo;
 
 /*
  * TimeLineID (TLI) - identifies different database histories to prevent

2-move-continuation-record-to-page-header.patchtext/x-diff; name=2-move-continuation-record-to-page-header.patchDownload

commit 0114c5ed160a94e35e35c5c018360ff48ee15771
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Wed Jun 6 13:11:56 2012 +0300

    Move continuation record field xl_rem_len to xlog page header.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fddfbc4..6935149 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -693,7 +693,6 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	XLogRecord *record;
-	XLogContRecord *contrecord;
 	XLogRecPtr	RecPtr;
 	XLogRecPtr	WriteRqst;
 	uint32		freespace;
@@ -1082,9 +1081,7 @@ begin:;
 		curridx = Insert->curridx;
 		/* Insert cont-record header */
 		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
-		contrecord = (XLogContRecord *) Insert->currpos;
-		contrecord->xl_rem_len = write_len;
-		Insert->currpos += SizeOfXLogContRecord;
+		Insert->currpage->xlp_rem_len = write_len;
 		freespace = INSERT_FREESPACE(Insert);
 	}
 
@@ -3930,7 +3927,8 @@ retry:
 	if (total_len > len)
 	{
 		/* Need to reassemble record */
-		XLogContRecord *contrecord;
+		char	   *contrecord;
+		XLogPageHeader pageHeader;
 		XLogRecPtr	pagelsn;
 		uint32		gotlen = len;
 
@@ -3958,30 +3956,30 @@ retry:
 								readOff)));
 				goto next_record_is_invalid;
 			}
-			pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
-			contrecord = (XLogContRecord *) ((char *) readBuf + pageHeaderSize);
-			if (contrecord->xl_rem_len == 0 ||
-				total_len != (contrecord->xl_rem_len + gotlen))
+			pageHeader = (XLogPageHeader) readBuf;
+			pageHeaderSize = XLogPageHeaderSize(pageHeader);
+			contrecord = (char *) readBuf + pageHeaderSize;
+			if (pageHeader->xlp_rem_len == 0 ||
+				total_len != (pageHeader->xlp_rem_len + gotlen))
 			{
 				char fname[MAXFNAMELEN];
 				XLogFileName(fname, curFileTLI, readSegNo);
 				ereport(emode_for_corrupt_record(emode, *RecPtr),
 						(errmsg("invalid contrecord length %u in log segment %s, offset %u",
-								contrecord->xl_rem_len,
+								pageHeader->xlp_rem_len,
 								XLogFileNameP(curFileTLI, readSegNo),
 								readOff)));
 				goto next_record_is_invalid;
 			}
-			len = XLOG_BLCKSZ - pageHeaderSize - SizeOfXLogContRecord;
-			if (contrecord->xl_rem_len > len)
+			len = XLOG_BLCKSZ - pageHeaderSize;
+			if (pageHeader->xlp_rem_len > len)
 			{
-				memcpy(buffer, (char *) contrecord + SizeOfXLogContRecord, len);
+				memcpy(buffer, (char *) contrecord, len);
 				gotlen += len;
 				buffer += len;
 				continue;
 			}
-			memcpy(buffer, (char *) contrecord + SizeOfXLogContRecord,
-				   contrecord->xl_rem_len);
+			memcpy(buffer, (char *) contrecord, pageHeader->xlp_rem_len);
 			break;
 		}
 		if (!RecordIsValid(record, *RecPtr, emode))
@@ -3989,8 +3987,7 @@ retry:
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
 		XLogSegNoOffsetToRecPtr(
 			readSegNo,
-			readOff + pageHeaderSize +
-				MAXALIGN(SizeOfXLogContRecord + contrecord->xl_rem_len),
+			readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
 			EndRecPtr);
 		ReadRecPtr = *RecPtr;
 		/* needn't worry about XLOG SWITCH, it can't cross page boundaries */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index e1d4bc8..239b749 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -49,26 +49,6 @@ typedef struct BkpBlock
 } BkpBlock;
 
 /*
- * When there is not enough space on current page for whole record, we
- * continue on the next page with continuation record.	(However, the
- * XLogRecord header will never be split across pages; if there's less than
- * SizeOfXLogRecord space left at the end of a page, we just waste it.)
- *
- * Note that xl_rem_len includes backup-block data; that is, it tracks
- * xl_tot_len not xl_len in the initial header.  Also note that the
- * continuation data isn't necessarily aligned.
- */
-typedef struct XLogContRecord
-{
-	uint32		xl_rem_len;		/* total len of remaining data for record */
-
-	/* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */
-
-} XLogContRecord;
-
-#define SizeOfXLogContRecord	sizeof(XLogContRecord)
-
-/*
  * Each page of XLOG file has a header like this:
  */
 #define XLOG_PAGE_MAGIC 0xD071	/* can be used as WAL version indicator */
@@ -79,6 +59,19 @@ typedef struct XLogPageHeaderData
 	uint16		xlp_info;		/* flag bits, see below */
 	TimeLineID	xlp_tli;		/* TimeLineID of first record on page */
 	XLogRecPtr	xlp_pageaddr;	/* XLOG address of this page */
+
+	/*
+	 * When there is not enough space on current page for whole record, we
+	 * continue on the next page.  xlp_rem_len is the number of bytes
+	 * remaining from a previous page. (However, the XLogRecord header will
+	 * never be split across pages; if there's less than SizeOfXLogRecord
+	 * space left at the end of a page, we just waste it.)
+	 *
+	 * Note that xl_rem_len includes backup-block data; that is, it tracks
+	 * xl_tot_len not xl_len in the initial header.  Also note that the
+	 * continuation data isn't necessarily aligned.
+	 */
+	uint32		xlp_rem_len;	/* total len of remaining data for record */
 } XLogPageHeaderData;
 
 #define SizeOfXLogShortPHD	MAXALIGN(sizeof(XLogPageHeaderData))

3-allow-wal-record-header-to-be-split.patchtext/x-diff; name=3-allow-wal-record-header-to-be-split.patchDownload

commit e20cb1e1713f2e37b3e98475a35c9b40842d20a3
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Thu Jun 7 16:09:23 2012 +0300

    Allow WAL record headers to be split across pages.
    
    Rearrange XLogRecord so that xl_tot_len is the first field, to make it
    easier to reassemble records.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6935149..3f5e0b2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -650,7 +650,9 @@ static void CleanupBackupHistory(void);
 static void UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force);
 static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt);
 static void CheckRecoveryConsistency(void);
-static bool ValidXLOGHeader(XLogPageHeader hdr, int emode);
+static bool ValidXLogPageHeader(XLogPageHeader hdr, int emode);
+static bool ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record,
+					  int emode, bool randAccess);
 static XLogRecord *ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt);
 static List *readTimeLineHistory(TimeLineID targetTLI);
 static bool existsTimeLineHistory(TimeLineID probeTLI);
@@ -692,7 +694,6 @@ XLogRecPtr
 XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	XLogRecord *record;
 	XLogRecPtr	RecPtr;
 	XLogRecPtr	WriteRqst;
 	uint32		freespace;
@@ -706,6 +707,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS];
 	XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS];
 	XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS];
+	XLogRecData hdr_rdt;
 	pg_crc32	rdata_crc;
 	uint32		len,
 				write_len;
@@ -714,6 +716,15 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	bool		doPageWrites;
 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
 	uint8		info_orig = info;
+	static XLogRecord *rechdr;
+
+	if (rechdr == NULL)
+	{
+		rechdr = malloc(SizeOfXLogRecord);
+		if (rechdr == NULL)
+			elog(ERROR, "out of memory");
+		MemSet(rechdr, 0, SizeOfXLogRecord);
+	}
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
@@ -900,6 +911,22 @@ begin:;
 	for (rdt = rdata; rdt != NULL; rdt = rdt->next)
 		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
 
+	/*
+	 * Construct record header (prev-link and CRC are filled in later), and
+	 * make that the first chunk in the chain.
+	 */
+	rechdr->xl_xid = GetCurrentTransactionIdIfAny();
+	rechdr->xl_tot_len = SizeOfXLogRecord + write_len;
+	rechdr->xl_len = len;		/* doesn't include backup blocks */
+	rechdr->xl_info = info;
+	rechdr->xl_rmid = rmid;
+
+	hdr_rdt.next = rdata;
+	hdr_rdt.data = (char *) rechdr;
+	hdr_rdt.len = SizeOfXLogRecord;
+
+	write_len += SizeOfXLogRecord;
+
 	START_CRIT_SECTION();
 
 	/* Now wait to get insert lock */
@@ -959,12 +986,12 @@ begin:;
 	}
 
 	/*
-	 * If there isn't enough space on the current XLOG page for a record
-	 * header, advance to the next page (leaving the unused space as zeroes).
+	 * If the current page is completely full, the record goes to the next
+	 * page, right after the page header.
 	 */
 	updrqst = false;
 	freespace = INSERT_FREESPACE(Insert);
-	if (freespace < SizeOfXLogRecord)
+	if (freespace == 0)
 	{
 		updrqst = AdvanceXLInsertBuffer(false);
 		freespace = INSERT_FREESPACE(Insert);
@@ -1006,21 +1033,13 @@ begin:;
 		return RecPtr;
 	}
 
-	/* Insert record header */
-
-	record = (XLogRecord *) Insert->currpos;
-	record->xl_prev = Insert->PrevRecord;
-	record->xl_xid = GetCurrentTransactionIdIfAny();
-	record->xl_tot_len = SizeOfXLogRecord + write_len;
-	record->xl_len = len;		/* doesn't include backup blocks */
-	record->xl_info = info;
-	record->xl_rmid = rmid;
+	/* Finish the record header */
+	rechdr->xl_prev = Insert->PrevRecord;
 
 	/* Now we can finish computing the record's CRC */
-	COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
-			   SizeOfXLogRecord - sizeof(pg_crc32));
+	COMP_CRC32(rdata_crc, (char *) rechdr, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32(rdata_crc);
-	record->xl_crc = rdata_crc;
+	rechdr->xl_crc = rdata_crc;
 
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
@@ -1030,11 +1049,11 @@ begin:;
 		initStringInfo(&buf);
 		appendStringInfo(&buf, "INSERT @ %X/%X: ",
 						 RecPtr.xlogid, RecPtr.xrecoff);
-		xlog_outrec(&buf, record);
+		xlog_outrec(&buf, rechdr);
 		if (rdata->data != NULL)
 		{
 			appendStringInfo(&buf, " - ");
-			RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data);
+			RmgrTable[rechdr->xl_rmid].rm_desc(&buf, rechdr->xl_info, rdata->data);
 		}
 		elog(LOG, "%s", buf.data);
 		pfree(buf.data);
@@ -1045,12 +1064,10 @@ begin:;
 	ProcLastRecPtr = RecPtr;
 	Insert->PrevRecord = RecPtr;
 
-	Insert->currpos += SizeOfXLogRecord;
-	freespace -= SizeOfXLogRecord;
-
 	/*
 	 * Append the data, including backup blocks if any
 	 */
+	rdata = &hdr_rdt;
 	while (write_len)
 	{
 		while (rdata->data == NULL)
@@ -1168,7 +1185,7 @@ begin:;
 		/* normal case, ie not xlog switch */
 
 		/* Need to update shared LogwrtRqst if some block was filled up */
-		if (freespace < SizeOfXLogRecord)
+		if (freespace == 0)
 		{
 			/* curridx is filled and available for writing out */
 			updrqst = true;
@@ -2087,7 +2104,7 @@ XLogFlush(XLogRecPtr record)
 				XLogCtlInsert *Insert = &XLogCtl->Insert;
 				uint32		freespace = INSERT_FREESPACE(Insert);
 
-				if (freespace < SizeOfXLogRecord)		/* buffer is full */
+				if (freespace == 0)		/* buffer is full */
 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
 				else
 				{
@@ -3694,8 +3711,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
 	}
 
 	/* Finally include the record header */
-	COMP_CRC32(crc, (char *) record + sizeof(pg_crc32),
-			   SizeOfXLogRecord - sizeof(pg_crc32));
+	COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32(crc);
 
 	if (!EQ_CRC32(record->xl_crc, crc))
@@ -3725,13 +3741,13 @@ static XLogRecord *
 ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 {
 	XLogRecord *record;
-	char	   *buffer;
 	XLogRecPtr	tmpRecPtr = EndRecPtr;
 	bool		randAccess = false;
 	uint32		len,
 				total_len;
 	uint32		targetRecOff;
 	uint32		pageHeaderSize;
+	bool		gotheader;
 
 	if (readBuf == NULL)
 	{
@@ -3744,6 +3760,10 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 		 */
 		readBuf = (char *) malloc(XLOG_BLCKSZ);
 		Assert(readBuf != NULL);
+
+		readRecordBuf = malloc(XLOG_BLCKSZ);
+		Assert(readRecordBuf != NULL);
+		readRecordBufSize = XLOG_BLCKSZ;
 	}
 
 	if (RecPtr == NULL)
@@ -3751,17 +3771,10 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 		RecPtr = &tmpRecPtr;
 
 		/*
-		 * RecPtr is pointing to end+1 of the previous WAL record.  We must
-		 * advance it if necessary to where the next record starts.  First,
-		 * align to next page if no more records can fit on the current page.
-		 */
-		if (XLOG_BLCKSZ - (RecPtr->xrecoff % XLOG_BLCKSZ) < SizeOfXLogRecord)
-			NextLogPage(*RecPtr);
-
-		/*
-		 * If at page start, we must skip over the page header.  But we can't
-		 * do that until we've read in the page, since the header size is
-		 * variable.
+		 * RecPtr is pointing to end+1 of the previous WAL record.  If
+		 * we're at a page boundary, no more records can fir on the current
+		 * page. We must skip over the page header, but we can't do that
+		 * until we've read in the page, since the header size is variable.
 		 */
 	}
 	else
@@ -3782,7 +3795,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 		 * to go backwards (but we can't reset that variable right here, since
 		 * we might not change files at all).
 		 */
-		lastPageTLI = 0;		/* see comment in ValidXLOGHeader */
+		lastPageTLI = 0;		/* see comment in ValidXLogPageHeader */
 		randAccess = true;		/* allow curFileTLI to go backwards too */
 	}
 
@@ -3822,77 +3835,17 @@ retry:
 						RecPtr->xlogid, RecPtr->xrecoff)));
 		goto next_record_is_invalid;
 	}
-	record = (XLogRecord *) ((char *) readBuf + RecPtr->xrecoff % XLOG_BLCKSZ);
 
 	/*
-	 * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
-	 * required.
+	 * NB: Even though we use an XLogRecord pointer here, the whole record
+	 * header might not fit on this page. xl_tot_len is the first field in
+	 * struct, so it must be on this page, but we cannot safely access any
+	 * other fields yet.
 	 */
-	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
-	{
-		if (record->xl_len != 0)
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("invalid xlog switch record at %X/%X",
-							RecPtr->xlogid, RecPtr->xrecoff)));
-			goto next_record_is_invalid;
-		}
-	}
-	else if (record->xl_len == 0)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("record with zero length at %X/%X",
-						RecPtr->xlogid, RecPtr->xrecoff)));
-		goto next_record_is_invalid;
-	}
-	if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
-		record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
-		XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("invalid record length at %X/%X",
-						RecPtr->xlogid, RecPtr->xrecoff)));
-		goto next_record_is_invalid;
-	}
-	if (record->xl_rmid > RM_MAX_ID)
-	{
-		ereport(emode_for_corrupt_record(emode, *RecPtr),
-				(errmsg("invalid resource manager ID %u at %X/%X",
-						record->xl_rmid, RecPtr->xlogid, RecPtr->xrecoff)));
-		goto next_record_is_invalid;
-	}
-	if (randAccess)
-	{
-		/*
-		 * We can't exactly verify the prev-link, but surely it should be less
-		 * than the record's own address.
-		 */
-		if (!XLByteLT(record->xl_prev, *RecPtr))
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
-							record->xl_prev.xlogid, record->xl_prev.xrecoff,
-							RecPtr->xlogid, RecPtr->xrecoff)));
-			goto next_record_is_invalid;
-		}
-	}
-	else
-	{
-		/*
-		 * Record's prev-link should exactly match our previous location. This
-		 * check guards against torn WAL pages where a stale but valid-looking
-		 * WAL record starts on a sector boundary.
-		 */
-		if (!XLByteEQ(record->xl_prev, ReadRecPtr))
-		{
-			ereport(emode_for_corrupt_record(emode, *RecPtr),
-					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
-							record->xl_prev.xlogid, record->xl_prev.xrecoff,
-							RecPtr->xlogid, RecPtr->xrecoff)));
-			goto next_record_is_invalid;
-		}
-	}
+	record = (XLogRecord *) (readBuf + RecPtr->xrecoff % XLOG_BLCKSZ);
+	total_len = record->xl_tot_len;
 
+	/* Make sure the record buffer can hold the whole record. */
 	/*
 	 * Allocate or enlarge readRecordBuf as needed.  To avoid useless small
 	 * increases, round its size to a multiple of XLOG_BLCKSZ, and make sure
@@ -3900,16 +3853,17 @@ retry:
 	 * enough for all "normal" records, but very large commit or abort records
 	 * might need more space.)
 	 */
-	total_len = record->xl_tot_len;
 	if (total_len > readRecordBufSize)
 	{
 		uint32		newSize = total_len;
 
 		newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
 		newSize = Max(newSize, 4 * Max(BLCKSZ, XLOG_BLCKSZ));
-		if (readRecordBuf)
-			free(readRecordBuf);
-		readRecordBuf = (char *) malloc(newSize);
+		if (!readRecordBuf)
+			readRecordBuf = (char *) malloc(newSize);
+		else
+			readRecordBuf = (char *) realloc(readRecordBuf, newSize);
+
 		if (!readRecordBuf)
 		{
 			readRecordBufSize = 0;
@@ -3922,7 +3876,19 @@ retry:
 		readRecordBufSize = newSize;
 	}
 
-	buffer = readRecordBuf;
+	/*
+	 * If we got the whole header already, validate it immediately. Otherwise
+	 * we validate it after reading the rest of the header from the next page.
+	 */
+	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+	{
+		if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
+			goto next_record_is_invalid;
+		gotheader = true;
+	}
+	else
+		gotheader = false;
+
 	len = XLOG_BLCKSZ - RecPtr->xrecoff % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
@@ -3930,16 +3896,19 @@ retry:
 		char	   *contrecord;
 		XLogPageHeader pageHeader;
 		XLogRecPtr	pagelsn;
-		uint32		gotlen = len;
+		char	   *buffer;
+		uint32		gotlen;
 
 		/* Initialize pagelsn to the beginning of the page this record is on */
 		pagelsn = *RecPtr;
 		pagelsn.xrecoff = (pagelsn.xrecoff / XLOG_BLCKSZ) * XLOG_BLCKSZ;
 
-		memcpy(buffer, record, len);
-		record = (XLogRecord *) buffer;
-		buffer += len;
-		for (;;)
+		/* Copy the first fragment of the record from the first page. */
+		memcpy(readRecordBuf, readBuf + RecPtr->xrecoff % XLOG_BLCKSZ, len);
+		buffer = readRecordBuf + len;
+		gotlen = len;
+
+		do
 		{
 			/* Calculate pointer to beginning of next page */
 			XLByteAdvance(pagelsn, XLOG_BLCKSZ);
@@ -3947,8 +3916,9 @@ retry:
 			if (!XLogPageRead(&pagelsn, emode, false, false))
 				return NULL;
 
-			/* Check that the continuation record looks valid */
-			if (!(((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD))
+			/* Check that the continuation on next page looks valid */
+			pageHeader = (XLogPageHeader) readBuf;
+			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				ereport(emode_for_corrupt_record(emode, *RecPtr),
 						(errmsg("there is no contrecord flag in log segment %s, offset %u",
@@ -3956,14 +3926,13 @@ retry:
 								readOff)));
 				goto next_record_is_invalid;
 			}
-			pageHeader = (XLogPageHeader) readBuf;
-			pageHeaderSize = XLogPageHeaderSize(pageHeader);
-			contrecord = (char *) readBuf + pageHeaderSize;
+			/*
+			 * Cross-check that xlp_rem_len agrees with how much of the record
+			 * we expect there to be left.
+			 */
 			if (pageHeader->xlp_rem_len == 0 ||
 				total_len != (pageHeader->xlp_rem_len + gotlen))
 			{
-				char fname[MAXFNAMELEN];
-				XLogFileName(fname, curFileTLI, readSegNo);
 				ereport(emode_for_corrupt_record(emode, *RecPtr),
 						(errmsg("invalid contrecord length %u in log segment %s, offset %u",
 								pageHeader->xlp_rem_len,
@@ -3971,17 +3940,28 @@ retry:
 								readOff)));
 				goto next_record_is_invalid;
 			}
+
+			/* Append the continuation from this page to the buffer */
+			pageHeaderSize = XLogPageHeaderSize(pageHeader);
+			contrecord = (char *) readBuf + pageHeaderSize;
 			len = XLOG_BLCKSZ - pageHeaderSize;
-			if (pageHeader->xlp_rem_len > len)
+			if (pageHeader->xlp_rem_len < len)
+				len = pageHeader->xlp_rem_len;
+			memcpy(buffer, (char *) contrecord, len);
+			buffer += len;
+			gotlen += len;
+
+			/* If we just reassembled the record header, validate it. */
+			if (!gotheader)
 			{
-				memcpy(buffer, (char *) contrecord, len);
-				gotlen += len;
-				buffer += len;
-				continue;
+				record = (XLogRecord *) readRecordBuf;
+				if (!ValidXLogRecordHeader(RecPtr, record, emode, randAccess))
+					goto next_record_is_invalid;
+				gotheader = true;
 			}
-			memcpy(buffer, (char *) contrecord, pageHeader->xlp_rem_len);
-			break;
-		}
+		} while (pageHeader->xlp_rem_len > len);
+
+		record = (XLogRecord *) readRecordBuf;
 		if (!RecordIsValid(record, *RecPtr, emode))
 			goto next_record_is_invalid;
 		pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
@@ -3990,18 +3970,18 @@ retry:
 			readOff + pageHeaderSize + MAXALIGN(pageHeader->xlp_rem_len),
 			EndRecPtr);
 		ReadRecPtr = *RecPtr;
-		/* needn't worry about XLOG SWITCH, it can't cross page boundaries */
-		return record;
 	}
+	else
+	{
+		/* Record does not cross a page boundary */
+		if (!RecordIsValid(record, *RecPtr, emode))
+			goto next_record_is_invalid;
+		EndRecPtr.xlogid = RecPtr->xlogid;
+		EndRecPtr.xrecoff = RecPtr->xrecoff + MAXALIGN(total_len);
 
-	/* Record does not cross a page boundary */
-	if (!RecordIsValid(record, *RecPtr, emode))
-		goto next_record_is_invalid;
-	EndRecPtr.xlogid = RecPtr->xlogid;
-	EndRecPtr.xrecoff = RecPtr->xrecoff + MAXALIGN(total_len);
-
-	ReadRecPtr = *RecPtr;
-	memcpy(buffer, record, total_len);
+		ReadRecPtr = *RecPtr;
+		memcpy(readRecordBuf, record, total_len);
+	}
 
 	/*
 	 * Special processing if it's an XLOG SWITCH record
@@ -4019,7 +3999,7 @@ retry:
 		 */
 		readOff = XLogSegSize - XLOG_BLCKSZ;
 	}
-	return (XLogRecord *) buffer;
+	return record;
 
 next_record_is_invalid:
 	failedSources |= readSource;
@@ -4044,7 +4024,7 @@ next_record_is_invalid:
  * ReadRecord.	It's not intended for use from anywhere else.
  */
 static bool
-ValidXLOGHeader(XLogPageHeader hdr, int emode)
+ValidXLogPageHeader(XLogPageHeader hdr, int emode)
 {
 	XLogRecPtr	recaddr;
 
@@ -4163,6 +4143,88 @@ ValidXLOGHeader(XLogPageHeader hdr, int emode)
 }
 
 /*
+ * Validate an XLOG record header.
+ *
+ * This is just a convenience subroutine to avoid duplicated code in
+ * ReadRecord.	It's not intended for use from anywhere else.
+ */
+static bool
+ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
+					  bool randAccess)
+{
+	/*
+	 * xl_len == 0 is bad data for everything except XLOG SWITCH, where it is
+	 * required.
+	 */
+	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
+	{
+		if (record->xl_len != 0)
+		{
+			ereport(emode_for_corrupt_record(emode, *RecPtr),
+					(errmsg("invalid xlog switch record at %X/%X",
+							RecPtr->xlogid, RecPtr->xrecoff)));
+			return false;
+		}
+	}
+	else if (record->xl_len == 0)
+	{
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
+				(errmsg("record with zero length at %X/%X",
+						RecPtr->xlogid, RecPtr->xrecoff)));
+		return false;
+	}
+	if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
+		record->xl_tot_len > SizeOfXLogRecord + record->xl_len +
+		XLR_MAX_BKP_BLOCKS * (sizeof(BkpBlock) + BLCKSZ))
+	{
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
+				(errmsg("invalid record length at %X/%X",
+						RecPtr->xlogid, RecPtr->xrecoff)));
+		return false;
+	}
+	if (record->xl_rmid > RM_MAX_ID)
+	{
+		ereport(emode_for_corrupt_record(emode, *RecPtr),
+				(errmsg("invalid resource manager ID %u at %X/%X",
+						record->xl_rmid, RecPtr->xlogid, RecPtr->xrecoff)));
+		return false;
+	}
+	if (randAccess)
+	{
+		/*
+		 * We can't exactly verify the prev-link, but surely it should be less
+		 * than the record's own address.
+		 */
+		if (!XLByteLT(record->xl_prev, *RecPtr))
+		{
+			ereport(emode_for_corrupt_record(emode, *RecPtr),
+					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
+							record->xl_prev.xlogid, record->xl_prev.xrecoff,
+							RecPtr->xlogid, RecPtr->xrecoff)));
+			return false;
+		}
+	}
+	else
+	{
+		/*
+		 * Record's prev-link should exactly match our previous location. This
+		 * check guards against torn WAL pages where a stale but valid-looking
+		 * WAL record starts on a sector boundary.
+		 */
+		if (!XLByteEQ(record->xl_prev, ReadRecPtr))
+		{
+			ereport(emode_for_corrupt_record(emode, *RecPtr),
+					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
+							record->xl_prev.xlogid, record->xl_prev.xrecoff,
+							RecPtr->xlogid, RecPtr->xrecoff)));
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/*
  * Try to read a timeline's history file.
  *
  * If successful, return the list of component TLIs (the given TLI followed by
@@ -5171,8 +5233,7 @@ BootStrapXLOG(void)
 
 	INIT_CRC32(crc);
 	COMP_CRC32(crc, &checkPoint, sizeof(checkPoint));
-	COMP_CRC32(crc, (char *) record + sizeof(pg_crc32),
-			   SizeOfXLogRecord - sizeof(pg_crc32));
+	COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32(crc);
 	record->xl_crc = crc;
 
@@ -7707,7 +7768,7 @@ CreateCheckPoint(int flags)
 	 * checkpoint, even though physically before it.  Got that?
 	 */
 	freespace = INSERT_FREESPACE(Insert);
-	if (freespace < SizeOfXLogRecord)
+	if (freespace == 0)
 	{
 		(void) AdvanceXLInsertBuffer(false);
 		/* OK to ignore update return flag, since we will do flush anyway */
@@ -10269,7 +10330,7 @@ retry:
 							fname, readOff)));
 			goto next_record_is_invalid;
 		}
-		if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+		if (!ValidXLogPageHeader((XLogPageHeader) readBuf, emode))
 			goto next_record_is_invalid;
 	}
 
@@ -10295,7 +10356,7 @@ retry:
 				fname, readOff)));
 		goto next_record_is_invalid;
 	}
-	if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode))
+	if (!ValidXLogPageHeader((XLogPageHeader) readBuf, emode))
 		goto next_record_is_invalid;
 
 	Assert(targetSegNo == readSegNo);
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 0012cff..15f2b27 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -942,8 +942,7 @@ WriteEmptyXLOG(void)
 
 	INIT_CRC32(crc);
 	COMP_CRC32(crc, &ControlFile.checkPointCopy, sizeof(CheckPoint));
-	COMP_CRC32(crc, (char *) record + sizeof(pg_crc32),
-			   SizeOfXLogRecord - sizeof(pg_crc32));
+	COMP_CRC32(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32(crc);
 	record->xl_crc = crc;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index b581910..ec79870 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -40,15 +40,16 @@
  */
 typedef struct XLogRecord
 {
-	pg_crc32	xl_crc;			/* CRC for this record */
-	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
-	TransactionId xl_xid;		/* xact id */
 	uint32		xl_tot_len;		/* total len of entire record */
+	TransactionId xl_xid;		/* xact id */
 	uint32		xl_len;			/* total len of rmgr data */
 	uint8		xl_info;		/* flag bits, see below */
 	RmgrId		xl_rmid;		/* resource manager for this record */
+	/* 2 bytes of padding here, initialize to zero */
+	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
+	pg_crc32	xl_crc;			/* CRC for this record */
 
-	/* Depending on MAXALIGN, there are either 2 or 6 wasted bytes here */
+	/* If MAXALIGN==8, there are 4 wasted bytes here */
 
 	/* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 239b749..a958856 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -63,9 +63,7 @@ typedef struct XLogPageHeaderData
 	/*
 	 * When there is not enough space on current page for whole record, we
 	 * continue on the next page.  xlp_rem_len is the number of bytes
-	 * remaining from a previous page. (However, the XLogRecord header will
-	 * never be split across pages; if there's less than SizeOfXLogRecord
-	 * space left at the end of a page, we just waste it.)
+	 * remaining from a previous page.
 	 *
 	 * Note that xl_rem_len includes backup-block data; that is, it tracks
 	 * xl_tot_len not xl_len in the initial header.  Also note that the

4-WIP-xloginsert-scale.patchtext/x-diff; name=4-WIP-xloginsert-scale.patchDownload

commit 83b1e4fcd74b4dd6c6992395f21e4fe606c8e80d
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Thu Jun 14 23:53:17 2012 +0300

    Rebase code from xloginsert-noslots branch.
    
    This is based on xloginsert-scale18.patch, but instead of slots, use the
    xl_rem_len to indicate that a record has been fully written.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3f5e0b2..0d0e799 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
 #include "postmaster/startup.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -261,14 +262,26 @@ XLogRecPtr	XactLastRecEnd = {0, 0};
  * (which is almost but not quite the same as a pointer to the most recent
  * CHECKPOINT record).	We update this from the shared-memory copy,
  * XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
- * hold the Insert lock).  See XLogInsert for details.	We are also allowed
- * to update from XLogCtl->Insert.RedoRecPtr if we hold the info_lck;
+ * hold the insertpos lock).  See XLogInsert for details.	We are also allowed
+ * to update from XLogCtl->RedoRecPtr if we hold the info_lck;
  * see GetRedoRecPtr.  A freshly spawned backend obtains the value during
  * InitXLOGAccess.
  */
 static XLogRecPtr RedoRecPtr;
 
 /*
+ * doPageWrites is this backend's local copy of the Insert->fullPageWrites ||
+ * Insert->forcePageWrites. It is refreshed at every insertion.
+ */
+static bool doPageWrites;
+
+/*
+ * FinalizedUpto is this backend's local copy of XLogCtl->Insert.FinalizedUpto.
+ * Everything before this is CRC'd and ready for writing out.
+ */
+static XLogRecPtr FinalizedUpto = { 0, 0 };
+
+/*
  * RedoStartLSN points to the checkpoint's REDO location which is specified
  * in a backup label file, backup history file or control file. In standby
  * mode, XLOG streaming usually starts from the position where an invalid
@@ -300,10 +313,15 @@ static XLogRecPtr RedoStartLSN = {0, 0};
  * (protected by info_lck), but we don't need to cache any copies of it.
  *
  * info_lck is only held long enough to read/update the protected variables,
- * so it's a plain spinlock.  The other locks are held longer (potentially
- * over I/O operations), so we use LWLocks for them.  These locks are:
+ * so it's a plain spinlock.  insertpos_lck protects the current logical
+ * insert location, ie. the head of reserved WAL space.  The other locks are
+ * held longer (potentially over I/O operations), so we use LWLocks for them.
+ * These locks are:
  *
- * WALInsertLock: must be held to insert a record into the WAL buffers.
+ * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
+ * This is only held while initializing and changing the mapping. If the
+ * contents of the buffer being replaced haven't been written yet, the mapping
+ * lock is released while the write is done, and reacquired afterwards.
  *
  * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
  * XLogFlush).
@@ -315,6 +333,93 @@ static XLogRecPtr RedoStartLSN = {0, 0};
  * only one checkpointer at a time; currently, with all checkpoints done by
  * the checkpointer, this is just pro forma).
  *
+ * WALInsertShareLocks: This lock is partitioned into multiple lwlocks. To
+ * hold it in share mode, it's enough to hold any of the lwlocks in share mode,
+ * but to hold it in exclusive mode, you must grab all the lwlocks.  It must
+ * be held in share-mode while inserting a new XLOG record, and in exclusive
+ * mode when changing RedoRecPtr or fullPageWrites. Those fields determine
+ * whether full-page images are included in a record, and they change very
+ * seldom, so we prefer to be fast and non-contended when they need to be
+ * read, and slow when they're changed.
+ *
+ *
+ * Inserting a new WAL record is a three-step process:
+ *
+ * 1. Reserve the right amount of space from the WAL. The current head of
+ *    reserved space is kept in Insert->CurrBytePos, and is protected by
+ *    insertpos_lck. Try to keep this section as short as possible,
+ *    insertpos_lck can be heavily contended on a busy system.
+ *
+ * 2. Copy the record to the reserved WAL space. This involves finding the
+ *    correct WAL buffer containing the reserved space, and copying the
+ *    record in place. This can be done concurrently in multiple processes.
+ *
+ * 3. Finalize the record by filling in xl_prev, and updating the CRC with it.
+ *    This can be done by another process, long after step 2. This only needs
+ *    to be done just before the record is flushed to disk, so it's done in
+ *    bulk at that point.
+ *
+ * To allow as much parallelism as possible, the conteneded portion of step 1
+ * is performed while only holding a spinlock. The duration the spinlock
+ * needs to be held is minimized by minimizing the calculations that have to
+ * be done while holding the lock. The current tip of reserved WAL is kept
+ * in CurrBytePos, as a byte position that only counts "usable" bytes in WAL,
+ * that is, it excludes all WAL page headers. The mapping between "usable" byte
+ * positions and physical positions (XLogRecPtrs) can be done outside the
+ * locked region, and because the usable byte position doesn't include any
+ * headers, reserving X bytes from WAL is simply "CurrBytePos += X". On
+ * platforms that have an atomic 64-bit fetch-and-add instruction, we don't
+ * even need a spinlock (XXX: not implemented yet - ATM spinlock is always
+ * used).
+ *
+ * Step 2 can usually be done completely in parallel. If the required WAL
+ * page is not initialized yet, you have to grab WALBufMappingLock to
+ * initialize it, but we pre-initialize WAL buffers in the WAL writer to
+ * avoid that from happening in the critical path.
+ *
+ * In step 2, the xl_prev field is left at 0/0, because even though we've
+ * reserved a slice of WAL space for the record, we don't know where the
+ * previous record began. We could keep track of that along with CurrBytePos,
+ * in step 1, but then it would no longer be possible to implement it with
+ * an atomic fetch-and-add instruction. So at step 3, we finalize all the
+ * records by filling in xl_prev, and calculating the final CRC that includes
+ * xl_prev as well. Finalization starts from the end of the last finalized
+ * records, and walks the chain of WAL records until it hits a record with
+ * xl_tot_len == 0. Setting xl_tot_len is a sign that the record is fully
+ * written - a memory barrier ensures that xl_tot_len is not seen by other
+ * processes before the rest of the record. If the record doesn't fit on the
+ * page, setting xl_tot_len indicates that the record is fully written up to
+ * the page boundary, and on the next page, setting XLP_FIRST_IS_CONTRECORD
+ * acts as a signal that the continued part is fully written to the page.
+ *
+ * XXX: There is currently no good mechanism to wait for step 2 of an
+ * insertion to finish. Step 3 busy-loops. In the previous version of this
+ * patch, which used "insertion slots", the slot included a linked list of
+ * PGPROCs waiting for the slot to finish inserting, similar to LWLocks.
+ * We'll probably need to add something like that, busy-waiting is not good.
+ *
+ *
+ * Deadlock analysis
+ * -----------------
+ *
+ * It's important to call WaitXLogInsertionsToFinish() *before* acquiring
+ * WALWriteLock. Otherwise you might get stuck waiting for an insertion to
+ * finish (or at least advance to next uninitialized page), while you're
+ * holding WALWriteLock. That would be bad, because the backend you're waiting
+ * for might need to acquire WALWriteLock, too, to evict an old buffer, so
+ * you'd get deadlock.
+ *
+ * WaitXLogInsertionsToFinish() will not get stuck indefinitely, as long as
+ * it's called with a location that's known to be already allocated in the WAL
+ * buffers. Calling it with the position of a record you've already inserted
+ * satisfies that condition, so the common pattern:
+ *
+ *   recptr = XLogInsert(...)
+ *   XLogFlush(recptr)
+ *
+ * is safe. It can't get stuck, because an insertion to a WAL page that's
+ * already initialized in cache can always proceed without waiting on a lock.
+ *
  *----------
  */
 
@@ -335,12 +440,26 @@ typedef struct XLogwrtResult
  */
 typedef struct XLogCtlInsert
 {
-	XLogRecPtr	PrevRecord;		/* start of previously-inserted record */
-	int			curridx;		/* current block index in cache */
-	XLogPageHeader currpage;	/* points to header of block in cache */
-	char	   *currpos;		/* current insertion point in cache */
-	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
-	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
+	slock_t		insertpos_lck;	/* protects CurrBytePos */
+
+	/*
+	 * CurrBytePos is the very tip of the reserved WAL space at the moment.
+	 * The next record will be inserted there.
+	 */
+	uint64		CurrBytePos;
+
+	/*
+	 * These fields track the progress of record finalization. FinalizedUpto
+	 * points to the end of fully finalized portion - everything before it
+	 * is ready to be written to disk. LastFinalizedRecord points to the
+	 * beginning of the last finalized record. When the next record is
+	 * finalized, it is written to the xl_prev of the next record. If
+	 * ExpectingContRecord is true, we are stopped at a page boundary, in the
+	 * middle of a WAL record. These fields are protected by WALInsertTailLock.
+	 */
+	XLogRecPtr	FinalizedUpto;
+	XLogRecPtr	LastFinalizedRecord;
+	bool		ExpectingContRecord;
 
 	/*
 	 * fullPageWrites is the master copy used by all backends to determine
@@ -348,7 +467,11 @@ typedef struct XLogCtlInsert
 	 * one. This is required because, when full_page_writes is changed
 	 * by SIGHUP, we must WAL-log it before it actually affects
 	 * WAL-logging by backends. Checkpointer sets at startup or after SIGHUP.
+	 *
+	 * These fields are protected by WALInsertShareLocks.
 	 */
+	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
+	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
 	bool		fullPageWrites;
 
 	/*
@@ -372,16 +495,21 @@ typedef struct XLogCtlWrite
 	pg_time_t	lastSegSwitchTime;		/* time of last xlog segment switch */
 } XLogCtlWrite;
 
+
 /*
  * Total shared-memory state for XLOG.
  */
 typedef struct XLogCtlData
 {
-	/* Protected by WALInsertLock: */
+	/*
+	 * Note: Insert must be the first field in the struct or it won't be
+	 * aligned to a cache-line boundary like we want it to be.
+	 */
 	XLogCtlInsert Insert;
 
 	/* Protected by info_lck: */
 	XLogwrtRqst LogwrtRqst;
+	XLogRecPtr	RedoRecPtr;		/* a recent copy of Insert->RedoRecPtr */
 	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
 	TransactionId ckptXid;
 	XLogRecPtr	asyncXactLSN;	/* LSN of newest async commit/abort */
@@ -397,9 +525,18 @@ typedef struct XLogCtlData
 	XLogwrtResult LogwrtResult;
 
 	/*
+	 * To change curridx and the identity of a buffer, you need to hold
+	 * WALBufMappingLock.  To change the identity of a buffer that's still
+	 * dirty, the old page needs to be written out first, and for that you
+	 * need WALWriteLock, and you need to ensure that there's no in-progress
+	 * insertions to the page by calling WaitXLogInsertionsToFinish().
+	 */
+	int			curridx;		/* latest initialized block index in cache */
+
+	/*
 	 * These values do not change after startup, although the pointed-to pages
 	 * and xlblocks values certainly do.  Permission to read/write the pages
-	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
+	 * and xlblocks values depends on WALBufMappingLock and WALWriteLock.
 	 */
 	char	   *pages;			/* buffers for unwritten XLOG pages */
 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
@@ -479,30 +616,37 @@ static XLogCtlData *XLogCtl = NULL;
 static ControlFileData *ControlFile = NULL;
 
 /*
- * Macros for managing XLogInsert state.  In most cases, the calling routine
- * has local copies of XLogCtl->Insert and/or XLogCtl->Insert->curridx,
- * so these are passed as parameters instead of being fetched via XLogCtl.
+ * Calculate the amount of space left on the page after 'endptr'.
+ * Beware multiple evaluation!
  */
+#define INSERT_FREESPACE(endptr)	\
+	(((endptr).xrecoff % XLOG_BLCKSZ == 0) ? 0 : (XLOG_BLCKSZ - (endptr).xrecoff % XLOG_BLCKSZ))
 
-/* Free space remaining in the current xlog page buffer */
-#define INSERT_FREESPACE(Insert)  \
-	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
+/*
+ * Macros to advance to next buffer index and insertion slot.
+ */
+#define NextBufIdx(idx)		\
+		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
 
-/* Construct XLogRecPtr value for current insertion point */
-#define INSERT_RECPTR(recptr,Insert,curridx)  \
-	do {																\
-		(recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid;			\
-		(recptr).xrecoff =												\
-			XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert); \
-		if (XLogCtl->xlblocks[curridx].xrecoff == 0)					\
-			(recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid - 1;	\
-	} while(0)
+/*
+ * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
+ * would hold if it was in cache, the page containing 'recptr'.
+ *
+ * XLogRecEndPtrToBufIdx is the same, but a pointer to the first byte of a
+ * page is taken to mean the previous page.
+ */
+#define XLogRecPtrToBufIdx(recptr)	\
+	(((((((uint64) (recptr).xlogid) << 32) + (recptr).xrecoff)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
 
-#define PrevBufIdx(idx)		\
-		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
+#define XLogRecEndPtrToBufIdx(recptr)	\
+	(((((((uint64) (recptr).xlogid) << 32) + (recptr).xrecoff - 1)) / XLOG_BLCKSZ) % (XLogCtl->XLogCacheBlck + 1))
 
-#define NextBufIdx(idx)		\
-		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
+/*
+ * These are the number of bytes usable in a WAL page and segment, excluding
+ * page headers.
+ */
+#define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
+#define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - (SizeOfXLogLongPHD - SizeOfXLogShortPHD))
 
 /*
  * Private, possibly out-of-date copy of shared LogwrtResult.
@@ -625,9 +769,9 @@ static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 
 static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
 				XLogRecPtr *lsn, BkpBlock *bkpb);
-static bool AdvanceXLInsertBuffer(bool new_segment);
+static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
-static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch);
+static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, int *max_advance,
 					   bool use_lock);
@@ -674,6 +818,75 @@ static bool read_backup_label(XLogRecPtr *checkPointLoc,
 static void rm_redo_error_callback(void *arg);
 static int	get_sync_bit(int method);
 
+static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+				  XLogRecData *rdata,
+				  XLogRecPtr StartPos, XLogRecPtr EndPos);
+static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
+						  XLogRecPtr *EndPos);
+static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos);
+static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+static char *GetXLogBuffer(XLogRecPtr ptr, bool failok);
+static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
+static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
+static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
+
+/*
+ * Equivalent of LWLockAcquire() for the partitioned WALInsertShareLock.
+ */
+static void
+WALInsertLockAcquire(LWLockMode mode)
+{
+	int lockid;
+
+	if (mode == LW_EXCLUSIVE)
+	{
+		/*
+		 * To acquire the lock in exclusive mode, need to hold all the
+		 * partition locks.
+		 */
+		for (lockid = FirstWALInsertShareLock; lockid <= LastWALInsertShareLock; lockid++)
+		{
+			LWLockAcquire(lockid, LW_EXCLUSIVE);
+		}
+	}
+	else
+	{
+		/*
+		 * Grab one of the partitioned locks. It doesn't matter which one,
+		 * but to avoid contention, it's good if different processes choose
+		 * different locks.
+		 */
+		lockid = FirstWALInsertShareLock +
+			(MyProc->pgprocno % (LastWALInsertShareLock - FirstWALInsertShareLock + 1));
+		LWLockAcquire(lockid, LW_SHARED);
+	}
+}
+
+/*
+ * Equivalent of LWLockRelease() for the partitioned WALInsertShareLock.
+ */
+static void
+WALInsertLockRelease(LWLockMode mode)
+{
+	int lockid;
+
+	if (mode == LW_EXCLUSIVE)
+	{
+		for (lockid = FirstWALInsertShareLock; lockid <= LastWALInsertShareLock; lockid++)
+		{
+			LWLockRelease(lockid);
+		}
+	}
+	else
+	{
+		/*
+		 * this calculation better match the one used when the lock was
+		 * acquired.
+		 */
+		lockid = FirstWALInsertShareLock + (MyProc->pgprocno % (LastWALInsertShareLock - FirstWALInsertShareLock + 1));
+		LWLockRelease(lockid);
+	}
+}
 
 /*
  * Insert an XLOG record having the specified RMID and info bytes,
@@ -693,11 +906,7 @@ static int	get_sync_bit(int method);
 XLogRecPtr
 XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	XLogRecPtr	RecPtr;
-	XLogRecPtr	WriteRqst;
-	uint32		freespace;
-	int			curridx;
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
 	XLogRecData *rdt;
 	XLogRecData *rdt_lastnormal;
 	Buffer		dtbuf[XLR_MAX_BKP_BLOCKS];
@@ -712,12 +921,14 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	uint32		len,
 				write_len;
 	unsigned	i;
-	bool		updrqst;
-	bool		doPageWrites;
 	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
-	uint8		info_orig = info;
 	static XLogRecord *rechdr;
+	XLogRecPtr	StartPos;
+	XLogRecPtr	EndPos;
 
+	/*
+	 * On the first call, allocate a buffer to hold the xlog record.
+	 */
 	if (rechdr == NULL)
 	{
 		rechdr = malloc(SizeOfXLogRecord);
@@ -742,40 +953,33 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	 */
 	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
 	{
-		RecPtr.xlogid = 0;
-		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
-		return RecPtr;
+		EndPos.xlogid = 0;
+		EndPos.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
+		return EndPos;
 	}
 
 	/*
 	 * Here we scan the rdata chain, to determine which buffers must be backed
 	 * up.
 	 *
-	 * We may have to loop back to here if a race condition is detected below.
-	 * We could prevent the race by doing all this work while holding the
-	 * insert lock, but it seems better to avoid doing CRC calculations while
-	 * holding the lock.
-	 *
 	 * We add entries for backup blocks to the chain, so that they don't
 	 * need any special treatment in the critical section where the chunks are
-	 * copied into the WAL buffers. Those entries have to be unlinked from the
-	 * chain if we have to loop back here.
+	 * copied into the WAL buffers.
+	 *
+	 * First acquire WALInsertShareLock, to prevent RedoRecPtr and
+	 * force/fullPageWrites flags from changing.
 	 */
-begin:;
+	WALInsertLockAcquire(isLogSwitch ? LW_EXCLUSIVE : LW_SHARED);
+
+	doPageWrites = Insert->forcePageWrites || Insert->fullPageWrites;
+	RedoRecPtr = Insert->RedoRecPtr;
+
 	for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
 	{
 		dtbuf[i] = InvalidBuffer;
 		dtbuf_bkp[i] = false;
 	}
 
-	/*
-	 * Decide if we need to do full-page writes in this XLOG record: true if
-	 * full_page_writes is on or we have a PITR request for it.  Since we
-	 * don't yet have the insert lock, fullPageWrites and forcePageWrites
-	 * could change under us, but we'll recheck them once we have the lock.
-	 */
-	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
-
 	len = 0;
 	for (rdt = rdata;;)
 	{
@@ -831,8 +1035,7 @@ begin:;
 	 * NOTE: We disallow len == 0 because it provides a useful bit of extra
 	 * error checking in ReadRecord.  This means that all callers of
 	 * XLogInsert must supply at least some not-in-a-buffer data.  However, we
-	 * make an exception for XLOG SWITCH records because we don't want them to
-	 * ever cross a segment boundary.
+	 * make an exception for XLOG SWITCH records.
 	 */
 	if (len == 0 && !isLogSwitch)
 		elog(PANIC, "invalid xlog record length %u", len);
@@ -840,9 +1043,7 @@ begin:;
 	/*
 	 * Make additional rdata chain entries for the backup blocks, so that we
 	 * don't need to special-case them in the write loop.  This modifies the
-	 * original rdata chain, but we keep a pointer to the last regular entry,
-	 * rdt_lastnormal, so that we can undo this if we have to loop back to the
-	 * beginning.
+	 * original rdata chain.
 	 *
 	 * At the exit of this loop, write_len includes the backup block data.
 	 *
@@ -912,15 +1113,23 @@ begin:;
 		COMP_CRC32(rdata_crc, rdt->data, rdt->len);
 
 	/*
-	 * Construct record header (prev-link and CRC are filled in later), and
-	 * make that the first chunk in the chain.
+	 * Construct record header (prev-link is filled in later, in record
+	 * finalization), and make that the first chunk in the chain.
 	 */
 	rechdr->xl_xid = GetCurrentTransactionIdIfAny();
 	rechdr->xl_tot_len = SizeOfXLogRecord + write_len;
 	rechdr->xl_len = len;		/* doesn't include backup blocks */
 	rechdr->xl_info = info;
 	rechdr->xl_rmid = rmid;
+	rechdr->xl_prev = InvalidXLogRecPtr;
+	/*
+	 * The CRC calculated here doesn't include the correct prev-link yet.
+	 * It will be updated in record finalization.
+	 */
+	COMP_CRC32(rdata_crc, ((char *) rechdr), offsetof(XLogRecord, xl_prev));
+	rechdr->xl_crc = rdata_crc;
 
+	/* Make the record header the first chunk in the chain */
 	hdr_rdt.next = rdata;
 	hdr_rdt.data = (char *) rechdr;
 	hdr_rdt.len = SizeOfXLogRecord;
@@ -929,118 +1138,82 @@ begin:;
 
 	START_CRIT_SECTION();
 
-	/* Now wait to get insert lock */
-	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-
 	/*
-	 * Check to see if my RedoRecPtr is out of date.  If so, may have to go
-	 * back and recompute everything.  This can only happen just after a
-	 * checkpoint, so it's better to be slow in this case and fast otherwise.
-	 *
-	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
-	 * affect the contents of the XLOG record, so we'll update our local copy
-	 * but not force a recomputation.
+	 * Reserve space for the record from the WAL, and copy the record there.
 	 */
-	if (!XLByteEQ(RedoRecPtr, Insert->RedoRecPtr))
+	if (isLogSwitch)
 	{
-		Assert(XLByteLT(RedoRecPtr, Insert->RedoRecPtr));
-		RedoRecPtr = Insert->RedoRecPtr;
+		if (ReserveXLogSwitch(&StartPos, &EndPos))
+		{
+			WaitXLogInsertionsToFinish(StartPos);
 
-		if (doPageWrites)
+			CopyXLogRecordToWAL(write_len, isLogSwitch, &hdr_rdt,
+								StartPos, EndPos);
+		}
+		else
 		{
-			for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
-			{
-				if (dtbuf[i] == InvalidBuffer)
-					continue;
-				if (dtbuf_bkp[i] == false &&
-					XLByteLE(dtbuf_lsn[i], RedoRecPtr))
-				{
-					/*
-					 * Oops, this buffer now needs to be backed up, but we
-					 * didn't think so above.  Start over.
-					 */
-					LWLockRelease(WALInsertLock);
-					END_CRIT_SECTION();
-					rdt_lastnormal->next = NULL;
-					info = info_orig;
-					goto begin;
-				}
-			}
+			/*
+			 * The current insert location was already exactly at the beginning
+			 * of a segment, so there's no need to switch.
+			 */
 		}
 	}
-
-	/*
-	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
-	 * if we weren't already doing full-page writes then go back and recompute.
-	 * (If it was just turned off, we could recompute the record without full pages,
-	 * but we choose not to bother.)
-	 */
-	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
+	else
 	{
-		/* Oops, must redo it with full-page data. */
-		LWLockRelease(WALInsertLock);
-		END_CRIT_SECTION();
-		rdt_lastnormal->next = NULL;
-		info = info_orig;
-		goto begin;
+		ReserveXLogInsertLocation(write_len, &StartPos, &EndPos);
+
+		/* And copy the record there. */
+		CopyXLogRecordToWAL(write_len, isLogSwitch, &hdr_rdt, StartPos, EndPos);
 	}
+	END_CRIT_SECTION();
+
+	WALInsertLockRelease(isLogSwitch ? LW_EXCLUSIVE : LW_SHARED);
 
 	/*
-	 * If the current page is completely full, the record goes to the next
-	 * page, right after the page header.
+	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
 	 */
-	updrqst = false;
-	freespace = INSERT_FREESPACE(Insert);
-	if (freespace == 0)
+	if (StartPos.xrecoff / XLOG_BLCKSZ != EndPos.xrecoff / XLOG_BLCKSZ)
 	{
-		updrqst = AdvanceXLInsertBuffer(false);
-		freespace = INSERT_FREESPACE(Insert);
-	}
+		/* use volatile pointer to prevent code rearrangement */
+		volatile XLogCtlData *xlogctl = XLogCtl;
 
-	/* Compute record's XLOG location */
-	curridx = Insert->curridx;
-	INSERT_RECPTR(RecPtr, Insert, curridx);
+		SpinLockAcquire(&xlogctl->info_lck);
+		/* advance global request to include new block(s) */
+		if (XLByteLT(xlogctl->LogwrtRqst.Write, EndPos))
+			xlogctl->LogwrtRqst.Write = EndPos;
+		/* update local result copy while I have the chance */
+		LogwrtResult = xlogctl->LogwrtResult;
+		SpinLockRelease(&xlogctl->info_lck);
+	}
 
 	/*
-	 * If the record is an XLOG_SWITCH, and we are exactly at the start of a
-	 * segment, we need not insert it (and don't want to because we'd like
-	 * consecutive switch requests to be no-ops).  Instead, make sure
-	 * everything is written and flushed through the end of the prior segment,
-	 * and return the prior segment's end address.
+	 * If this was an XLOG_SWITCH record, flush the record and the empty
+	 * padding space that fills the rest of the segment, and perform
+	 * end-of-segment actions (eg, notifying archiver).
 	 */
-	if (isLogSwitch &&
-		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
+	if (isLogSwitch)
 	{
-		/* We can release insert lock immediately */
-		LWLockRelease(WALInsertLock);
-
-		RecPtr.xrecoff -= SizeOfXLogLongPHD;
-
-		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-		LogwrtResult = XLogCtl->LogwrtResult;
-		if (!XLByteLE(RecPtr, LogwrtResult.Flush))
+		TRACE_POSTGRESQL_XLOG_SWITCH();
+		XLogFlush(EndPos);
+		/*
+		 * Even though we reserved the rest of the segment for us, which is
+		 * reflected in EndPos, we return a pointer to just the end of the
+		 * xlog-switch record.
+		 */
+		if (StartPos.xrecoff % XLOG_SEG_SIZE != 0)
 		{
-			XLogwrtRqst FlushRqst;
-
-			FlushRqst.Write = RecPtr;
-			FlushRqst.Flush = RecPtr;
-			XLogWrite(FlushRqst, false, false);
+			EndPos = StartPos;
+			XLByteAdvance(EndPos, SizeOfXLogRecord);
+			if (StartPos.xrecoff / XLOG_BLCKSZ != EndPos.xrecoff / XLOG_BLCKSZ)
+			{
+				if (EndPos.xrecoff % XLOG_SEG_SIZE == EndPos.xrecoff % XLOG_BLCKSZ)
+					EndPos.xrecoff += SizeOfXLogLongPHD;
+				else
+					EndPos.xrecoff += SizeOfXLogShortPHD;
+			}
 		}
-		LWLockRelease(WALWriteLock);
-
-		END_CRIT_SECTION();
-
-		return RecPtr;
 	}
 
-	/* Finish the record header */
-	rechdr->xl_prev = Insert->PrevRecord;
-
-	/* Now we can finish computing the record's CRC */
-	COMP_CRC32(rdata_crc, (char *) rechdr, offsetof(XLogRecord, xl_crc));
-	FIN_CRC32(rdata_crc);
-	rechdr->xl_crc = rdata_crc;
-
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -1048,7 +1221,7 @@ begin:;
 
 		initStringInfo(&buf);
 		appendStringInfo(&buf, "INSERT @ %X/%X: ",
-						 RecPtr.xlogid, RecPtr.xrecoff);
+						 EndPos.xlogid, EndPos.xrecoff);
 		xlog_outrec(&buf, rechdr);
 		if (rdata->data != NULL)
 		{
@@ -1060,165 +1233,741 @@ begin:;
 	}
 #endif
 
-	/* Record begin of record in appropriate places */
-	ProcLastRecPtr = RecPtr;
-	Insert->PrevRecord = RecPtr;
+	/*
+	 * Update our global variables
+	 */
+	ProcLastRecPtr = StartPos;
+	XactLastRecEnd = EndPos;
+
+	return EndPos;
+}
+
+/*
+ * Subroutine of XLogInsert.  Copies a WAL record to an already-reserved
+ * area in the WAL.
+ */
+static void
+CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
+					XLogRecData *rdata,
+					XLogRecPtr StartPos, XLogRecPtr EndPos)
+{
+	char	   *currpos;
+	int			freespace;
+	int			written;
+	XLogRecPtr	CurrPos;
+	XLogRecord *rechdr;
+	uint32	   *tot_len_p;
+	bool		firstpage = true;
+	XLogPageHeader pagehdr = NULL;
+
+	/* The first chunk should be the record header */
+	rechdr = (XLogRecord *) rdata->data;
+	Assert(rdata->len == SizeOfXLogRecord);
 
 	/*
-	 * Append the data, including backup blocks if any
+	 * When we write the record, we initially leave xl_tot_len at zero,
+	 * and set it to the correct value only after copying the rest of the
+	 * record in place. That way when a process sees that xl_tot_len is set,
+	 * it knows that the record is fully copied in place (or the part that
+	 * fits on this page, anyway).
 	 */
-	rdata = &hdr_rdt;
-	while (write_len)
+	Assert(rechdr->xl_tot_len == write_len);
+	rechdr->xl_tot_len = 0;
+
+	/* Get the right WAL page to start inserting to */
+	CurrPos = StartPos;
+	currpos = GetXLogBuffer(CurrPos, false);
+	freespace = INSERT_FREESPACE(CurrPos);
+
+	/*
+	 * there should be enough space for at least the first field (xl_tot_len)
+	 * on this page.
+	 */
+	Assert(freespace >= sizeof(uint32));
+	tot_len_p = (uint32 *) currpos;
+
+	/* Copy record data */
+	written = 0;
+	while (rdata != NULL)
 	{
-		while (rdata->data == NULL)
-			rdata = rdata->next;
+		char	   *rdata_data = rdata->data;
+		int			rdata_len = rdata->len;
 
-		if (freespace > 0)
+		while (rdata_len > freespace)
 		{
-			if (rdata->len > freespace)
+			/*
+			 * Write what fits on this page, and continue on the next page.
+			 */
+			Assert (((uint64) currpos) % XLOG_BLCKSZ >= SizeOfXLogShortPHD || freespace == 0);
+			memcpy(currpos, rdata_data, freespace);
+			rdata_data += freespace;
+			rdata_len -= freespace;
+			written += freespace;
+			XLByteAdvance(CurrPos, freespace);
+
+			/*
+			 * Before we step to the next page, let others know that we're done
+			 * copying to this page, by setting xl_tot_len (or
+			 * XLP_FIRST_IS_CONT_RECORD, if we're continuing from previous
+			 * page).
+			 */
+			pg_write_barrier();
+			if (firstpage)
 			{
-				memcpy(Insert->currpos, rdata->data, freespace);
-				rdata->data += freespace;
-				rdata->len -= freespace;
-				write_len -= freespace;
+				*tot_len_p = write_len;
+				firstpage = false;
 			}
 			else
+				pagehdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
+
+			/*
+			 * Get pointer to beginning of next page, and set the xlp_rem_len
+			 * in the page header. We don't set XLP_FIRST_IS_CONTRECORD yet,
+			 * that is used to signal that we're done copying, so it's done
+			 * last.
+			 *
+			 * It's safe to set the contrecord flag  and xlp_rem_len without a
+			 * lock on the page. All the other flags were already set when the
+			 * page was initialized, in AdvanceXLInsertBuffer, and we're the
+			 * only backend that needs to set the contrecord flag.
+			 */
+			currpos = GetXLogBuffer(CurrPos, false);
+			pagehdr = (XLogPageHeader) currpos;
+			pagehdr->xlp_rem_len = write_len - written;
+
+			/* skip over the page header */
+			if (CurrPos.xrecoff % XLogSegSize == 0)
 			{
-				memcpy(Insert->currpos, rdata->data, rdata->len);
-				freespace -= rdata->len;
-				write_len -= rdata->len;
-				Insert->currpos += rdata->len;
-				rdata = rdata->next;
-				continue;
+				CurrPos.xrecoff += SizeOfXLogLongPHD;
+				currpos += SizeOfXLogLongPHD;
 			}
+			else
+			{
+				CurrPos.xrecoff += SizeOfXLogShortPHD;
+				currpos += SizeOfXLogShortPHD;
+			}
+			freespace = INSERT_FREESPACE(CurrPos);
 		}
 
-		/* Use next buffer */
-		updrqst = AdvanceXLInsertBuffer(false);
-		curridx = Insert->curridx;
-		/* Insert cont-record header */
-		Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
-		Insert->currpage->xlp_rem_len = write_len;
-		freespace = INSERT_FREESPACE(Insert);
+		Assert (((uint64) currpos) % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
+		memcpy(currpos, rdata_data, rdata_len);
+		currpos += rdata_len;
+		XLByteAdvance(CurrPos, rdata_len);
+		freespace -= rdata_len;
+		written += rdata_len;
+
+		rdata = rdata->next;
 	}
+	Assert(written == write_len);
 
-	/* Ensure next record will be properly aligned */
-	Insert->currpos = (char *) Insert->currpage +
-		MAXALIGN(Insert->currpos - (char *) Insert->currpage);
-	freespace = INSERT_FREESPACE(Insert);
+	/* Align the end position, so that the next record starts aligned */
+	if (CurrPos.xrecoff % MAXIMUM_ALIGNOF != 0)
+	{
+		CurrPos.xrecoff = MAXALIGN(CurrPos.xrecoff);
+		if (CurrPos.xrecoff == 0)
+		{
+			/* crossed a logid boundary */
+			CurrPos.xlogid += 1;
+		}
+	}
 
 	/*
-	 * The recptr I return is the beginning of the *next* record. This will be
-	 * stored as LSN for changed data pages...
+	 * Done! Let others know that we're finished.
 	 */
-	INSERT_RECPTR(RecPtr, Insert, curridx);
+	pg_write_barrier();
+	if (firstpage)
+		*tot_len_p = write_len;
+	else
+		pagehdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
 
 	/*
-	 * If the record is an XLOG_SWITCH, we must now write and flush all the
-	 * existing data, and then forcibly advance to the start of the next
-	 * segment.  It's not good to do this I/O while holding the insert lock,
-	 * but there seems too much risk of confusion if we try to release the
-	 * lock sooner.  Fortunately xlog switch needn't be a high-performance
-	 * operation anyway...
+	 * If this was an xlog-switch, it's not enough to write the switch record,
+	 * we also have to consume all the remaining space in the WAL segment.
+	 * We have already reserved it for us, but we still need to make sure it's
+	 * allocated and zeroed in the WAL buffers so that when the caller (or
+	 * someone else) does XLogWrite(), it can really write out all the zeros.
 	 */
-	if (isLogSwitch)
+	if (isLogSwitch && CurrPos.xrecoff % XLOG_SEG_SIZE != 0)
 	{
-		XLogwrtRqst FlushRqst;
-		XLogRecPtr	OldSegEnd;
+		volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
 
-		TRACE_POSTGRESQL_XLOG_SWITCH();
+		WaitXLogInsertionsToFinish(CurrPos);
 
-		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		/* An xlog-switch record doesn't contain any data besides the header */
+		Assert(write_len == SizeOfXLogRecord);
 
 		/*
-		 * Flush through the end of the page containing XLOG_SWITCH, and
-		 * perform end-of-segment actions (eg, notifying archiver).
+		 * We do this one page at a time, to make sure we don't deadlock
+		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
 		 */
-		WriteRqst = XLogCtl->xlblocks[curridx];
-		FlushRqst.Write = WriteRqst;
-		FlushRqst.Flush = WriteRqst;
-		XLogWrite(FlushRqst, false, true);
-
-		/* Set up the next buffer as first page of next segment */
-		/* Note: AdvanceXLInsertBuffer cannot need to do I/O here */
-		(void) AdvanceXLInsertBuffer(true);
+		Assert(EndPos.xrecoff % XLogSegSize == 0);
 
-		/* There should be no unwritten data */
-		curridx = Insert->curridx;
-		Assert(curridx == XLogCtl->Write.curridx);
+		/* Use up all the remaining space on the first page */
+		XLByteAdvance(CurrPos, freespace);
 
-		/* Compute end address of old segment */
-		OldSegEnd = XLogCtl->xlblocks[curridx];
-		if (OldSegEnd.xrecoff == 0)
+		while (XLByteLT(CurrPos, EndPos))
 		{
-			/* crossing a logid boundary */
-			OldSegEnd.xlogid -= 1;
+			/* initialize the next page (if not initialized already) */
+			AdvanceXLInsertBuffer(CurrPos, false);
+			XLByteAdvance(CurrPos, XLOG_BLCKSZ);
+
+			/*
+			 * Update FinalizedUpto immediately. FinalizeRecord() doesn't know
+			 * that an xlog-switch record consumes the rest of the segment,
+			 * so we have to do this ourselves.
+			 */
+			LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+			FinalizedUpto = Insert->FinalizedUpto = CurrPos;
+			Insert->ExpectingContRecord = false;
+			Assert(XLByteEQ(Insert->LastFinalizedRecord, StartPos));
+			LWLockRelease(WALInsertTailLock);
 		}
-		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
+	}
+	if (!XLByteEQ(CurrPos, EndPos))
+		elog(PANIC, "space reserved for WAL record does not match what was written");
+}
+
+/*
+ * Reserves the right amount of space for a record of given size from the WAL.
+ * *StartPos_p is set to the beginning of the reserved section, *EndPos_p to
+ * its end+1.
+ *
+ * This is the performance critical part of XLogInsert that must be serialized
+ * across backends. The rest can happen mostly in parallel.
+ *
+ * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
+ * where we actually copy the record to the reserved space.
+ */
+static void
+ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos)
+{
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+	uint64		startbytepos;
+	uint64		endbytepos;
 
-		/* Make it look like we've written and synced all of old segment */
-		LogwrtResult.Write = OldSegEnd;
-		LogwrtResult.Flush = OldSegEnd;
+	size = MAXALIGN(size);
 
-		/*
-		 * Update shared-memory status --- this code should match XLogWrite
-		 */
-		{
-			/* use volatile pointer to prevent code rearrangement */
-			volatile XLogCtlData *xlogctl = XLogCtl;
+	/* All (non xlog-switch) records should contain data. */
+	Assert(size > SizeOfXLogRecord);
 
-			SpinLockAcquire(&xlogctl->info_lck);
-			xlogctl->LogwrtResult = LogwrtResult;
-			if (XLByteLT(xlogctl->LogwrtRqst.Write, LogwrtResult.Write))
-				xlogctl->LogwrtRqst.Write = LogwrtResult.Write;
-			if (XLByteLT(xlogctl->LogwrtRqst.Flush, LogwrtResult.Flush))
-				xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
-			SpinLockRelease(&xlogctl->info_lck);
-		}
+	SpinLockAcquire(&Insert->insertpos_lck);
 
-		LWLockRelease(WALWriteLock);
+	startbytepos = Insert->CurrBytePos;
+	endbytepos = startbytepos + size;
+	Insert->CurrBytePos = endbytepos;
+
+	SpinLockRelease(&Insert->insertpos_lck);
+
+	*StartPos = XLogBytePosToRecPtr(startbytepos);
+	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
+	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
+	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
+}
+
+/*
+ * Like ReserveXLogInsertLocation(), but for an xlog-switch record.
+ *
+ * A log-switch record is handled slightly differently. The rest of the
+ * segment will be reserved for this insertion, as indicated by the returned
+ * *EndPos_p value. However, if we are already at the beginning of the current
+ * segment, the *EndPos_p is set to the current location without reserving
+ * any space, and the function returns false.
+*/
+static bool
+ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos)
+{
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+	uint64		startbytepos;
+	uint64		endbytepos;
+	uint32		size = SizeOfXLogRecord;
+	XLogRecPtr	ptr;
+	uint32		segleft;
+
+	SpinLockAcquire(&Insert->insertpos_lck);
+
+	startbytepos = Insert->CurrBytePos;
+
+	ptr = XLogBytePosToEndRecPtr(startbytepos);
+	if (ptr.xrecoff % XLOG_SEG_SIZE == 0)
+	{
+		SpinLockRelease(&Insert->insertpos_lck);
+		*EndPos = *StartPos = ptr;
+		return false;
+	}
+
+	*StartPos = XLogBytePosToRecPtr(startbytepos);
+
+	endbytepos = startbytepos + size;
 
-		updrqst = false;		/* done already */
+	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
+	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
+
+	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
+
+	segleft = XLOG_SEG_SIZE - (EndPos->xrecoff % XLOG_SEG_SIZE);
+	if (segleft != XLOG_SEG_SIZE)
+	{
+		/* consume the rest of the segment */
+		EndPos->xrecoff += segleft;
+		endbytepos = XLogRecPtrToBytePos(*EndPos);
+	}
+	Insert->CurrBytePos = endbytepos;
+
+	SpinLockRelease(&Insert->insertpos_lck);
+
+	Assert(EndPos->xrecoff % XLOG_BLCKSZ == 0);
+
+	return true;
+}
+
+/*
+ * Get a pointer to the right location in the WAL buffer containing the
+ * given XLogRecPtr.
+ *
+ * If the page is not initialized yet, it is initialized. That might require
+ * evicting an old dirty buffer from the buffer cache, which means I/O.
+ * Unless failok == true, in which case the function returns NULL instead.
+ *
+ * The caller must ensure that the page containing the requested location
+ * isn't evicted yet, and won't be evicted. If you have reserved some WAL
+ * space, and not yet marked that you're done inserting it (by not having
+ * set xl_tot_len yet), that is enough. You should not be holding onto
+ * anything < ptr, though, because that might lead to deadlock if we would
+ * need to evict an old buffer to make room for the new one.
+ */
+static char *
+GetXLogBuffer(XLogRecPtr ptr, bool failok)
+{
+	int			idx;
+	XLogRecPtr	endptr;
+	static uint32 cachedXlogid = 0;
+	static uint32 cachedPage = 0;
+	static char *cachedPos = NULL;
+	XLogRecPtr	expectedEndPtr;
+
+	/*
+	 * Fast path for the common case that we need to access again the same
+	 * page as last time.
+	 */
+	if (ptr.xlogid == cachedXlogid && ptr.xrecoff / XLOG_BLCKSZ == cachedPage)
+	{
+		Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
+		Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr.xlogid == cachedXlogid);
+		Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr.xrecoff == cachedPage * XLOG_BLCKSZ);
+		return cachedPos + ptr.xrecoff % XLOG_BLCKSZ;
+	}
+
+	/*
+	 * The XLog buffer cache is organized so that a page must always be loaded
+	 * to a particular buffer.  That way we can easily calculate the buffer
+	 * a given page must be loaded into, from the XLogRecPtr alone.
+	 */
+	idx = XLogRecPtrToBufIdx(ptr);
+
+	/*
+	 * See what page is loaded in the buffer at the moment. It could be the
+	 * page we're looking for, or something older. It can't be anything newer
+	 * - that would imply the page we're looking for has already been written
+	 * out to disk and evicted, and the caller is responsible for making sure
+	 * that doesn't happen.
+	 *
+	 * However, we don't hold a lock while we read the value. If someone has
+	 * just initialized the page, it's possible that we get a "torn read" of
+	 * the XLogRecPtr, and see a bogus value. That's ok, we'll grab the
+	 * mapping lock (in AdvanceXLInsertBuffer) and retry if we see anything
+	 * else than the page we're looking for. But it means that when we do this
+	 * unlocked read, we might see a value that appears to be ahead of the
+	 * page we're looking for. Don't PANIC on that, until we've verified the
+	 * value while holding the lock.
+	 */
+	expectedEndPtr = ptr;
+	XLByteAdvance(expectedEndPtr, XLOG_BLCKSZ - ptr.xrecoff % XLOG_BLCKSZ);
+
+	endptr = XLogCtl->xlblocks[idx];
+	if (!XLByteEQ(expectedEndPtr, endptr))
+	{
+		if (failok)
+			return NULL;
+
+		AdvanceXLInsertBuffer(ptr, false);
+		endptr = XLogCtl->xlblocks[idx];
+
+		if (!XLByteEQ(expectedEndPtr, endptr))
+			elog(PANIC, "could not find WAL buffer for %X/%X",
+				 ptr.xlogid, ptr.xrecoff);
+	}
+
+	/*
+	 * Found the buffer holding this page. Return a pointer to the right
+	 * offset within the page.
+	 */
+	cachedXlogid = ptr.xlogid;
+	cachedPage = ptr.xrecoff / XLOG_BLCKSZ;
+	cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+
+	Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
+	Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr.xlogid == cachedXlogid);
+	Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr.xrecoff == cachedPage * XLOG_BLCKSZ);
+	Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr.xrecoff == ptr.xrecoff - (ptr.xrecoff % XLOG_BLCKSZ));
+
+	return cachedPos + ptr.xrecoff % XLOG_BLCKSZ;
+}
+
+/*
+ * Converts a "usable byte position" to XLogRecPtr. A usable byte position
+ * is the position starting from the beginning of WAL, excluding all WAL
+ * page headers.
+ */
+static XLogRecPtr
+XLogBytePosToRecPtr(uint64 bytepos)
+{
+	uint64		fullsegs;
+	uint64		fullpages;
+	uint64		bytesleft;
+	uint32		seg_offset;
+	XLogRecPtr	result;
+
+	fullsegs = bytepos / UsableBytesInSegment;
+	bytesleft = bytepos % UsableBytesInSegment;
+
+	if (bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD)
+	{
+		/* fits on first page of segment */
+		seg_offset = bytesleft + SizeOfXLogLongPHD;
+	}
+	else
+	{
+		/* account for the first page on segment with long header */
+		seg_offset = XLOG_BLCKSZ;
+		bytesleft -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
+
+		fullpages = bytesleft / UsableBytesInPage;
+		bytesleft = bytesleft % UsableBytesInPage;
+
+		seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD;
+	}
+
+	XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, result);
+
+	return result;
+}
+
+/*
+ * Like XLogBytePosToEndRecPtr, but a page boundary is represented by pointer
+ * to beginning of page, not to where the first xlog record goes to.
+ */
+static XLogRecPtr
+XLogBytePosToEndRecPtr(uint64 bytepos)
+{
+	uint64		fullsegs;
+	uint64		fullpages;
+	uint64		bytesleft;
+	uint32		seg_offset;
+	XLogRecPtr	result;
+
+	fullsegs = bytepos / UsableBytesInSegment;
+	bytesleft = bytepos % UsableBytesInSegment;
+
+	if (bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD)
+	{
+		/* fits on first page of segment */
+		if (bytesleft == 0)
+			seg_offset = 0;
+		else
+			seg_offset = bytesleft + SizeOfXLogLongPHD;
 	}
 	else
 	{
-		/* normal case, ie not xlog switch */
+		/* account for the first page on segment with long header */
+		seg_offset = XLOG_BLCKSZ;
+		bytesleft -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
+
+		fullpages = bytesleft / UsableBytesInPage;
+		bytesleft = bytesleft % UsableBytesInPage;
+
+		if (bytesleft == 0)
+			seg_offset += fullpages * XLOG_BLCKSZ + bytesleft;
+		else
+			seg_offset += fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD;
+	}
+
+	XLogSegNoOffsetToRecPtr(fullsegs, seg_offset, result);
+
+	return result;
+}
+
+/*
+ * Convert an XLogRecPtr to a "usable byte position".
+ */
+static uint64
+XLogRecPtrToBytePos(XLogRecPtr ptr)
+{
+	uint64		fullsegs;
+	uint32		fullpages;
+	uint32		offset;
+	uint64		result;
+
+	XLByteToSeg(ptr, fullsegs);
 
-		/* Need to update shared LogwrtRqst if some block was filled up */
-		if (freespace == 0)
+	fullpages = (ptr.xrecoff % XLOG_SEG_SIZE) / XLOG_BLCKSZ;
+	offset = ptr.xrecoff % XLOG_BLCKSZ;
+
+	if (fullpages == 0)
+	{
+		result = fullsegs * UsableBytesInSegment;
+		if (offset > 0)
 		{
-			/* curridx is filled and available for writing out */
-			updrqst = true;
+			Assert(offset >= SizeOfXLogLongPHD);
+			result += offset - SizeOfXLogLongPHD;
 		}
-		else
+	}
+	else
+	{
+		result = fullsegs * UsableBytesInSegment +
+			(XLOG_BLCKSZ - SizeOfXLogLongPHD) +  /* account for first page */
+			(fullpages - 1) * UsableBytesInPage; /* full pages */
+		if (offset > 0)
 		{
-			/* if updrqst already set, write through end of previous buf */
-			curridx = PrevBufIdx(curridx);
+			Assert(offset >= SizeOfXLogShortPHD);
+			result += offset - SizeOfXLogShortPHD;
 		}
-		WriteRqst = XLogCtl->xlblocks[curridx];
 	}
 
-	LWLockRelease(WALInsertLock);
+	return result;
+}
+
+/*
+ * Attempt to finalize next record, if it's been copied in place.
+ */
+static bool
+FinalizeRecord(void)
+{
+	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	XLogRecPtr	ptr;
+	uint32		len;
+	char	   *p;
+	int			freespace;
+	XLogRecPtr StartPos, EndPos;
+
+	ptr = XLogCtl->Insert.FinalizedUpto;
+	p = GetXLogBuffer(ptr, true);
+	if (p == NULL)
+		return false;
+
+	StartPos = ptr;
 
-	if (updrqst)
+	/*
+	 * If LastFinalizedRecord points to beginning of page, assume it's
+	 * a continuation record.
+	 */
+	if (XLogCtl->Insert.ExpectingContRecord)
 	{
-		/* use volatile pointer to prevent code rearrangement */
-		volatile XLogCtlData *xlogctl = XLogCtl;
+		XLogPageHeader pagehdr = (XLogPageHeader) p;
+		int		pagehdrsize;
 
-		SpinLockAcquire(&xlogctl->info_lck);
-		/* advance global request to include new block(s) */
-		if (XLByteLT(xlogctl->LogwrtRqst.Write, WriteRqst))
-			xlogctl->LogwrtRqst.Write = WriteRqst;
-		/* update local result copy while I have the chance */
-		LogwrtResult = xlogctl->LogwrtResult;
-		SpinLockRelease(&xlogctl->info_lck);
+		Assert(pagehdr->xlp_magic == XLOG_PAGE_MAGIC);
+		pagehdrsize = (ptr.xrecoff % XLOG_SEG_SIZE == 0) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
+		Assert(pagehdrsize == XLogPageHeaderSize(pagehdr));
+
+		if ((pagehdr->xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
+			return false;
+
+		pg_memory_barrier();
+
+		/* Cool, the part of this continued record on this page is done */
+		len = MAXALIGN(pagehdr->xlp_rem_len);
+		Assert(len > 0 && len < 1000000);
+		if (len < XLOG_BLCKSZ - pagehdrsize)
+		{
+			ptr.xrecoff += pagehdrsize + len;
+			Insert->ExpectingContRecord = false;
+		}
+		else if (len == XLOG_BLCKSZ - pagehdrsize)
+		{
+			XLByteAdvance(ptr, XLOG_BLCKSZ);
+			Insert->ExpectingContRecord = false;
+		}
+		else
+		{
+			XLByteAdvance(ptr, XLOG_BLCKSZ);
+			Insert->ExpectingContRecord = true;
+		}
+		FinalizedUpto = XLogCtl->Insert.FinalizedUpto = ptr;
+		EndPos = ptr;
 	}
+	else
+	{
+		XLogRecPtr	LastFinalized = Insert->LastFinalizedRecord;
+		pg_crc32	rdata_crc;
+		XLogRecPtr	recstart;
+		char	   *recstartp;
+
+		/* if we're located at page boundary, skip page header */
+		if (ptr.xrecoff % XLOG_BLCKSZ == 0)
+		{
+			if (ptr.xrecoff % XLOG_SEG_SIZE == 0)
+			{
+				ptr.xrecoff += SizeOfXLogLongPHD;
+				p += SizeOfXLogLongPHD;
+			}
+			else
+			{
+				ptr.xrecoff += SizeOfXLogShortPHD;
+				p += SizeOfXLogShortPHD;
+			}
+		}
 
-	XactLastRecEnd = RecPtr;
+		recstart = ptr;
+		recstartp = p;
 
-	END_CRIT_SECTION();
+		/* NB: we might not have the full header on this page! */
+		/* fetch record->xl_tot_len */
+		len = MAXALIGN(*((uint32 *) p));
+		if (len == 0)
+			return false;
 
-	return RecPtr;
+		pg_memory_barrier();
+
+		/* Cool, this record is done. Set xl_prev, and finish CRC calculation. */
+		/* xl_prev might be on next page */
+		freespace = INSERT_FREESPACE(ptr);
+		if (freespace < offsetof(XLogRecord, xl_prev) + sizeof(XLogRecPtr))
+		{
+			XLogPageHeader pagehdr;
+			int		pagehdrsize;
+			int		off = offsetof(XLogRecord, xl_prev) - freespace;
+
+			XLByteAdvance(ptr, freespace);
+			p = GetXLogBuffer(ptr, true);
+			if (p == NULL)
+				return false;
+
+			pagehdr = (XLogPageHeader) p;
+			pagehdrsize = (ptr.xrecoff % XLOG_SEG_SIZE == 0) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
+
+			Assert(pagehdr->xlp_magic == XLOG_PAGE_MAGIC);
+			Assert(pagehdrsize == XLogPageHeaderSize(pagehdr));
+
+			/*
+			 * If the rest of the record header has not been copied in place
+			 * yet, bail out.
+			 */
+			if ((pagehdr->xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
+				return false;
+			p += pagehdrsize + off;
+		}
+		else
+		{
+			p += offsetof(XLogRecord, xl_prev);
+		}
+		Assert (((uint64) p) % XLOG_BLCKSZ >= SizeOfXLogShortPHD);
+		*((XLogRecPtr *) p) = LastFinalized;
+
+		/* xl_crc might be on next page, if xl_prev was not */
+		ptr = recstart;
+		p = recstartp;
+
+		freespace = INSERT_FREESPACE(ptr);
+		if (freespace < offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32))
+		{
+			XLogPageHeader pagehdr;
+			int		pagehdrsize;
+			int		off = offsetof(XLogRecord, xl_crc) - freespace;
+
+			XLByteAdvance(ptr, freespace);
+			p = GetXLogBuffer(ptr, true);
+			if (p == NULL)
+				return false;
+
+			pagehdr = (XLogPageHeader) p;
+			pagehdrsize = (ptr.xrecoff % XLOG_SEG_SIZE == 0) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
+
+			Assert(pagehdr->xlp_magic == XLOG_PAGE_MAGIC);
+			Assert(pagehdrsize == XLogPageHeaderSize(pagehdr));
+
+			if ((pagehdr->xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
+				return false;
+			p += pagehdrsize + off;
+		}
+		else
+		{
+			p += offsetof(XLogRecord, xl_crc);
+		}
+		Assert (((uint64) p) % XLOG_BLCKSZ >= SizeOfXLogShortPHD);
+
+		/* Update CRC with xl_prev, finish it with FIN_CRC32, and write back */
+		rdata_crc = *((pg_crc32 *) p);
+		COMP_CRC32(rdata_crc, ((char *) &LastFinalized), sizeof(XLogRecPtr));
+		FIN_CRC32(rdata_crc);
+		*((pg_crc32 *) p) = rdata_crc;
+
+		/*
+		 * Update FinalizedUpto to end of record, or end of page where this
+		 * record began, if it didn't fit page.
+		 */
+		ptr = recstart;
+		freespace = INSERT_FREESPACE(ptr);
+		if (len <= freespace)
+		{
+			XLByteAdvance(ptr, len);
+			Insert->ExpectingContRecord = false;
+		}
+		else
+		{
+			XLByteAdvance(ptr, freespace);
+			Insert->ExpectingContRecord = true;
+		}
+		FinalizedUpto = XLogCtl->Insert.FinalizedUpto = ptr;
+		EndPos = ptr;
+		/*
+		 * Update LastFinalizedRecord, so that we can set xl_prev link on
+		 * the next record correctly.
+		 */
+		XLogCtl->Insert.LastFinalizedRecord = recstart;
+	}
+
+#ifdef NOT_USED
+	elog(LOG, "FINALIZE @ %X/%X - %X/%X",
+		 StartPos.xlogid, StartPos.xrecoff, EndPos.xlogid, EndPos.xrecoff);
+#endif
+
+	return true;
+}
+
+/*
+ * Wait for any insertions < upto to finish.
+ *
+ * Returns a value >= upto, which indicates the oldest in-progress insertion
+ * that we saw (or if there are non in-progress, the next insert position).
+ */
+static XLogRecPtr
+WaitXLogInsertionsToFinish(XLogRecPtr upto)
+{
+	if (MyProc == NULL)
+		elog(PANIC, "cannot wait without a PGPROC structure");
+
+	if (XLByteLE(upto, FinalizedUpto))
+		return FinalizedUpto;
+
+	/*
+	 * XXX: Busy-loop until we succeed to finalize up to the requested
+	 * point
+	 */
+	for (;;)
+	{
+		/* Only allow one process to finalize at a time */
+		LWLockAcquire(WALInsertTailLock, LW_EXCLUSIVE);
+
+		/* While we're at it, finalize as far as we can. */
+		while (FinalizeRecord());
+		FinalizedUpto = XLogCtl->Insert.FinalizedUpto;
+
+		LWLockRelease(WALInsertTailLock);
+
+		/* Is this enough? */
+		if (XLogRecPtrIsInvalid(upto) || XLByteLE(upto, FinalizedUpto))
+			return FinalizedUpto;
+	}
 }
 
 /*
@@ -1445,31 +2194,34 @@ XLogArchiveCleanup(const char *xlog)
 }
 
 /*
- * Advance the Insert state to the next buffer page, writing out the next
- * buffer if it still contains unwritten data.
- *
- * If new_segment is TRUE then we set up the next buffer page as the first
- * page of the next xlog segment file, possibly but not usually the next
- * consecutive file page.
- *
- * The global LogwrtRqst.Write pointer needs to be advanced to include the
- * just-filled page.  If we can do this for free (without an extra lock),
- * we do so here.  Otherwise the caller must do it.  We return TRUE if the
- * request update still needs to be done, FALSE if we did it internally.
- *
- * Must be called with WALInsertLock held.
+ * Initialize XLOG buffers, writing out old buffers if they still contain
+ * unwritten data, upto the page containing 'upto'. Or if 'opportunistic' is
+ * true, initialize as many pages as we can without having to write out
+ * unwritten data. Any new pages are initialized to zeros, with pages headers
+ * initialized properly.
  */
-static bool
-AdvanceXLInsertBuffer(bool new_segment)
+static void
+AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	int			nextidx = NextBufIdx(Insert->curridx);
-	bool		update_needed = true;
+	int			nextidx;
 	XLogRecPtr	OldPageRqstPtr;
 	XLogwrtRqst WriteRqst;
-	XLogRecPtr	NewPageEndPtr;
+	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
+	int			npages = 0;
+
+	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+
+	/*
+	 * Now that we have the lock, check if someone initialized the page
+	 * already.
+	 */
+/* XXX: fix indentation before commit */
+while (!XLByteLT(upto, XLogCtl->xlblocks[XLogCtl->curridx]) || opportunistic)
+{
+	nextidx = NextBufIdx(XLogCtl->curridx);
 
 	/*
 	 * Get ending-offset of the buffer page we need to replace (this may be
@@ -1479,10 +2231,12 @@ AdvanceXLInsertBuffer(bool new_segment)
 	OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
 	if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
 	{
-		/* nope, got work to do... */
-		XLogRecPtr	FinishedPageRqstPtr;
-
-		FinishedPageRqstPtr = XLogCtl->xlblocks[Insert->curridx];
+		/*
+		 * Nope, got work to do. If we just want to pre-initialize as much as
+		 * we can without flushing, give up now.
+		 */
+		if (opportunistic)
+			break;
 
 		/* Before waiting, get info_lck and update LogwrtResult */
 		{
@@ -1490,21 +2244,27 @@ AdvanceXLInsertBuffer(bool new_segment)
 			volatile XLogCtlData *xlogctl = XLogCtl;
 
 			SpinLockAcquire(&xlogctl->info_lck);
-			if (XLByteLT(xlogctl->LogwrtRqst.Write, FinishedPageRqstPtr))
-				xlogctl->LogwrtRqst.Write = FinishedPageRqstPtr;
+			if (XLByteLT(xlogctl->LogwrtRqst.Write, OldPageRqstPtr))
+				xlogctl->LogwrtRqst.Write = OldPageRqstPtr;
 			LogwrtResult = xlogctl->LogwrtResult;
 			SpinLockRelease(&xlogctl->info_lck);
 		}
 
-		update_needed = false;	/* Did the shared-request update */
-
 		/*
 		 * Now that we have an up-to-date LogwrtResult value, see if we still
 		 * need to write it or if someone else already did.
 		 */
 		if (!XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
 		{
-			/* Must acquire write lock */
+			/*
+			 * Must acquire write lock. Release WALBufMappingLock first, to
+			 * make sure that all insertions that we need to wait for can
+			 * finish (up to this same position). Otherwise we risk deadlock.
+			 */
+			LWLockRelease(WALBufMappingLock);
+
+			WaitXLogInsertionsToFinish(OldPageRqstPtr);
+
 			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 			LogwrtResult = XLogCtl->LogwrtResult;
 			if (XLByteLE(OldPageRqstPtr, LogwrtResult.Write))
@@ -1514,18 +2274,18 @@ AdvanceXLInsertBuffer(bool new_segment)
 			}
 			else
 			{
-				/*
-				 * Have to write buffers while holding insert lock. This is
-				 * not good, so only write as much as we absolutely must.
-				 */
+				/* Have to write it ourselves */
 				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
 				WriteRqst.Write = OldPageRqstPtr;
 				WriteRqst.Flush.xlogid = 0;
 				WriteRqst.Flush.xrecoff = 0;
-				XLogWrite(WriteRqst, false, false);
+				XLogWrite(WriteRqst, false);
 				LWLockRelease(WALWriteLock);
 				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 			}
+			/* Re-acquire WALBufMappingLock and retry */
+			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+			continue;
 		}
 	}
 
@@ -1533,25 +2293,16 @@ AdvanceXLInsertBuffer(bool new_segment)
 	 * Now the next buffer slot is free and we can set it up to be the next
 	 * output page.
 	 */
-	NewPageBeginPtr = XLogCtl->xlblocks[Insert->curridx];
-
-	if (new_segment)
-	{
-		/* force it to a segment start point */
-		if (NewPageBeginPtr.xrecoff % XLogSegSize != 0)
-			XLByteAdvance(NewPageBeginPtr,
-						  XLogSegSize - NewPageBeginPtr.xrecoff % XLogSegSize);
-	}
+	NewPageBeginPtr = XLogCtl->xlblocks[XLogCtl->curridx];
 
 	NewPageEndPtr = NewPageBeginPtr;
 	XLByteAdvance(NewPageEndPtr, XLOG_BLCKSZ);
-	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
-	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
-	Insert->curridx = nextidx;
-	Insert->currpage = NewPage;
+	Assert(NewPageEndPtr.xrecoff % XLOG_BLCKSZ == 0);
+	Assert(XLogRecEndPtrToBufIdx(NewPageEndPtr) == nextidx);
+	Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
 
-	Insert->currpos = ((char *) NewPage) +SizeOfXLogShortPHD;
+	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
 	/*
 	 * Be sure to re-zero the buffer so that bytes beyond what we've written
@@ -1567,6 +2318,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 	/* NewPage->xlp_info = 0; */	/* done by memset */
 	NewPage   ->xlp_tli = ThisTimeLineID;
 	NewPage   ->xlp_pageaddr = NewPageBeginPtr;
+	/* NewPage	  ->xlp_rem_len = InvalidXLogRecPtr; */	/* done by memset */
 
 	/*
 	 * If online backup is not in progress, mark the header to indicate that
@@ -1594,11 +2346,28 @@ AdvanceXLInsertBuffer(bool new_segment)
 		NewLongPage->xlp_seg_size = XLogSegSize;
 		NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
 		NewPage   ->xlp_info |= XLP_LONG_HEADER;
-
-		Insert->currpos = ((char *) NewPage) +SizeOfXLogLongPHD;
 	}
 
-	return update_needed;
+	/*
+	 * Make sure the initialization of the page becomes visible to others
+	 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
+	 * holding a lock.
+	 */
+	pg_write_barrier();
+
+	*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
+
+	XLogCtl->curridx = nextidx;
+
+	npages++;
+}
+	LWLockRelease(WALBufMappingLock);
+
+#ifdef WAL_DEBUG
+	if (npages > 0)
+		elog(DEBUG1, "initialized %d pages, upto %X/%X",
+			 npages, NewPageEndPtr.xlogid, NewPageEndPtr.xrecoff);
+#endif
 }
 
 /*
@@ -1630,16 +2399,12 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
  * This option allows us to avoid uselessly issuing multiple writes when a
  * single one would do.
  *
- * If xlog_switch == TRUE, we are intending an xlog segment switch, so
- * perform end-of-segment actions after writing the last page, even if
- * it's not physically the end of its segment.  (NB: this will work properly
- * only if caller specifies WriteRqst == page-end and flexible == false,
- * and there is some data to write.)
- *
- * Must be called with WALWriteLock held.
+ * Must be called with WALWriteLock held. And you must've called
+ * WaitXLogInsertionsToFinish(WriteRqst) before grabbing the lock to make sure
+ * the data is ready to write.
  */
 static void
-XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
+XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 {
 	XLogCtlWrite *Write = &XLogCtl->Write;
 	bool		ispartialpage;
@@ -1688,14 +2453,14 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		 * if we're passed a bogus WriteRqst.Write that is past the end of the
 		 * last page that's been initialized by AdvanceXLInsertBuffer.
 		 */
-		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
+		XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
+		if (!XLByteLT(LogwrtResult.Write, EndPtr))
 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
 				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
-				 XLogCtl->xlblocks[curridx].xlogid,
-				 XLogCtl->xlblocks[curridx].xrecoff);
+				 EndPtr.xlogid, EndPtr.xrecoff);
 
 		/* Advance LogwrtResult.Write to end of current buffer page */
-		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
+		LogwrtResult.Write = EndPtr;
 		ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
 
 		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
@@ -1778,6 +2543,12 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 								XLogFileNameP(ThisTimeLineID, openLogSegNo),
 								openLogOff, (unsigned long) nbytes)));
 			}
+#ifdef CLOBBER_FREED_MEMORY
+			if (!ispartialpage)
+				memset(from, 0x7E, nbytes);
+			else if (npages > 1)
+				memset(from, 0x7E, nbytes - XLOG_BLCKSZ);
+#endif
 
 			/* Update state for write */
 			openLogOff += nbytes;
@@ -1791,16 +2562,13 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			 * later. Doing it here ensures that one and only one backend will
 			 * perform this fsync.
 			 *
-			 * We also do this if this is the last page written for an xlog
-			 * switch.
-			 *
 			 * This is also the right place to notify the Archiver that the
 			 * segment is ready to copy to archival storage, and to update the
 			 * timer for archive_timeout, and to signal for a checkpoint if
 			 * too many logfile segments have been used since the last
 			 * checkpoint.
 			 */
-			if (finishing_seg || (xlog_switch && last_iteration))
+			if (finishing_seg)
 			{
 				issue_xlog_fsync(openLogFile, openLogSegNo);
 				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
@@ -1865,7 +2633,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 				openLogFile = XLogFileOpen(openLogSegNo);
 				openLogOff = 0;
 			}
+			elog(LOG, "flushing seg %ld (explicit)", openLogSegNo);
 			issue_xlog_fsync(openLogFile, openLogSegNo);
+			elog(LOG, "done flushing seg %ld (explicit)", openLogSegNo);
 		}
 		LogwrtResult.Flush = LogwrtResult.Write;
 	}
@@ -2066,6 +2836,7 @@ XLogFlush(XLogRecPtr record)
 	{
 		/* use volatile pointer to prevent code rearrangement */
 		volatile XLogCtlData *xlogctl = XLogCtl;
+		XLogRecPtr	insertpos;
 
 		/* read LogwrtResult and update local state */
 		SpinLockAcquire(&xlogctl->info_lck);
@@ -2079,6 +2850,12 @@ XLogFlush(XLogRecPtr record)
 			break;
 
 		/*
+		 * Before actually performing the write, wait for all in-flight
+		 * insertions to the pages we're about to write to finish.
+		 */
+		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+
+		/*
 		 * Try to get the write lock. If we can't get it immediately, wait
 		 * until it's released, and recheck if we still need to do the flush
 		 * or if the backend that held the lock did it for us already. This
@@ -2098,31 +2875,10 @@ XLogFlush(XLogRecPtr record)
 		LogwrtResult = XLogCtl->LogwrtResult;
 		if (!XLByteLE(record, LogwrtResult.Flush))
 		{
-			/* try to write/flush later additions to XLOG as well */
-			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
-			{
-				XLogCtlInsert *Insert = &XLogCtl->Insert;
-				uint32		freespace = INSERT_FREESPACE(Insert);
+			WriteRqst.Write = insertpos;
+			WriteRqst.Flush = insertpos;
 
-				if (freespace == 0)		/* buffer is full */
-					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-				else
-				{
-					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-					if (WriteRqstPtr.xrecoff == 0)
-						WriteRqstPtr.xlogid--;
-					WriteRqstPtr.xrecoff -= freespace;
-				}
-				LWLockRelease(WALInsertLock);
-				WriteRqst.Write = WriteRqstPtr;
-				WriteRqst.Flush = WriteRqstPtr;
-			}
-			else
-			{
-				WriteRqst.Write = WriteRqstPtr;
-				WriteRqst.Flush = record;
-			}
-			XLogWrite(WriteRqst, false, false);
+			XLogWrite(WriteRqst, false);
 		}
 		LWLockRelease(WALWriteLock);
 		/* done */
@@ -2240,7 +2996,8 @@ XLogBackgroundFlush(void)
 
 	START_CRIT_SECTION();
 
-	/* now wait for the write lock */
+	/* now wait for any in-progress insertions to finish and get write lock */
+	WaitXLogInsertionsToFinish(WriteRqstPtr);
 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	if (!XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
@@ -2249,13 +3006,19 @@ XLogBackgroundFlush(void)
 
 		WriteRqst.Write = WriteRqstPtr;
 		WriteRqst.Flush = WriteRqstPtr;
-		XLogWrite(WriteRqst, flexible, false);
+		XLogWrite(WriteRqst, flexible);
 		wrote_something = true;
 	}
 	LWLockRelease(WALWriteLock);
 
 	END_CRIT_SECTION();
 
+	/*
+	 * Great, done. To take some work off the critical path, try to initialize
+	 * as many of the no-longer-needed WAL buffers for future use as we can.
+	 */
+	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+
 	return wrote_something;
 }
 
@@ -5066,6 +5829,7 @@ XLOGShmemSize(void)
 
 	/* XLogCtl */
 	size = sizeof(XLogCtlData);
+
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
@@ -5091,8 +5855,7 @@ XLOGShmemInit(void)
 
 	ControlFile = (ControlFileData *)
 		ShmemInitStruct("Control File", sizeof(ControlFileData), &foundCFile);
-	XLogCtl = (XLogCtlData *)
-		ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &foundXLog);
+	allocptr = ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &foundXLog);
 
 	if (foundCFile || foundXLog)
 	{
@@ -5100,7 +5863,7 @@ XLOGShmemInit(void)
 		Assert(foundCFile && foundXLog);
 		return;
 	}
-
+	XLogCtl = (XLogCtlData *) allocptr;
 	memset(XLogCtl, 0, sizeof(XLogCtlData));
 
 	/*
@@ -5108,7 +5871,7 @@ XLOGShmemInit(void)
 	 * multiple of the alignment for same, so no extra alignment padding is
 	 * needed here.
 	 */
-	allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
+	allocptr += sizeof(XLogCtlData);
 	XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
 	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
 	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
@@ -5128,7 +5891,12 @@ XLOGShmemInit(void)
 	XLogCtl->SharedRecoveryInProgress = true;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
-	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
+
+	XLogCtl->Insert.LastFinalizedRecord = InvalidXLogRecPtr;
+	XLogCtl->Insert.FinalizedUpto = InvalidXLogRecPtr;
+	XLogCtl->Insert.ExpectingContRecord = false;
+
+	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
 
@@ -6006,6 +6774,7 @@ StartupXLOG(void)
 	bool		backupEndRequired = false;
 	bool		backupFromStandby = false;
 	DBState		dbstate_at_startup;
+	int			firstIdx;
 
 	/*
 	 * Read control file and check XLOG status looks valid.
@@ -6258,7 +7027,7 @@ StartupXLOG(void)
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
-	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+	RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
 
 	if (XLByteLT(RecPtr, checkPoint.redo))
 		ereport(PANIC,
@@ -6814,9 +7583,13 @@ StartupXLOG(void)
 	openLogFile = XLogFileOpen(openLogSegNo);
 	openLogOff = 0;
 	Insert = &XLogCtl->Insert;
-	Insert->PrevRecord = LastRec;
-	XLogCtl->xlblocks[0].xlogid = (openLogSegNo * XLOG_SEG_SIZE) >> 32;
-	XLogCtl->xlblocks[0].xrecoff =
+	Insert->LastFinalizedRecord = LastRec;
+
+	firstIdx = XLogRecEndPtrToBufIdx(EndOfLog);
+	XLogCtl->curridx = firstIdx;
+
+	XLogCtl->xlblocks[firstIdx].xlogid = (openLogSegNo * XLOG_SEG_SIZE) >> 32;
+	XLogCtl->xlblocks[firstIdx].xrecoff =
 		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
 
 	/*
@@ -6824,10 +7597,11 @@ StartupXLOG(void)
 	 * record spans, not the one it starts in.	The last block is indeed the
 	 * one we want to use.
 	 */
-	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
-	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
-	Insert->currpos = (char *) Insert->currpage +
-		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
+	Assert(readOff == (XLogCtl->xlblocks[firstIdx].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
+	memcpy((char *) &XLogCtl->pages[firstIdx * XLOG_BLCKSZ], readBuf, XLOG_BLCKSZ);
+	Insert->FinalizedUpto = EndOfLog;
+	Insert->ExpectingContRecord = false;
+	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
@@ -6836,12 +7610,12 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	freespace = INSERT_FREESPACE(Insert);
+	freespace = INSERT_FREESPACE(EndOfLog);
 	if (freespace > 0)
 	{
 		/* Make sure rest of page is zero */
-		MemSet(Insert->currpos, 0, freespace);
-		XLogCtl->Write.curridx = 0;
+		MemSet(&XLogCtl->pages[firstIdx * XLOG_BLCKSZ] + EndOfLog.xrecoff % XLOG_BLCKSZ, 0, freespace);
+		XLogCtl->Write.curridx = firstIdx;
 	}
 	else
 	{
@@ -6853,7 +7627,7 @@ StartupXLOG(void)
 		 * this is sufficient.	The first actual attempt to insert a log
 		 * record will advance the insert state.
 		 */
-		XLogCtl->Write.curridx = NextBufIdx(0);
+		XLogCtl->Write.curridx = NextBufIdx(firstIdx);
 	}
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
@@ -6864,7 +7638,7 @@ StartupXLOG(void)
 	 * XLOG_FPW_CHANGE record before resource manager writes cleanup
 	 * WAL records or checkpoint record is written.
 	 */
-	Insert->fullPageWrites = lastFullPageWrites;
+	Insert->fullPageWrites = doPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7332,21 +8106,29 @@ InitXLOGAccess(void)
 }
 
 /*
- * Once spawned, a backend may update its local RedoRecPtr from
- * XLogCtl->Insert.RedoRecPtr; it must hold the insert lock or info_lck
- * to do so.  This is done in XLogInsert() or GetRedoRecPtr().
+ * Return the current Redo pointer from shared memory.
+ *
+ * As a side-effect, the local RedoRecPtr copy is updated.
  */
 XLogRecPtr
 GetRedoRecPtr(void)
 {
 	/* use volatile pointer to prevent code rearrangement */
 	volatile XLogCtlData *xlogctl = XLogCtl;
+	XLogRecPtr ptr;
 
+	/*
+	 * The possibly not up-to-date copy in XlogCtl is enough. Even if we
+	 * grabbed WALInsertShareLock to read the master copy, someone might update
+	 * it just after we've released the lock.
+	 */
 	SpinLockAcquire(&xlogctl->info_lck);
-	Assert(XLByteLE(RedoRecPtr, xlogctl->Insert.RedoRecPtr));
-	RedoRecPtr = xlogctl->Insert.RedoRecPtr;
+	ptr = xlogctl->RedoRecPtr;
 	SpinLockRelease(&xlogctl->info_lck);
 
+	if (XLByteLT(RedoRecPtr, ptr))
+		RedoRecPtr = xlogctl->RedoRecPtr;
+
 	return RedoRecPtr;
 }
 
@@ -7355,7 +8137,7 @@ GetRedoRecPtr(void)
  *
  * NOTE: The value *actually* returned is the position of the last full
  * xlog page. It lags behind the real insert position by at most 1 page.
- * For that, we don't need to acquire WALInsertLock which can be quite
+ * For that, we don't need to acquire WALInsertShareLock which is
  * heavily contended, and an approximation is enough for the current
  * usage of this function.
  */
@@ -7630,6 +8412,8 @@ LogCheckpointEnd(bool restartpoint)
 void
 CreateCheckPoint(int flags)
 {
+	/* use volatile pointer to prevent code rearrangement */
+	volatile XLogCtlData *xlogctl = XLogCtl;
 	bool		shutdown;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
@@ -7641,6 +8425,7 @@ CreateCheckPoint(int flags)
 	XLogSegNo	insert_logSegNo;
 	TransactionId *inCommitXids;
 	int			nInCommit;
+	XLogRecPtr	curInsert;
 
 	/*
 	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
@@ -7709,10 +8494,11 @@ CreateCheckPoint(int flags)
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
 	/*
-	 * We must hold WALInsertLock while examining insert state to determine
+	 * We must hold insertpos_lck while examining insert state to determine
 	 * the checkpoint REDO pointer.
 	 */
-	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	WALInsertLockAcquire(LW_EXCLUSIVE);
+	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
 
 	/*
 	 * If this isn't a shutdown or forced checkpoint, and we have not switched
@@ -7724,7 +8510,7 @@ CreateCheckPoint(int flags)
 	 * (Perhaps it'd make even more sense to checkpoint only when the previous
 	 * checkpoint record is in a different xlog page?)
 	 *
-	 * While holding the WALInsertLock we find the current WAL insertion point
+	 * While holding insertpos_lck we find the current WAL insertion point
 	 * and compare that with the starting point of the last checkpoint, which
 	 * is the redo pointer. We use the redo pointer because the start and end
 	 * points of a checkpoint can be hundreds of files apart on large systems
@@ -7733,14 +8519,11 @@ CreateCheckPoint(int flags)
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		XLogRecPtr	curInsert;
-
-		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
 		XLByteToSeg(curInsert, insert_logSegNo);
 		XLByteToSeg(ControlFile->checkPointCopy.redo, redo_logSegNo);
 		if (insert_logSegNo == redo_logSegNo)
 		{
-			LWLockRelease(WALInsertLock);
+			WALInsertLockRelease(LW_EXCLUSIVE);
 			LWLockRelease(CheckpointLock);
 			END_CRIT_SECTION();
 			return;
@@ -7767,18 +8550,19 @@ CreateCheckPoint(int flags)
 	 * the buffer flush work.  Those XLOG records are logically after the
 	 * checkpoint, even though physically before it.  Got that?
 	 */
-	freespace = INSERT_FREESPACE(Insert);
+	freespace = INSERT_FREESPACE(curInsert);
 	if (freespace == 0)
 	{
-		(void) AdvanceXLInsertBuffer(false);
-		/* OK to ignore update return flag, since we will do flush anyway */
-		freespace = INSERT_FREESPACE(Insert);
+		if (curInsert.xrecoff % XLogSegSize == 0)
+			curInsert.xrecoff += SizeOfXLogLongPHD;
+		else
+			curInsert.xrecoff += SizeOfXLogShortPHD;
 	}
-	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
+	checkPoint.redo = curInsert;
 
 	/*
 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
-	 * must be done while holding the insert lock AND the info_lck.
+	 * must be done while holding the insert lock.
 	 *
 	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
 	 * pointing past where it really needs to point.  This is okay; the only
@@ -7787,20 +8571,18 @@ CreateCheckPoint(int flags)
 	 * XLogInserts that happen while we are dumping buffers must assume that
 	 * their buffer changes are not included in the checkpoint.
 	 */
-	{
-		/* use volatile pointer to prevent code rearrangement */
-		volatile XLogCtlData *xlogctl = XLogCtl;
-
-		SpinLockAcquire(&xlogctl->info_lck);
-		RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
-		SpinLockRelease(&xlogctl->info_lck);
-	}
+	RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
 
 	/*
 	 * Now we can release WAL insert lock, allowing other xacts to proceed
 	 * while we are flushing disk buffers.
 	 */
-	LWLockRelease(WALInsertLock);
+	WALInsertLockRelease(LW_EXCLUSIVE);
+
+	/* Update the info_lck-protected copy of RedoRecPtr as well */
+	SpinLockAcquire(&xlogctl->info_lck);
+	xlogctl->RedoRecPtr = checkPoint.redo;
+	SpinLockRelease(&xlogctl->info_lck);
 
 	/*
 	 * If enabled, log checkpoint start.  We postpone this until now so as not
@@ -7932,7 +8714,9 @@ CreateCheckPoint(int flags)
 	 */
 	if (shutdown && !XLByteEQ(checkPoint.redo, ProcLastRecPtr))
 		ereport(PANIC,
-				(errmsg("concurrent transaction log activity while database system is shutting down")));
+				(errmsg("concurrent transaction log activity while database system is shutting down (%X/%X vs %X/%X",
+						checkPoint.redo.xlogid, checkPoint.redo.xrecoff,
+						ProcLastRecPtr.xlogid, ProcLastRecPtr.xrecoff)));
 
 	/*
 	 * Select point at which we can truncate the log, which we base on the
@@ -8185,15 +8969,18 @@ CreateRestartPoint(int flags)
 	 * the number of segments replayed since last restartpoint, and request a
 	 * restartpoint if it exceeds checkpoint_segments.
 	 *
-	 * You need to hold WALInsertLock and info_lck to update it, although
-	 * during recovery acquiring WALInsertLock is just pro forma, because
-	 * there is no other processes updating Insert.RedoRecPtr.
+	 * Like in CreatecheckPoint(), hold WALInsertLock to update it, although
+	 * during recovery acquiring insertpos_lck is just pro forma, because no
+	 * WAL insertions are happening.
 	 */
-	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	SpinLockAcquire(&xlogctl->info_lck);
+	WALInsertLockAcquire(LW_EXCLUSIVE);
 	xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
+	WALInsertLockRelease(LW_EXCLUSIVE);
+
+	/* Also update the info_lck-protected copy */
+	SpinLockAcquire(&xlogctl->info_lck);
+	xlogctl->RedoRecPtr = lastCheckPoint.redo;
 	SpinLockRelease(&xlogctl->info_lck);
-	LWLockRelease(WALInsertLock);
 
 	/*
 	 * Prepare to accumulate statistics.
@@ -8461,7 +9248,7 @@ XLogReportParameters(void)
 void
 UpdateFullPageWrites(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
 
 	/*
 	 * Do nothing if full_page_writes has not been changed.
@@ -8484,9 +9271,9 @@ UpdateFullPageWrites(void)
 	 */
 	if (fullPageWrites)
 	{
-		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+		WALInsertLockAcquire(LW_EXCLUSIVE);
 		Insert->fullPageWrites = true;
-		LWLockRelease(WALInsertLock);
+		WALInsertLockRelease(LW_EXCLUSIVE);
 	}
 
 	/*
@@ -8507,9 +9294,9 @@ UpdateFullPageWrites(void)
 
 	if (!fullPageWrites)
 	{
-		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+		WALInsertLockAcquire(LW_EXCLUSIVE);
 		Insert->fullPageWrites = false;
-		LWLockRelease(WALInsertLock);
+		WALInsertLockRelease(LW_EXCLUSIVE);
 	}
 	END_CRIT_SECTION();
 }
@@ -9070,6 +9857,7 @@ XLogFileNameP(TimeLineID tli, XLogSegNo segno)
 XLogRecPtr
 do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 {
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
 	bool		exclusive = (labelfile == NULL);
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	checkpointloc;
@@ -9131,26 +9919,26 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 	 * Note that forcePageWrites has no effect during an online backup from
 	 * the standby.
 	 *
-	 * We must hold WALInsertLock to change the value of forcePageWrites, to
-	 * ensure adequate interlocking against XLogInsert().
+	 * We must hold WALInsertLock to change the value of forcePageWrites,
+	 * to ensure adequate interlocking against XLogInsert().
 	 */
-	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	WALInsertLockAcquire(LW_EXCLUSIVE);
 	if (exclusive)
 	{
-		if (XLogCtl->Insert.exclusiveBackup)
+		if (Insert->exclusiveBackup)
 		{
-			LWLockRelease(WALInsertLock);
+			WALInsertLockRelease(LW_EXCLUSIVE);
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 					 errmsg("a backup is already in progress"),
 					 errhint("Run pg_stop_backup() and try again.")));
 		}
-		XLogCtl->Insert.exclusiveBackup = true;
+		Insert->exclusiveBackup = true;
 	}
 	else
-		XLogCtl->Insert.nonExclusiveBackups++;
-	XLogCtl->Insert.forcePageWrites = true;
-	LWLockRelease(WALInsertLock);
+		Insert->nonExclusiveBackups++;
+	Insert->forcePageWrites = true;
+	WALInsertLockRelease(LW_EXCLUSIVE);
 
 	/* Ensure we release forcePageWrites if fail below */
 	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
@@ -9263,13 +10051,13 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 			 * taking a checkpoint right after another is not that expensive
 			 * either because only few buffers have been dirtied yet.
 			 */
-			LWLockAcquire(WALInsertLock, LW_SHARED);
-			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
+			WALInsertLockAcquire(LW_EXCLUSIVE);
+			if (XLByteLT(Insert->lastBackupStart, startpoint))
 			{
-				XLogCtl->Insert.lastBackupStart = startpoint;
+				Insert->lastBackupStart = startpoint;
 				gotUniqueStartpoint = true;
 			}
-			LWLockRelease(WALInsertLock);
+			WALInsertLockRelease(LW_EXCLUSIVE);
 		} while (!gotUniqueStartpoint);
 
 		XLByteToSeg(startpoint, _logSegNo);
@@ -9353,27 +10141,28 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 static void
 pg_start_backup_callback(int code, Datum arg)
 {
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
 	bool		exclusive = DatumGetBool(arg);
 
 	/* Update backup counters and forcePageWrites on failure */
-	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	WALInsertLockAcquire(LW_EXCLUSIVE);
 	if (exclusive)
 	{
-		Assert(XLogCtl->Insert.exclusiveBackup);
-		XLogCtl->Insert.exclusiveBackup = false;
+		Assert(Insert->exclusiveBackup);
+		Insert->exclusiveBackup = false;
 	}
 	else
 	{
-		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
-		XLogCtl->Insert.nonExclusiveBackups--;
+		Assert(Insert->nonExclusiveBackups > 0);
+		Insert->nonExclusiveBackups--;
 	}
 
-	if (!XLogCtl->Insert.exclusiveBackup &&
-		XLogCtl->Insert.nonExclusiveBackups == 0)
+	if (!Insert->exclusiveBackup &&
+		Insert->nonExclusiveBackups == 0)
 	{
-		XLogCtl->Insert.forcePageWrites = false;
+		Insert->forcePageWrites = false;
 	}
-	LWLockRelease(WALInsertLock);
+	WALInsertLockRelease(LW_EXCLUSIVE);
 }
 
 /*
@@ -9386,6 +10175,7 @@ pg_start_backup_callback(int code, Datum arg)
 XLogRecPtr
 do_pg_stop_backup(char *labelfile, bool waitforarchive)
 {
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
 	bool		exclusive = (labelfile == NULL);
 	bool		backup_started_in_recovery = false;
 	XLogRecPtr	startpoint;
@@ -9438,9 +10228,9 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	/*
 	 * OK to update backup counters and forcePageWrites
 	 */
-	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	WALInsertLockAcquire(LW_EXCLUSIVE);
 	if (exclusive)
-		XLogCtl->Insert.exclusiveBackup = false;
+		Insert->exclusiveBackup = false;
 	else
 	{
 		/*
@@ -9449,16 +10239,16 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 		 * backups, it is expected that each do_pg_start_backup() call is
 		 * matched by exactly one do_pg_stop_backup() call.
 		 */
-		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
-		XLogCtl->Insert.nonExclusiveBackups--;
+		Assert(Insert->nonExclusiveBackups > 0);
+		Insert->nonExclusiveBackups--;
 	}
 
-	if (!XLogCtl->Insert.exclusiveBackup &&
-		XLogCtl->Insert.nonExclusiveBackups == 0)
+	if (!Insert->exclusiveBackup &&
+		Insert->nonExclusiveBackups == 0)
 	{
-		XLogCtl->Insert.forcePageWrites = false;
+		Insert->forcePageWrites = false;
 	}
-	LWLockRelease(WALInsertLock);
+	WALInsertLockRelease(LW_EXCLUSIVE);
 
 	if (exclusive)
 	{
@@ -9736,16 +10526,18 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 void
 do_pg_abort_backup(void)
 {
-	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
-	XLogCtl->Insert.nonExclusiveBackups--;
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+
+	WALInsertLockAcquire(LW_EXCLUSIVE);
+	Assert(Insert->nonExclusiveBackups > 0);
+	Insert->nonExclusiveBackups--;
 
-	if (!XLogCtl->Insert.exclusiveBackup &&
-		XLogCtl->Insert.nonExclusiveBackups == 0)
+	if (!Insert->exclusiveBackup &&
+		Insert->nonExclusiveBackups == 0)
 	{
-		XLogCtl->Insert.forcePageWrites = false;
+		Insert->forcePageWrites = false;
 	}
-	LWLockRelease(WALInsertLock);
+	WALInsertLockRelease(LW_EXCLUSIVE);
 }
 
 /*
@@ -9799,14 +10591,14 @@ GetStandbyFlushRecPtr(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	XLogRecPtr	current_recptr;
+	volatile XLogCtlInsert *Insert = &XLogCtl->Insert;
+	uint64		current_bytepos;
 
-	LWLockAcquire(WALInsertLock, LW_SHARED);
-	INSERT_RECPTR(current_recptr, Insert, Insert->curridx);
-	LWLockRelease(WALInsertLock);
+	SpinLockAcquire(&Insert->insertpos_lck);
+	current_bytepos = Insert->CurrBytePos;
+	SpinLockRelease(&Insert->insertpos_lck);
 
-	return current_recptr;
+	return XLogBytePosToRecPtr(current_bytepos);
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 26469c4..8d6567f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1753,9 +1753,10 @@ GetOldestActiveTransactionId(void)
  * the result is somewhat indeterminate, but we don't really care.  Even in
  * a multiprocessor with delayed writes to shared memory, it should be certain
  * that setting of inCommit will propagate to shared memory when the backend
- * takes the WALInsertLock, so we cannot fail to see an xact as inCommit if
- * it's already inserted its commit record.  Whether it takes a little while
- * for clearing of inCommit to propagate is unimportant for correctness.
+ * takes a lock to write the WAL record, so we cannot fail to see an xact as
+ * inCommit if it's already inserted its commit record.  Whether it takes a
+ * little while for clearing of inCommit to propagate is unimportant for
+ * correctness.
  */
 int
 GetTransactionsInCommit(TransactionId **xids_p)
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index d262efa..479ef9a 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -56,6 +56,9 @@ SpinlockSemas(void)
 	 *
 	 * For now, though, we just need a few spinlocks (10 should be plenty)
 	 * plus one for each LWLock and one for each buffer header.
+	 *
+	 * XXX: remember to adjust this for the number of spinlocks needed by the
+	 * xlog.c changes before committing!
 	 */
 	return NumLWLocks() + NBuffers + 10;
 }
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index a958856..03f854e 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -163,8 +163,7 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 
 /* Check if an xrecoff value is in a plausible range */
 #define XRecOffIsValid(xrecoff) \
-		((xrecoff) % XLOG_BLCKSZ >= SizeOfXLogShortPHD && \
-		(XLOG_BLCKSZ - (xrecoff) % XLOG_BLCKSZ) >= SizeOfXLogRecord)
+		((xrecoff) % XLOG_BLCKSZ >= SizeOfXLogShortPHD)
 
 /*
  * The XLog directory and control file (relative to $PGDATA)
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index ac45ee6..2883549 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -247,7 +247,7 @@
  * Enable debugging print statements for WAL-related operations; see
  * also the wal_debug GUC var.
  */
-/* #define WAL_DEBUG */
+#define WAL_DEBUG
 
 /*
  * Enable tracing of resource consumption during sort operations;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 6b59efc..3d9d5d9 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -53,7 +53,7 @@ typedef enum LWLockId
 	ProcArrayLock,
 	SInvalReadLock,
 	SInvalWriteLock,
-	WALInsertLock,
+	WALBufMappingLock,
 	WALWriteLock,
 	ControlFileLock,
 	CheckpointLock,
@@ -79,6 +79,15 @@ typedef enum LWLockId
 	SerializablePredicateLockListLock,
 	OldSerXidLock,
 	SyncRepLock,
+	WALInsertTailLock,
+	FirstWALInsertShareLock,
+	WALInsertShareLock2,
+	WALInsertShareLock3,
+	WALInsertShareLock4,
+	WALInsertShareLock5,
+	WALInsertShareLock6,
+	WALInsertShareLock7,
+	LastWALInsertShareLock,
 	/* Individual lock IDs end here */
 	FirstBufMappingLock,
 	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Heikki Linnakangas (#1)

Re: WAL format changes

On Thursday, June 14, 2012 11:01:42 PM Heikki Linnakangas wrote:

As I threatened earlier
(http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.co
m), here are three patches that change the WAL format. The goal is to
change the format so that when you're inserting a WAL record of a given
size, you know exactly how much space it requires in the WAL.

I fear the patches need rebasing after the pgindent run... Even before that
(60801944fa105252b48ea5688d47dfc05c695042) it only applies with offsets?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Heikki Linnakangas (#1)

Re: WAL format changes

On Thursday, June 14, 2012 11:01:42 PM Heikki Linnakangas wrote:

As I threatened earlier
(http://archives.postgresql.org/message-id/4FD0B1AB.3090405@enterprisedb.co
m), here are three patches that change the WAL format. The goal is to
change the format so that when you're inserting a WAL record of a given
size, you know exactly how much space it requires in the WAL.

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't that many
users of xlogid/xcrecoff and many of those would be better served by using
helper macros.
API compatibility isn't a great argument either as code manually playing
around with those needs to be checked anyway. I think there might be some code
around that does XLogRecPtr addition manuall and such.

2. Always include the xl_rem_len field, used for continuation records,
in the xlog page header. A continuation log record only contained that
one field, it's now included straight in the page header, so the concept
of a continuation record doesn't exist anymore. Because of alignment,
this wastes 4 bytes on every page that contains continued data from a
previous record, and 8 bytes on pages that don't. That's not very much,
and the next step will buy that back:

3. Allow WAL record header to be split across pages. Per Tom's
suggestion, move xl_tot_len to be the first field in XLogRecord, so that
even if the header is split, xl_tot_len is always on the first page.
xl_crc is moved to be the last field, and xl_prev is the second to last.
This has the advantage that you can calculate the CRC for all the other
fields before acquiring WALInsertLock. For xl_prev, you need to know
where exactly the record is inserted, so it's handy that it's the last
field before CRC. This patch doesn't try to take advantage of that,
however, and I'm not sure if that makes any difference once I finish the
patch to make XLogInsert scale better, which is the ultimate goal of all
this.

Those are the three patches I'd like to get committed in this
commitfest. To see where all this is leading to, I've included a rough
WIP version of the XLogInsert scaling patch. This version is quite
different from the one I posted in spring, it takes advantage of the WAL
format changes, and I'm also experimenting with a different method of
tracking how far each WAL insertion has progressed. But more on that later.

(Note to self: remember to bump XLOG_PAGE_MAGIC)

Will review.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Robert Haas

robertmhaas@gmail.com

over 13 years ago

In reply to: Andres Freund (#3)

Re: WAL format changes

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't that many
users of xlogid/xcrecoff and many of those would be better served by using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 13 years ago

In reply to: Robert Haas (#4)

Re: WAL format changes

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com> wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't that many
users of xlogid/xcrecoff and many of those would be better served by using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Heikki Linnakangas (#5)

Re: WAL format changes

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com>

wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is
unchanged, but it should now be understood as a plain 64-bit value,
just split into two 32-bit integers for historical reasons. On disk,
this means that there will be log files ending in FF, those were
skipped before.

Whats the reason for keeping that awkward split now? There aren't that
many users of xlogid/xcrecoff and many of those would be better served
by using helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without
marking the page dirty?

There is the valid argument that you would loose some information when pages
with hint bits are written out again, but on the other hand you would also
gain the information that it was a hint-bit write...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Robert Haas

robertmhaas@gmail.com

over 13 years ago

In reply to: Heikki Linnakangas (#5)

Re: WAL format changes

On Mon, Jun 18, 2012 at 2:08 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com>
wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't that
many
users of xlogid/xcrecoff and many of those would be better served by
using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll have
to deal with pg_upgrade. I guess we could still keep XLogRecPtr around as
the on-disk representation, and convert between the 64-bit integer and
XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out - many xlog
calculations would admittedly be simpler if it was an uint64.

Ugh. Well, that's a good reason for thinking twice before changing
it, if not abandoning the idea altogether.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 13 years ago

In reply to: Andres Freund (#6)

Re: WAL format changes

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr without
marking the page dirty?

There is the valid argument that you would loose some information when pages
with hint bits are written out again, but on the other hand you would also
gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Heikki Linnakangas (#8)

Re: WAL format changes

On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr
without marking the page dirty?

There is the valid argument that you would loose some information when
pages with hint bits are written out again, but on the other hand you
would also gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

In ReadBuffer_common or such, after reading a page from disk and verifying
that the page has a valid header it seems to be possible to replace pd_lsn *in
memory* by InvalidXLogRecPtr without marking the page dirty.
If the page isn't modified the lsn on disk won't be changed so you don't loose
debugging information in that case. If will be zero if it has been written by
a hint-bit write and thats arguable a win.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 13 years ago

In reply to: Andres Freund (#9)

Re: WAL format changes

On 18.06.2012 21:45, Andres Freund wrote:

On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

I am out of my depth here, not having read any of the relevant code, but
couldn't we simply replace the lsn from disk with InvalidXLogRecPtr
without marking the page dirty?

There is the valid argument that you would loose some information when
pages with hint bits are written out again, but on the other hand you
would also gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

In ReadBuffer_common or such, after reading a page from disk and verifying
that the page has a valid header it seems to be possible to replace pd_lsn *in
memory* by InvalidXLogRecPtr without marking the page dirty.
If the page isn't modified the lsn on disk won't be changed so you don't loose
debugging information in that case. If will be zero if it has been written by
a hint-bit write and thats arguable a win.

We use the LSN to decide whether a full-page image need to be xlogged if
the page is modified. If you reset LSN every time you read in a page,
you'll be doing full page writes every time a page is read from disk and
modified, whether or not it's the first time the page is modified after
the last checkpoint.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Heikki Linnakangas (#10)

Re: WAL format changes

On Monday, June 18, 2012 09:19:48 PM Heikki Linnakangas wrote:

On 18.06.2012 21:45, Andres Freund wrote:

On Monday, June 18, 2012 08:32:54 PM Heikki Linnakangas wrote:

On 18.06.2012 21:13, Andres Freund wrote:

On Monday, June 18, 2012 08:08:14 PM Heikki Linnakangas wrote:

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an
uint64.

I am out of my depth here, not having read any of the relevant code,
but couldn't we simply replace the lsn from disk with
InvalidXLogRecPtr without marking the page dirty?

There is the valid argument that you would loose some information when
pages with hint bits are written out again, but on the other hand you
would also gain the information that it was a hint-bit write...

Sorry, I don't understand that. Where would you "replace the LSN from
disk with InvalidXLogRecPtr" ? (and no, it probably won't work ;-) )

In ReadBuffer_common or such, after reading a page from disk and
verifying that the page has a valid header it seems to be possible to
replace pd_lsn *in memory* by InvalidXLogRecPtr without marking the page
dirty.
If the page isn't modified the lsn on disk won't be changed so you don't
loose debugging information in that case. If will be zero if it has been
written by a hint-bit write and thats arguable a win.

We use the LSN to decide whether a full-page image need to be xlogged if
the page is modified. If you reset LSN every time you read in a page,
you'll be doing full page writes every time a page is read from disk and
modified, whether or not it's the first time the page is modified after
the last checkpoint.

Yea, I somehow disregarded that hint-bit writes would make a problem with full
page writes in that case. Normal writes wouldn't be a problem...
This should be fixable but the result would be too ugly. So its either
converting the on-disk representation or keeping the duplicated
representation.

pd_lsn seems to be well enough abstracted, doing the conversion in
PageSet/GetLSN seems to be easy enough.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 13 years ago

In reply to: Heikki Linnakangas (#5)

1 attachment(s)

Re: WAL format changes

On 18.06.2012 21:08, Heikki Linnakangas wrote:

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com>
wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is
unchanged,
but it should now be understood as a plain 64-bit value, just split
into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't
that many
users of xlogid/xcrecoff and many of those would be better served by
using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

The LSNs on pages are still stored in the old format, to avoid changing
the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the
control file and WAL are changed, however, so an initdb (or at least
pg_resetxlog) is required.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol. That would make it easier
to write a client that works across different architectures, in >= 9.3.
The contents of the WAL would of course be architecture-dependent, but
it would be nice if pg_receivexlog and similar tools could nevertheless
be architecture-independent.

I kept the %X/%X representation in error messages etc. I'm quite used to
that output, so reluctant to change it, although it's a bit silly now
that it represents just 64-bit value. Using UINT64_FORMAT would also
make the messages harder to translate.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

xlogrecptr-uint64-1.patchtext/x-diff; name=xlogrecptr-uint64-1.patchDownload

diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
index f51a4e3..e8a7940 100644
--- a/contrib/pageinspect/rawpage.c
+++ b/contrib/pageinspect/rawpage.c
@@ -206,7 +206,8 @@ page_header(PG_FUNCTION_ARGS)
 	/* Extract information from the page header */
 
 	lsn = PageGetLSN(page);
-	snprintf(lsnchar, sizeof(lsnchar), "%X/%X", lsn.xlogid, lsn.xrecoff);
+	snprintf(lsnchar, sizeof(lsnchar), "%X/%X",
+			 (uint32) (lsn >> 32), (uint32) lsn);
 
 	values[0] = CStringGetTextDatum(lsnchar);
 	values[1] = UInt16GetDatum(PageGetTLI(page));
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 1efaaee..c6e3baa 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -197,7 +197,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		SplitedPageLayout *dist = NULL,
 				   *ptr;
 		BlockNumber oldrlink = InvalidBlockNumber;
-		GistNSN		oldnsn = {0, 0};
+		GistNSN		oldnsn = 0;
 		SplitedPageLayout rootpg;
 		BlockNumber blkno = BufferGetBlockNumber(buffer);
 		bool		is_rootsplit;
@@ -488,7 +488,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
-	firststack.lsn.xrecoff = 0;
+	firststack.lsn = 0;
 	firststack.parent = NULL;
 	firststack.downlinkoffnum = InvalidOffsetNumber;
 	state.stack = stack = &firststack;
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 8039b5d..df1e2e3 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -706,13 +706,7 @@ gistoptions(PG_FUNCTION_ARGS)
 XLogRecPtr
 GetXLogRecPtrForTemp(void)
 {
-	static XLogRecPtr counter = {0, 1};
-
-	counter.xrecoff++;
-	if (counter.xrecoff == 0)
-	{
-		counter.xlogid++;
-		counter.xrecoff++;
-	}
+	static XLogRecPtr counter = 1;
+	counter++;
 	return counter;
 }
diff --git a/src/backend/access/transam/transam.c b/src/backend/access/transam/transam.c
index a7214cf..ff9dd4b 100644
--- a/src/backend/access/transam/transam.c
+++ b/src/backend/access/transam/transam.c
@@ -24,9 +24,6 @@
 #include "access/transam.h"
 #include "utils/snapmgr.h"
 
-/* Handy constant for an invalid xlog recptr */
-const XLogRecPtr InvalidXLogRecPtr = {0, 0};
-
 /*
  * Single-item cache for results of TransactionLogFetch.  It's worth having
  * such a cache because we frequently find ourselves repeatedly checking the
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6db46c0..de51adf 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -333,8 +333,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 
 	gxact->prepared_at = prepared_at;
 	/* initialize LSN to 0 (start of WAL) */
-	gxact->prepare_lsn.xlogid = 0;
-	gxact->prepare_lsn.xrecoff = 0;
+	gxact->prepare_lsn = 0;
 	gxact->owner = owner;
 	gxact->locking_xid = xid;
 	gxact->valid = false;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 49c14cb..09c45c8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -951,7 +951,7 @@ RecordTransactionCommit(void)
 	if (XLogStandbyInfoActive())
 		nmsgs = xactGetCommittedInvalidationMessages(&invalMessages,
 													 &RelcacheInitFileInval);
-	wrote_xlog = (XactLastRecEnd.xrecoff != 0);
+	wrote_xlog = (XactLastRecEnd != 0);
 
 	/*
 	 * If we haven't been assigned an XID yet, we neither can, nor do we want
@@ -1198,7 +1198,7 @@ RecordTransactionCommit(void)
 		SyncRepWaitForLSN(XactLastRecEnd);
 
 	/* Reset XactLastRecEnd until the next transaction writes something */
-	XactLastRecEnd.xrecoff = 0;
+	XactLastRecEnd = 0;
 
 cleanup:
 	/* Clean up local data */
@@ -1400,7 +1400,7 @@ RecordTransactionAbort(bool isSubXact)
 	{
 		/* Reset XactLastRecEnd until the next transaction writes something */
 		if (!isSubXact)
-			XactLastRecEnd.xrecoff = 0;
+			XactLastRecEnd = 0;
 		return InvalidTransactionId;
 	}
 
@@ -1499,7 +1499,7 @@ RecordTransactionAbort(bool isSubXact)
 
 	/* Reset XactLastRecEnd until the next transaction writes something */
 	if (!isSubXact)
-		XactLastRecEnd.xrecoff = 0;
+		XactLastRecEnd = 0;
 
 	/* And clean up local data */
 	if (rels)
@@ -2165,7 +2165,7 @@ PrepareTransaction(void)
 	 */
 
 	/* Reset XactLastRecEnd until the next transaction writes something */
-	XactLastRecEnd.xrecoff = 0;
+	XactLastRecEnd = 0;
 
 	/*
 	 * Let others know about no transaction in progress by me.	This has to be
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3f5e0b2..95224a8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -252,9 +252,9 @@ static TimeLineID curFileTLI;
  * or start a new one; so it can be used to tell if the current transaction has
  * created any XLOG records.
  */
-static XLogRecPtr ProcLastRecPtr = {0, 0};
+static XLogRecPtr ProcLastRecPtr = InvalidXLogRecPtr;
 
-XLogRecPtr	XactLastRecEnd = {0, 0};
+XLogRecPtr	XactLastRecEnd = InvalidXLogRecPtr;
 
 /*
  * RedoRecPtr is this backend's local copy of the REDO record pointer
@@ -278,7 +278,7 @@ static XLogRecPtr RedoRecPtr;
  * backwards to the REDO location after reading the checkpoint record,
  * because the REDO record can precede the checkpoint record.
  */
-static XLogRecPtr RedoStartLSN = {0, 0};
+static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
 
 /*----------
  * Shared-memory data structures for XLOG control
@@ -490,13 +490,7 @@ static ControlFileData *ControlFile = NULL;
 
 /* Construct XLogRecPtr value for current insertion point */
 #define INSERT_RECPTR(recptr,Insert,curridx)  \
-	do {																\
-		(recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid;			\
-		(recptr).xrecoff =												\
-			XLogCtl->xlblocks[curridx].xrecoff - INSERT_FREESPACE(Insert); \
-		if (XLogCtl->xlblocks[curridx].xrecoff == 0)					\
-			(recptr).xlogid = XLogCtl->xlblocks[curridx].xlogid - 1;	\
-	} while(0)
+		(recptr) = XLogCtl->xlblocks[curridx] - INSERT_FREESPACE(Insert)
 
 #define PrevBufIdx(idx)		\
 		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
@@ -508,7 +502,7 @@ static ControlFileData *ControlFile = NULL;
  * Private, possibly out-of-date copy of shared LogwrtResult.
  * See discussion above.
  */
-static XLogwrtResult LogwrtResult = {{0, 0}, {0, 0}};
+static XLogwrtResult LogwrtResult = {0, 0};
 
 /*
  * Codes indicating where we got a WAL file from during recovery, or where
@@ -742,8 +736,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	 */
 	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
 	{
-		RecPtr.xlogid = 0;
-		RecPtr.xrecoff = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
+		RecPtr = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
 		return RecPtr;
 	}
 
@@ -1008,13 +1001,12 @@ begin:;
 	 * everything is written and flushed through the end of the prior segment,
 	 * and return the prior segment's end address.
 	 */
-	if (isLogSwitch &&
-		(RecPtr.xrecoff % XLogSegSize) == SizeOfXLogLongPHD)
+	if (isLogSwitch && (RecPtr % XLogSegSize) == SizeOfXLogLongPHD)
 	{
 		/* We can release insert lock immediately */
 		LWLockRelease(WALInsertLock);
 
-		RecPtr.xrecoff -= SizeOfXLogLongPHD;
+		RecPtr -= SizeOfXLogLongPHD;
 
 		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 		LogwrtResult = XLogCtl->LogwrtResult;
@@ -1048,7 +1040,7 @@ begin:;
 
 		initStringInfo(&buf);
 		appendStringInfo(&buf, "INSERT @ %X/%X: ",
-						 RecPtr.xlogid, RecPtr.xrecoff);
+						 (uint32) (RecPtr >> 32), (uint32) RecPtr);
 		xlog_outrec(&buf, rechdr);
 		if (rdata->data != NULL)
 		{
@@ -1149,12 +1141,7 @@ begin:;
 
 		/* Compute end address of old segment */
 		OldSegEnd = XLogCtl->xlblocks[curridx];
-		if (OldSegEnd.xrecoff == 0)
-		{
-			/* crossing a logid boundary */
-			OldSegEnd.xlogid -= 1;
-		}
-		OldSegEnd.xrecoff -= XLOG_BLCKSZ;
+		OldSegEnd -= XLOG_BLCKSZ;
 
 		/* Make it look like we've written and synced all of old segment */
 		LogwrtResult.Write = OldSegEnd;
@@ -1520,8 +1507,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 				 */
 				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
 				WriteRqst.Write = OldPageRqstPtr;
-				WriteRqst.Flush.xlogid = 0;
-				WriteRqst.Flush.xrecoff = 0;
+				WriteRqst.Flush = 0;
 				XLogWrite(WriteRqst, false, false);
 				LWLockRelease(WALWriteLock);
 				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
@@ -1538,9 +1524,9 @@ AdvanceXLInsertBuffer(bool new_segment)
 	if (new_segment)
 	{
 		/* force it to a segment start point */
-		if (NewPageBeginPtr.xrecoff % XLogSegSize != 0)
+		if (NewPageBeginPtr % XLogSegSize != 0)
 			XLByteAdvance(NewPageBeginPtr,
-						  XLogSegSize - NewPageBeginPtr.xrecoff % XLogSegSize);
+						  XLogSegSize - NewPageBeginPtr % XLogSegSize);
 	}
 
 	NewPageEndPtr = NewPageBeginPtr;
@@ -1586,7 +1572,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 	/*
 	 * If first page of an XLOG segment file, make it a long header.
 	 */
-	if ((NewPage->xlp_pageaddr.xrecoff % XLogSegSize) == 0)
+	if ((NewPage->xlp_pageaddr % XLogSegSize) == 0)
 	{
 		XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
 
@@ -1690,9 +1676,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		 */
 		if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[curridx]))
 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
-				 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
-				 XLogCtl->xlblocks[curridx].xlogid,
-				 XLogCtl->xlblocks[curridx].xrecoff);
+				 (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
+				 (uint32) (XLogCtl->xlblocks[curridx] >> 32),
+				 (uint32) XLogCtl->xlblocks[curridx]);
 
 		/* Advance LogwrtResult.Write to end of current buffer page */
 		LogwrtResult.Write = XLogCtl->xlblocks[curridx];
@@ -1728,7 +1714,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		{
 			/* first of group */
 			startidx = curridx;
-			startoffset = (LogwrtResult.Write.xrecoff - XLOG_BLCKSZ) % XLogSegSize;
+			startoffset = (LogwrtResult.Write - XLOG_BLCKSZ) % XLogSegSize;
 		}
 		npages++;
 
@@ -1920,7 +1906,7 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
 	if (!sleeping)
 	{
 		/* back off to last completed page boundary */
-		WriteRqstPtr.xrecoff -= WriteRqstPtr.xrecoff % XLOG_BLCKSZ;
+		WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
 
 		/* if we have already flushed that far, we're done */
 		if (XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
@@ -1962,7 +1948,7 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	 * i.e., we're doing crash recovery.  We never modify the control file's
 	 * value in that case, so we can short-circuit future checks here too.
 	 */
-	if (minRecoveryPoint.xlogid == 0 && minRecoveryPoint.xrecoff == 0)
+	if (minRecoveryPoint == 0)
 		updateMinRecoveryPoint = false;
 	else if (force || XLByteLT(minRecoveryPoint, lsn))
 	{
@@ -1990,8 +1976,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 		if (!force && XLByteLT(newMinRecoveryPoint, lsn))
 			elog(WARNING,
 			   "xlog min recovery request %X/%X is past current point %X/%X",
-				 lsn.xlogid, lsn.xrecoff,
-				 newMinRecoveryPoint.xlogid, newMinRecoveryPoint.xrecoff);
+				 (uint32) (lsn >> 32) , (uint32) lsn,
+				 (uint32) (newMinRecoveryPoint >> 32),
+				 (uint32) newMinRecoveryPoint);
 
 		/* update control file */
 		if (XLByteLT(ControlFile->minRecoveryPoint, newMinRecoveryPoint))
@@ -2002,7 +1989,8 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 
 			ereport(DEBUG2,
 					(errmsg("updated min recovery point to %X/%X",
-						minRecoveryPoint.xlogid, minRecoveryPoint.xrecoff)));
+							(uint32) (minRecoveryPoint >> 32),
+							(uint32) minRecoveryPoint)));
 		}
 	}
 	LWLockRelease(ControlFileLock);
@@ -2040,9 +2028,9 @@ XLogFlush(XLogRecPtr record)
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 		elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
-			 record.xlogid, record.xrecoff,
-			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
-			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
+			 (uint32) (record >> 32), (uint32) record,
+			 (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
+			 (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
 #endif
 
 	START_CRIT_SECTION();
@@ -2109,9 +2097,7 @@ XLogFlush(XLogRecPtr record)
 				else
 				{
 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
-					if (WriteRqstPtr.xrecoff == 0)
-						WriteRqstPtr.xlogid--;
-					WriteRqstPtr.xrecoff -= freespace;
+					WriteRqstPtr -= freespace;
 				}
 				LWLockRelease(WALInsertLock);
 				WriteRqst.Write = WriteRqstPtr;
@@ -2155,8 +2141,8 @@ XLogFlush(XLogRecPtr record)
 	if (XLByteLT(LogwrtResult.Flush, record))
 		elog(ERROR,
 		"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
-			 record.xlogid, record.xrecoff,
-			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
+			 (uint32) (record >> 32), (uint32) record,
+			 (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
 }
 
 /*
@@ -2199,7 +2185,7 @@ XLogBackgroundFlush(void)
 	}
 
 	/* back off to last completed page boundary */
-	WriteRqstPtr.xrecoff -= WriteRqstPtr.xrecoff % XLOG_BLCKSZ;
+	WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
 
 	/* if we have already flushed that far, consider async commit records */
 	if (XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
@@ -2233,9 +2219,9 @@ XLogBackgroundFlush(void)
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 		elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
-			 WriteRqstPtr.xlogid, WriteRqstPtr.xrecoff,
-			 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
-			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
+			 (uint32) (WriteRqstPtr >> 32), (uint32) WriteRqstPtr,
+			 (uint32) (LogwrtResult.Write >> 32), (uint32) LogwrtResult.Write,
+			 (uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
 #endif
 
 	START_CRIT_SECTION();
@@ -2294,7 +2280,7 @@ XLogNeedsFlush(XLogRecPtr record)
 		 * file's value in that case, so we can short-circuit future checks
 		 * here too.
 		 */
-		if (minRecoveryPoint.xlogid == 0 && minRecoveryPoint.xrecoff == 0)
+		if (minRecoveryPoint == 0)
 			updateMinRecoveryPoint = false;
 
 		/* check again */
@@ -3308,8 +3294,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	bool		use_existent;
 
 	XLByteToPrevSeg(endptr, _logSegNo);
-	if ((endptr.xrecoff - 1) % XLogSegSize >=
-		(uint32) (0.75 * XLogSegSize))
+	if ((endptr - 1) % XLogSegSize >= (uint32) (0.75 * XLogSegSize))
 	{
 		_logSegNo++;
 		use_existent = true;
@@ -3693,7 +3678,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
 		{
 			ereport(emode_for_corrupt_record(emode, recptr),
 					(errmsg("incorrect hole size in record at %X/%X",
-							recptr.xlogid, recptr.xrecoff)));
+							(uint32) (recptr >> 32), (uint32) recptr)));
 			return false;
 		}
 		blen = sizeof(BkpBlock) + BLCKSZ - bkpb.hole_length;
@@ -3706,7 +3691,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
 	{
 		ereport(emode_for_corrupt_record(emode, recptr),
 				(errmsg("incorrect total length in record at %X/%X",
-						recptr.xlogid, recptr.xrecoff)));
+						(uint32) (recptr >> 32), (uint32) recptr)));
 		return false;
 	}
 
@@ -3718,7 +3703,7 @@ RecordIsValid(XLogRecord *record, XLogRecPtr recptr, int emode)
 	{
 		ereport(emode_for_corrupt_record(emode, recptr),
 		(errmsg("incorrect resource manager data checksum in record at %X/%X",
-				recptr.xlogid, recptr.xrecoff)));
+				(uint32) (recptr >> 32), (uint32) recptr)));
 		return false;
 	}
 
@@ -3783,10 +3768,10 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 		 * In this case, the passed-in record pointer should already be
 		 * pointing to a valid record starting position.
 		 */
-		if (!XRecOffIsValid(RecPtr->xrecoff))
+		if (!XRecOffIsValid(*RecPtr))
 			ereport(PANIC,
 					(errmsg("invalid record offset at %X/%X",
-							RecPtr->xlogid, RecPtr->xrecoff)));
+							(uint32) (*RecPtr >> 32), (uint32) *RecPtr)));
 
 		/*
 		 * Since we are going to a random position in WAL, forget any prior
@@ -3807,7 +3792,7 @@ retry:
 		return NULL;
 
 	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
-	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
+	targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
 	if (targetRecOff == 0)
 	{
 		/*
@@ -3817,14 +3802,14 @@ retry:
 		 * XRecOffIsValid rejected the zero-page-offset case otherwise.
 		 */
 		Assert(RecPtr == &tmpRecPtr);
-		RecPtr->xrecoff += pageHeaderSize;
+		(*RecPtr) += pageHeaderSize;
 		targetRecOff = pageHeaderSize;
 	}
 	else if (targetRecOff < pageHeaderSize)
 	{
 		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("invalid record offset at %X/%X",
-						RecPtr->xlogid, RecPtr->xrecoff)));
+						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 		goto next_record_is_invalid;
 	}
 	if ((((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
@@ -3832,7 +3817,7 @@ retry:
 	{
 		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("contrecord is requested by %X/%X",
-						RecPtr->xlogid, RecPtr->xrecoff)));
+						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 		goto next_record_is_invalid;
 	}
 
@@ -3842,7 +3827,7 @@ retry:
 	 * struct, so it must be on this page, but we cannot safely access any
 	 * other fields yet.
 	 */
-	record = (XLogRecord *) (readBuf + RecPtr->xrecoff % XLOG_BLCKSZ);
+	record = (XLogRecord *) (readBuf + (*RecPtr) % XLOG_BLCKSZ);
 	total_len = record->xl_tot_len;
 
 	/* Make sure the record buffer can hold the whole record. */
@@ -3870,7 +3855,7 @@ retry:
 			/* We treat this as a "bogus data" condition */
 			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errmsg("record length %u at %X/%X too long",
-							total_len, RecPtr->xlogid, RecPtr->xrecoff)));
+							total_len, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 			goto next_record_is_invalid;
 		}
 		readRecordBufSize = newSize;
@@ -3889,7 +3874,7 @@ retry:
 	else
 		gotheader = false;
 
-	len = XLOG_BLCKSZ - RecPtr->xrecoff % XLOG_BLCKSZ;
+	len = XLOG_BLCKSZ - (*RecPtr) % XLOG_BLCKSZ;
 	if (total_len > len)
 	{
 		/* Need to reassemble record */
@@ -3900,11 +3885,10 @@ retry:
 		uint32		gotlen;
 
 		/* Initialize pagelsn to the beginning of the page this record is on */
-		pagelsn = *RecPtr;
-		pagelsn.xrecoff = (pagelsn.xrecoff / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+		pagelsn = ((*RecPtr) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
 
 		/* Copy the first fragment of the record from the first page. */
-		memcpy(readRecordBuf, readBuf + RecPtr->xrecoff % XLOG_BLCKSZ, len);
+		memcpy(readRecordBuf, readBuf + (*RecPtr) % XLOG_BLCKSZ, len);
 		buffer = readRecordBuf + len;
 		gotlen = len;
 
@@ -3976,8 +3960,7 @@ retry:
 		/* Record does not cross a page boundary */
 		if (!RecordIsValid(record, *RecPtr, emode))
 			goto next_record_is_invalid;
-		EndRecPtr.xlogid = RecPtr->xlogid;
-		EndRecPtr.xrecoff = RecPtr->xrecoff + MAXALIGN(total_len);
+		EndRecPtr = *RecPtr + MAXALIGN(total_len);
 
 		ReadRecPtr = *RecPtr;
 		memcpy(readRecordBuf, record, total_len);
@@ -3989,8 +3972,8 @@ retry:
 	if (record->xl_rmid == RM_XLOG_ID && record->xl_info == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
-		EndRecPtr.xrecoff += XLogSegSize - 1;
-		EndRecPtr.xrecoff -= EndRecPtr.xrecoff % XLogSegSize;
+		EndRecPtr += XLogSegSize - 1;
+		EndRecPtr -= EndRecPtr % XLogSegSize;
 
 		/*
 		 * Pretend that readBuf contains the last page of the segment. This is
@@ -4101,7 +4084,7 @@ ValidXLogPageHeader(XLogPageHeader hdr, int emode)
 	{
 		ereport(emode_for_corrupt_record(emode, recaddr),
 				(errmsg("unexpected pageaddr %X/%X in log segment %s, offset %u",
-						hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
+						(uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
 						XLogFileNameP(curFileTLI, readSegNo),
 						readOff)));
 		return false;
@@ -4162,7 +4145,7 @@ ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
 		{
 			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errmsg("invalid xlog switch record at %X/%X",
-							RecPtr->xlogid, RecPtr->xrecoff)));
+							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 			return false;
 		}
 	}
@@ -4170,7 +4153,7 @@ ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
 	{
 		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("record with zero length at %X/%X",
-						RecPtr->xlogid, RecPtr->xrecoff)));
+						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 		return false;
 	}
 	if (record->xl_tot_len < SizeOfXLogRecord + record->xl_len ||
@@ -4179,14 +4162,14 @@ ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
 	{
 		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("invalid record length at %X/%X",
-						RecPtr->xlogid, RecPtr->xrecoff)));
+						(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 		return false;
 	}
 	if (record->xl_rmid > RM_MAX_ID)
 	{
 		ereport(emode_for_corrupt_record(emode, *RecPtr),
 				(errmsg("invalid resource manager ID %u at %X/%X",
-						record->xl_rmid, RecPtr->xlogid, RecPtr->xrecoff)));
+						record->xl_rmid, (uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 		return false;
 	}
 	if (randAccess)
@@ -4199,8 +4182,8 @@ ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
 		{
 			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
-							record->xl_prev.xlogid, record->xl_prev.xrecoff,
-							RecPtr->xlogid, RecPtr->xrecoff)));
+							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 			return false;
 		}
 	}
@@ -4215,8 +4198,8 @@ ValidXLogRecordHeader(XLogRecPtr *RecPtr, XLogRecord *record, int emode,
 		{
 			ereport(emode_for_corrupt_record(emode, *RecPtr),
 					(errmsg("record with incorrect prev-link %X/%X at %X/%X",
-							record->xl_prev.xlogid, record->xl_prev.xrecoff,
-							RecPtr->xlogid, RecPtr->xrecoff)));
+							(uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
+							(uint32) ((*RecPtr) >> 32), (uint32) *RecPtr)));
 			return false;
 		}
 	}
@@ -5189,8 +5172,7 @@ BootStrapXLOG(void)
 	 * segment with logid=0 logseg=1. The very first WAL segment, 0/0, is not
 	 * used, so that we can use 0/0 to mean "before any valid WAL segment".
 	 */
-	checkPoint.redo.xlogid = 0;
-	checkPoint.redo.xrecoff = XLogSegSize + SizeOfXLogLongPHD;
+	checkPoint.redo = XLogSegSize + SizeOfXLogLongPHD;
 	checkPoint.ThisTimeLineID = ThisTimeLineID;
 	checkPoint.fullPageWrites = fullPageWrites;
 	checkPoint.nextXidEpoch = 0;
@@ -5213,8 +5195,7 @@ BootStrapXLOG(void)
 	page->xlp_magic = XLOG_PAGE_MAGIC;
 	page->xlp_info = XLP_LONG_HEADER;
 	page->xlp_tli = ThisTimeLineID;
-	page->xlp_pageaddr.xlogid = 0;
-	page->xlp_pageaddr.xrecoff = XLogSegSize;
+	page->xlp_pageaddr = XLogSegSize;
 	longpage = (XLogLongPageHeader) page;
 	longpage->xlp_sysid = sysidentifier;
 	longpage->xlp_seg_size = XLogSegSize;
@@ -5222,8 +5203,7 @@ BootStrapXLOG(void)
 
 	/* Insert the initial checkpoint record */
 	record = (XLogRecord *) ((char *) page + SizeOfXLogLongPHD);
-	record->xl_prev.xlogid = 0;
-	record->xl_prev.xrecoff = 0;
+	record->xl_prev = 0;
 	record->xl_xid = InvalidTransactionId;
 	record->xl_tot_len = SizeOfXLogRecord + sizeof(checkPoint);
 	record->xl_len = sizeof(checkPoint);
@@ -6017,7 +5997,7 @@ StartupXLOG(void)
 
 	if (ControlFile->state < DB_SHUTDOWNED ||
 		ControlFile->state > DB_IN_PRODUCTION ||
-		!XRecOffIsValid(ControlFile->checkPoint.xrecoff))
+		!XRecOffIsValid(ControlFile->checkPoint))
 		ereport(FATAL,
 				(errmsg("control file contains invalid data")));
 
@@ -6153,7 +6133,7 @@ StartupXLOG(void)
 			wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
 			ereport(DEBUG1,
 					(errmsg("checkpoint record is at %X/%X",
-							checkPointLoc.xlogid, checkPointLoc.xrecoff)));
+							(uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
 			InRecovery = true;	/* force recovery even if SHUTDOWNED */
 
 			/*
@@ -6193,7 +6173,7 @@ StartupXLOG(void)
 		{
 			ereport(DEBUG1,
 					(errmsg("checkpoint record is at %X/%X",
-							checkPointLoc.xlogid, checkPointLoc.xrecoff)));
+							(uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
 		}
 		else if (StandbyMode)
 		{
@@ -6212,7 +6192,7 @@ StartupXLOG(void)
 			{
 				ereport(LOG,
 						(errmsg("using previous checkpoint record at %X/%X",
-							  checkPointLoc.xlogid, checkPointLoc.xrecoff)));
+								(uint32) (checkPointLoc >> 32), (uint32) checkPointLoc)));
 				InRecovery = true;		/* force recovery even if SHUTDOWNED */
 			}
 			else
@@ -6227,7 +6207,7 @@ StartupXLOG(void)
 
 	ereport(DEBUG1,
 			(errmsg("redo record is at %X/%X; shutdown %s",
-					checkPoint.redo.xlogid, checkPoint.redo.xrecoff,
+					(uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo,
 					wasShutdown ? "TRUE" : "FALSE")));
 	ereport(DEBUG1,
 			(errmsg("next transaction ID: %u/%u; next OID: %u",
@@ -6537,7 +6517,7 @@ StartupXLOG(void)
 
 			ereport(LOG,
 					(errmsg("redo starts at %X/%X",
-							ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
+							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 
 			/*
 			 * main redo apply loop
@@ -6553,8 +6533,8 @@ StartupXLOG(void)
 
 					initStringInfo(&buf);
 					appendStringInfo(&buf, "REDO @ %X/%X; LSN %X/%X: ",
-									 ReadRecPtr.xlogid, ReadRecPtr.xrecoff,
-									 EndRecPtr.xlogid, EndRecPtr.xrecoff);
+									 (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr,
+									 (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr);
 					xlog_outrec(&buf, record);
 					appendStringInfo(&buf, " - ");
 					RmgrTable[record->xl_rmid].rm_desc(&buf,
@@ -6682,7 +6662,7 @@ StartupXLOG(void)
 
 			ereport(LOG,
 					(errmsg("redo done at %X/%X",
-							ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
+							(uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
 			xtime = GetLatestXTime();
 			if (xtime)
 				ereport(LOG,
@@ -6815,19 +6795,17 @@ StartupXLOG(void)
 	openLogOff = 0;
 	Insert = &XLogCtl->Insert;
 	Insert->PrevRecord = LastRec;
-	XLogCtl->xlblocks[0].xlogid = (openLogSegNo * XLOG_SEG_SIZE) >> 32;
-	XLogCtl->xlblocks[0].xrecoff =
-		((EndOfLog.xrecoff - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
+	XLogCtl->xlblocks[0] = ((EndOfLog - 1) / XLOG_BLCKSZ + 1) * XLOG_BLCKSZ;
 
 	/*
 	 * Tricky point here: readBuf contains the *last* block that the LastRec
 	 * record spans, not the one it starts in.	The last block is indeed the
 	 * one we want to use.
 	 */
-	Assert(readOff == (XLogCtl->xlblocks[0].xrecoff - XLOG_BLCKSZ) % XLogSegSize);
+	Assert(readOff == (XLogCtl->xlblocks[0] - XLOG_BLCKSZ) % XLogSegSize);
 	memcpy((char *) Insert->currpage, readBuf, XLOG_BLCKSZ);
 	Insert->currpos = (char *) Insert->currpage +
-		(EndOfLog.xrecoff + XLOG_BLCKSZ - XLogCtl->xlblocks[0].xrecoff);
+		(EndOfLog + XLOG_BLCKSZ - XLogCtl->xlblocks[0]);
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
@@ -7048,7 +7026,7 @@ CheckRecoveryConsistency(void)
 		reachedConsistency = true;
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
-						EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+						(uint32) (EndRecPtr >> 32), (uint32) EndRecPtr)));
 	}
 
 	/*
@@ -7207,7 +7185,7 @@ ReadCheckpointRecord(XLogRecPtr RecPtr, int whichChkpt)
 {
 	XLogRecord *record;
 
-	if (!XRecOffIsValid(RecPtr.xrecoff))
+	if (!XRecOffIsValid(RecPtr))
 	{
 		switch (whichChkpt)
 		{
@@ -8068,8 +8046,8 @@ RecoveryRestartPoint(const CheckPoint *checkPoint)
 				elog(trace_recovery(DEBUG2),
 					 "RM %d not safe to record restart point at %X/%X",
 					 rmid,
-					 checkPoint->redo.xlogid,
-					 checkPoint->redo.xrecoff);
+					 (uint32) (checkPoint->redo >> 32),
+					 (uint32) checkPoint->redo);
 				return;
 			}
 	}
@@ -8085,8 +8063,8 @@ RecoveryRestartPoint(const CheckPoint *checkPoint)
 		elog(trace_recovery(DEBUG2),
 			 "could not record restart point at %X/%X because there "
 			 "are unresolved references to invalid pages",
-			 checkPoint->redo.xlogid,
-			 checkPoint->redo.xrecoff);
+			 (uint32) (checkPoint->redo >> 32),
+			 (uint32) checkPoint->redo);
 		return;
 	}
 
@@ -8165,7 +8143,7 @@ CreateRestartPoint(int flags)
 	{
 		ereport(DEBUG2,
 				(errmsg("skipping restartpoint, already performed at %X/%X",
-				  lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff)));
+						(uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo)));
 
 		UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
 		if (flags & CHECKPOINT_IS_SHUTDOWN)
@@ -8275,7 +8253,7 @@ CreateRestartPoint(int flags)
 	xtime = GetLatestXTime();
 	ereport((log_checkpoints ? LOG : DEBUG2),
 			(errmsg("recovery restart point at %X/%X",
-					lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff),
+					(uint32) (lastCheckPoint.redo >> 32), (uint32) lastCheckPoint.redo),
 		   xtime ? errdetail("last completed transaction was at log time %s",
 							 timestamptz_to_str(xtime)) : 0));
 
@@ -8401,7 +8379,7 @@ XLogRestorePoint(const char *rpName)
 
 	ereport(LOG,
 			(errmsg("restore point \"%s\" created at %X/%X",
-					rpName, RecPtr.xlogid, RecPtr.xrecoff)));
+					rpName, (uint32) (RecPtr >> 32), (uint32) RecPtr)));
 
 	return RecPtr;
 }
@@ -8750,8 +8728,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		 * decreasing max_* settings.
 		 */
 		minRecoveryPoint = ControlFile->minRecoveryPoint;
-		if ((minRecoveryPoint.xlogid != 0 || minRecoveryPoint.xrecoff != 0)
-			&& XLByteLT(minRecoveryPoint, lsn))
+		if (minRecoveryPoint != 0 && XLByteLT(minRecoveryPoint, lsn))
 		{
 			ControlFile->minRecoveryPoint = lsn;
 		}
@@ -8801,7 +8778,7 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
 		appendStringInfo(buf, "checkpoint: redo %X/%X; "
 						 "tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
 						 "oldest xid %u in DB %u; oldest running xid %u; %s",
-						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
+						 (uint32) (checkpoint->redo >> 32), (uint32) checkpoint->redo,
 						 checkpoint->ThisTimeLineID,
 						 checkpoint->fullPageWrites ? "true" : "false",
 						 checkpoint->nextXidEpoch, checkpoint->nextXid,
@@ -8841,7 +8818,7 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
 
 		memcpy(&startpoint, rec, sizeof(XLogRecPtr));
 		appendStringInfo(buf, "backup end: %X/%X",
-						 startpoint.xlogid, startpoint.xrecoff);
+						 (uint32) (startpoint >> 32), (uint32) startpoint);
 	}
 	else if (info == XLOG_PARAMETER_CHANGE)
 	{
@@ -8887,7 +8864,7 @@ xlog_outrec(StringInfo buf, XLogRecord *record)
 	int			i;
 
 	appendStringInfo(buf, "prev %X/%X; xid %u",
-					 record->xl_prev.xlogid, record->xl_prev.xrecoff,
+					 (uint32) (record->xl_prev >> 32), (uint32) record->xl_prev,
 					 record->xl_xid);
 
 	appendStringInfo(buf, "; len %u",
@@ -9286,9 +9263,9 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 					"%Y-%m-%d %H:%M:%S %Z",
 					pg_localtime(&stamp_time, log_timezone));
 		appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
-						 startpoint.xlogid, startpoint.xrecoff, xlogfilename);
+						 (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename);
 		appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
-						 checkpointloc.xlogid, checkpointloc.xrecoff);
+						 (uint32) (checkpointloc >> 32), (uint32) checkpointloc);
 		appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
 						 exclusive ? "pg_start_backup" : "streamed");
 		appendStringInfo(&labelfbuf, "BACKUP FROM: %s\n",
@@ -9408,6 +9385,8 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	bool		reported_waiting = false;
 	char	   *remaining;
 	char	   *ptr;
+	uint32		hi,
+				lo;
 
 	backup_started_in_recovery = RecoveryInProgress();
 
@@ -9512,11 +9491,12 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	 * but we are not expecting any variability in the file format).
 	 */
 	if (sscanf(labelfile, "START WAL LOCATION: %X/%X (file %24s)%c",
-			   &startpoint.xlogid, &startpoint.xrecoff, startxlogfilename,
+			   &hi, &lo, startxlogfilename,
 			   &ch) != 4 || ch != '\n')
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+	startpoint = ((uint64) hi) << 32 | lo;
 	remaining = strchr(labelfile, '\n') + 1;	/* %n is not portable enough */
 
 	/*
@@ -9624,7 +9604,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 	 */
 	XLByteToSeg(startpoint, _logSegNo);
 	BackupHistoryFilePath(histfilepath, ThisTimeLineID, _logSegNo,
-						  startpoint.xrecoff % XLogSegSize);
+						  (uint32) (startpoint % XLogSegSize));
 	fp = AllocateFile(histfilepath, "w");
 	if (!fp)
 		ereport(ERROR,
@@ -9632,9 +9612,9 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 				 errmsg("could not create file \"%s\": %m",
 						histfilepath)));
 	fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
-			startpoint.xlogid, startpoint.xrecoff, startxlogfilename);
+			(uint32) (startpoint >> 32), (uint32) startpoint, startxlogfilename);
 	fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
-			stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
+			(uint32) (stoppoint >> 32), (uint32) stoppoint, stopxlogfilename);
 	/* transfer remaining lines from label to history file */
 	fprintf(fp, "%s", remaining);
 	fprintf(fp, "STOP TIME: %s\n", strfbuf);
@@ -9677,7 +9657,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive)
 
 		XLByteToSeg(startpoint, _logSegNo);
 		BackupHistoryFileName(histfilename, ThisTimeLineID, _logSegNo,
-							  startpoint.xrecoff % XLogSegSize);
+							  (uint32) (startpoint % XLogSegSize));
 
 		seconds_before_warning = 60;
 		waits = 0;
@@ -9853,6 +9833,8 @@ read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired,
 	char		ch;
 	char		backuptype[20];
 	char		backupfrom[20];
+	uint32		hi,
+				lo;
 
 	*backupEndRequired = false;
 	*backupFromStandby = false;
@@ -9877,17 +9859,18 @@ read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired,
 	 * format).
 	 */
 	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %08X%16s)%c",
-			   &RedoStartLSN.xlogid, &RedoStartLSN.xrecoff, &tli,
-			   startxlogfilename, &ch) != 5 || ch != '\n')
+			   &hi, &lo, &tli, startxlogfilename, &ch) != 5 || ch != '\n')
 		ereport(FATAL,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+	RedoStartLSN = ((uint64) hi) << 32 | lo;
 	if (fscanf(lfp, "CHECKPOINT LOCATION: %X/%X%c",
-			   &checkPointLoc->xlogid, &checkPointLoc->xrecoff,
-			   &ch) != 3 || ch != '\n')
+			   &hi, &lo, &ch) != 3 || ch != '\n')
 		ereport(FATAL,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+	*checkPointLoc = ((uint64) hi) << 32 | lo;
+
 	/*
 	 * BACKUP METHOD and BACKUP FROM lines are new in 9.2. We can't
 	 * restore from an older backup anyway, but since the information on it
@@ -10009,7 +9992,7 @@ static bool
 XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 			 bool randAccess)
 {
-	static XLogRecPtr receivedUpto = {0, 0};
+	static XLogRecPtr receivedUpto = 0;
 	bool		switched_segment = false;
 	uint32		targetPageOff;
 	uint32		targetRecOff;
@@ -10017,8 +10000,8 @@ XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 	static pg_time_t last_fail_time = 0;
 
 	XLByteToSeg(*RecPtr, targetSegNo);
-	targetPageOff = ((RecPtr->xrecoff % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
-	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
+	targetPageOff = (((*RecPtr) % XLogSegSize) / XLOG_BLCKSZ) * XLOG_BLCKSZ;
+	targetRecOff = (*RecPtr) % XLOG_BLCKSZ;
 
 	/* Fast exit if we have read the record in the current buffer already */
 	if (failedSources == 0 && targetSegNo == readSegNo &&
@@ -10299,13 +10282,12 @@ retry:
 	 */
 	if (readSource == XLOG_FROM_STREAM)
 	{
-		if (RecPtr->xlogid != receivedUpto.xlogid ||
-			(RecPtr->xrecoff / XLOG_BLCKSZ) != (receivedUpto.xrecoff / XLOG_BLCKSZ))
+		if (((*RecPtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
 		{
 			readLen = XLOG_BLCKSZ;
 		}
 		else
-			readLen = receivedUpto.xrecoff % XLogSegSize - targetPageOff;
+			readLen = receivedUpto % XLogSegSize - targetPageOff;
 	}
 	else
 		readLen = XLOG_BLCKSZ;
@@ -10411,7 +10393,7 @@ triggered:
 static int
 emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
-	static XLogRecPtr lastComplaint = {0, 0};
+	static XLogRecPtr lastComplaint = 0;
 
 	if (readSource == XLOG_FROM_PG_XLOG && emode == LOG)
 	{
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index a289baa..fd60448 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -57,7 +57,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
 	startpoint = do_pg_start_backup(backupidstr, fast, NULL);
 
 	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
-			 startpoint.xlogid, startpoint.xrecoff);
+			 (uint32) (startpoint >> 32), (uint32) startpoint);
 	PG_RETURN_TEXT_P(cstring_to_text(startxlogstr));
 }
 
@@ -83,7 +83,7 @@ pg_stop_backup(PG_FUNCTION_ARGS)
 	stoppoint = do_pg_stop_backup(NULL, true);
 
 	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
-			 stoppoint.xlogid, stoppoint.xrecoff);
+			 (uint32) (stoppoint >> 32), (uint32) stoppoint);
 	PG_RETURN_TEXT_P(cstring_to_text(stopxlogstr));
 }
 
@@ -113,7 +113,7 @@ pg_switch_xlog(PG_FUNCTION_ARGS)
 	 * As a convenience, return the WAL location of the switch record
 	 */
 	snprintf(location, sizeof(location), "%X/%X",
-			 switchpoint.xlogid, switchpoint.xrecoff);
+			 (uint32) (switchpoint >> 32), (uint32) switchpoint);
 	PG_RETURN_TEXT_P(cstring_to_text(location));
 }
 
@@ -158,7 +158,7 @@ pg_create_restore_point(PG_FUNCTION_ARGS)
 	 * As a convenience, return the WAL location of the restore point record
 	 */
 	snprintf(location, sizeof(location), "%X/%X",
-			 restorepoint.xlogid, restorepoint.xrecoff);
+			 (uint32) (restorepoint >> 32), (uint32) restorepoint);
 	PG_RETURN_TEXT_P(cstring_to_text(location));
 }
 
@@ -184,7 +184,7 @@ pg_current_xlog_location(PG_FUNCTION_ARGS)
 	current_recptr = GetXLogWriteRecPtr();
 
 	snprintf(location, sizeof(location), "%X/%X",
-			 current_recptr.xlogid, current_recptr.xrecoff);
+			 (uint32) (current_recptr >> 32), (uint32) current_recptr);
 	PG_RETURN_TEXT_P(cstring_to_text(location));
 }
 
@@ -208,7 +208,7 @@ pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
 	current_recptr = GetXLogInsertRecPtr();
 
 	snprintf(location, sizeof(location), "%X/%X",
-			 current_recptr.xlogid, current_recptr.xrecoff);
+			 (uint32) (current_recptr >> 32), (uint32) current_recptr);
 	PG_RETURN_TEXT_P(cstring_to_text(location));
 }
 
@@ -226,11 +226,11 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
 
 	recptr = GetWalRcvWriteRecPtr(NULL);
 
-	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
+	if (recptr == 0)
 		PG_RETURN_NULL();
 
 	snprintf(location, sizeof(location), "%X/%X",
-			 recptr.xlogid, recptr.xrecoff);
+			 (uint32) (recptr >> 32), (uint32) recptr);
 	PG_RETURN_TEXT_P(cstring_to_text(location));
 }
 
@@ -248,11 +248,11 @@ pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
 
 	recptr = GetXLogReplayRecPtr(NULL);
 
-	if (recptr.xlogid == 0 && recptr.xrecoff == 0)
+	if (recptr == 0)
 		PG_RETURN_NULL();
 
 	snprintf(location, sizeof(location), "%X/%X",
-			 recptr.xlogid, recptr.xrecoff);
+			 (uint32) (recptr >> 32), (uint32) recptr);
 	PG_RETURN_TEXT_P(cstring_to_text(location));
 }
 
@@ -269,8 +269,8 @@ pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
 {
 	text	   *location = PG_GETARG_TEXT_P(0);
 	char	   *locationstr;
-	unsigned int uxlogid;
-	unsigned int uxrecoff;
+	uint32		hi,
+				lo;
 	XLogSegNo	xlogsegno;
 	uint32		xrecoff;
 	XLogRecPtr	locationpoint;
@@ -294,14 +294,12 @@ pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
 
 	validate_xlog_location(locationstr);
 
-	if (sscanf(locationstr, "%X/%X", &uxlogid, &uxrecoff) != 2)
+	if (sscanf(locationstr, "%X/%X", &hi, &lo) != 2)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("could not parse transaction log location \"%s\"",
 						locationstr)));
-
-	locationpoint.xlogid = uxlogid;
-	locationpoint.xrecoff = uxrecoff;
+	locationpoint = ((uint64) hi) << 32 | lo;
 
 	/*
 	 * Construct a tuple descriptor for the result row.  This must match this
@@ -327,7 +325,7 @@ pg_xlogfile_name_offset(PG_FUNCTION_ARGS)
 	/*
 	 * offset
 	 */
-	xrecoff = locationpoint.xrecoff % XLogSegSize;
+	xrecoff = locationpoint % XLogSegSize;
 
 	values[1] = UInt32GetDatum(xrecoff);
 	isnull[1] = false;
@@ -351,8 +349,8 @@ pg_xlogfile_name(PG_FUNCTION_ARGS)
 {
 	text	   *location = PG_GETARG_TEXT_P(0);
 	char	   *locationstr;
-	unsigned int uxlogid;
-	unsigned int uxrecoff;
+	uint32		hi,
+				lo;
 	XLogSegNo	xlogsegno;
 	XLogRecPtr	locationpoint;
 	char		xlogfilename[MAXFNAMELEN];
@@ -367,14 +365,12 @@ pg_xlogfile_name(PG_FUNCTION_ARGS)
 
 	validate_xlog_location(locationstr);
 
-	if (sscanf(locationstr, "%X/%X", &uxlogid, &uxrecoff) != 2)
+	if (sscanf(locationstr, "%X/%X", &hi, &lo) != 2)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("could not parse transaction log location \"%s\"",
 						locationstr)));
-
-	locationpoint.xlogid = uxlogid;
-	locationpoint.xrecoff = uxrecoff;
+	locationpoint = ((uint64) hi) << 32 | lo;
 
 	XLByteToPrevSeg(locationpoint, xlogsegno);
 	XLogFileName(xlogfilename, ThisTimeLineID, xlogsegno);
@@ -514,6 +510,8 @@ pg_xlog_location_diff(PG_FUNCTION_ARGS)
 	Numeric		result;
 	uint64		bytes1,
 				bytes2;
+	uint32		hi,
+				lo;
 
 	/*
 	 * Read and parse input
@@ -524,17 +522,20 @@ pg_xlog_location_diff(PG_FUNCTION_ARGS)
 	validate_xlog_location(str1);
 	validate_xlog_location(str2);
 
-	if (sscanf(str1, "%X/%X", &loc1.xlogid, &loc1.xrecoff) != 2)
+	if (sscanf(str1, "%X/%X", &hi, &lo) != 2)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 		   errmsg("could not parse transaction log location \"%s\"", str1)));
-	if (sscanf(str2, "%X/%X", &loc2.xlogid, &loc2.xrecoff) != 2)
+	loc1 = ((uint64) hi) << 32 | lo;
+
+	if (sscanf(str2, "%X/%X", &hi, &lo) != 2)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 		   errmsg("could not parse transaction log location \"%s\"", str2)));
+	loc2 = ((uint64) hi) << 32 | lo;
 
-	bytes1 = (((uint64)loc1.xlogid) << 32L) + loc1.xrecoff;
-	bytes2 = (((uint64)loc2.xlogid) << 32L) + loc2.xrecoff;
+	bytes1 = (uint64) loc1;
+	bytes2 = (uint64) loc2;
 
 	/*
 	 * result = bytes1 - bytes2.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 39229eb..e4cbdcd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -629,7 +629,7 @@ CheckArchiveTimeout(void)
 		 * If the returned pointer points exactly to a segment boundary,
 		 * assume nothing happened.
 		 */
-		if ((switchpoint.xrecoff % XLogSegSize) != 0)
+		if ((switchpoint % XLogSegSize) != 0)
 			ereport(DEBUG1,
 				(errmsg("transaction log switch forced (archive_timeout=%d)",
 						XLogArchiveTimeout)));
@@ -775,10 +775,7 @@ IsCheckpointOnSchedule(double progress)
 	if (!RecoveryInProgress())
 	{
 		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) ((uint64) (recptr.xlogid - ckpt_start_recptr.xlogid) << 32L)) +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
+		elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
 
 		if (progress < elapsed_xlogs)
 		{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index f5b8e32..594e4dd 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -496,7 +496,7 @@ SendXlogRecPtrResult(XLogRecPtr ptr)
 	StringInfoData buf;
 	char		str[MAXFNAMELEN];
 
-	snprintf(str, sizeof(str), "%X/%X", ptr.xlogid, ptr.xrecoff);
+	snprintf(str, sizeof(str), "%X/%X", (uint32) (ptr >> 32), (uint32) ptr);
 
 	pq_beginmessage(&buf, 'T'); /* RowDescription */
 	pq_sendint(&buf, 1, 2);		/* 1 field */
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 979b66b..bfaebea 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -156,7 +156,7 @@ libpqrcv_connect(char *conninfo, XLogRecPtr startpoint)
 
 	/* Start streaming from the point requested by startup process */
 	snprintf(cmd, sizeof(cmd), "START_REPLICATION %X/%X",
-			 startpoint.xlogid, startpoint.xrecoff);
+			 (uint32) (startpoint >> 32), (uint32) startpoint);
 	res = libpqrcv_PQexec(cmd);
 	if (PQresultStatus(res) != PGRES_COPY_BOTH)
 	{
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 9d4edcf..51f381d 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -72,8 +72,11 @@ START_REPLICATION	{ return K_START_REPLICATION; }
 " "				;
 
 {hexdigit}+\/{hexdigit}+		{
-					if (sscanf(yytext, "%X/%X", &yylval.recptr.xlogid, &yylval.recptr.xrecoff) != 2)
+					uint32	hi,
+							lo;
+					if (sscanf(yytext, "%X/%X", &hi, &lo) != 2)
 						yyerror("invalid streaming start location");
+					yylval.recptr = ((uint64) hi) << 32 | lo;
 					return RECPTR;
 				}
 
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 8977327..70a7020 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -145,7 +145,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
 		new_status = (char *) palloc(len + 32 + 1);
 		memcpy(new_status, old_status, len);
 		sprintf(new_status + len, " waiting for %X/%X",
-				XactCommitLSN.xlogid, XactCommitLSN.xrecoff);
+				(uint32) (XactCommitLSN >> 32), (uint32) XactCommitLSN);
 		set_ps_display(new_status, false);
 		new_status[len] = '\0'; /* truncate off " waiting ..." */
 	}
@@ -255,8 +255,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
 	 */
 	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
 	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
-	MyProc->waitLSN.xlogid = 0;
-	MyProc->waitLSN.xrecoff = 0;
+	MyProc->waitLSN = 0;
 
 	if (new_status)
 	{
@@ -439,12 +438,8 @@ SyncRepReleaseWaiters(void)
 	LWLockRelease(SyncRepLock);
 
 	elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
-		 numwrite,
-		 MyWalSnd->write.xlogid,
-		 MyWalSnd->write.xrecoff,
-		 numflush,
-		 MyWalSnd->flush.xlogid,
-		 MyWalSnd->flush.xrecoff);
+		 numwrite, (uint32) (MyWalSnd->write >> 32), (uint32) MyWalSnd->write,
+		 numflush, (uint32) (MyWalSnd->flush >> 32), (uint32) MyWalSnd->flush);
 
 	/*
 	 * If we are managing the highest priority standby, though we weren't
@@ -629,8 +624,7 @@ SyncRepQueueIsOrderedByLSN(int mode)
 
 	Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
 
-	lastLSN.xlogid = 0;
-	lastLSN.xrecoff = 0;
+	lastLSN = 0;
 
 	proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
 								   &(WalSndCtl->SyncRepQueue[mode]),
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 8cbfd7b..9930fde 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -516,7 +516,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		}
 
 		/* Calculate the start offset of the received logs */
-		startoff = recptr.xrecoff % XLogSegSize;
+		startoff = recptr % XLogSegSize;
 
 		if (startoff + nbytes > XLogSegSize)
 			segbytes = XLogSegSize - startoff;
@@ -601,8 +601,8 @@ XLogWalRcvFlush(bool dying)
 			char		activitymsg[50];
 
 			snprintf(activitymsg, sizeof(activitymsg), "streaming %X/%X",
-					 LogstreamResult.Write.xlogid,
-					 LogstreamResult.Write.xrecoff);
+					 (uint32) (LogstreamResult.Write >> 32),
+					 (uint32) LogstreamResult.Write);
 			set_ps_display(activitymsg, false);
 		}
 
@@ -657,9 +657,9 @@ XLogWalRcvSendReply(void)
 	reply_message.sendTime = now;
 
 	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
-		 reply_message.write.xlogid, reply_message.write.xrecoff,
-		 reply_message.flush.xlogid, reply_message.flush.xrecoff,
-		 reply_message.apply.xlogid, reply_message.apply.xrecoff);
+		 (uint32) (reply_message.write >> 32), (uint32) reply_message.write,
+		 (uint32) (reply_message.flush >> 32), (uint32) reply_message.flush,
+		 (uint32) (reply_message.apply >> 32), (uint32) reply_message.apply);
 
 	/* Prepend with the message type and send it. */
 	buf[0] = 'r';
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index f8dd523..7ad4da3 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -185,8 +185,8 @@ RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
 	 * being created by XLOG streaming, which might cause trouble later on if
 	 * the segment is e.g archived.
 	 */
-	if (recptr.xrecoff % XLogSegSize != 0)
-		recptr.xrecoff -= recptr.xrecoff % XLogSegSize;
+	if (recptr % XLogSegSize != 0)
+		recptr -= recptr % XLogSegSize;
 
 	SpinLockAcquire(&walrcv->mutex);
 
@@ -204,8 +204,7 @@ RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo)
 	 * If this is the first startup of walreceiver, we initialize receivedUpto
 	 * and latestChunkStart to receiveStart.
 	 */
-	if (walrcv->receiveStart.xlogid == 0 &&
-		walrcv->receiveStart.xrecoff == 0)
+	if (walrcv->receiveStart == 0)
 	{
 		walrcv->receivedUpto = recptr;
 		walrcv->latestChunkStart = recptr;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3b26eff..7e22fa1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -93,7 +93,7 @@ static uint32 sendOff = 0;
  * How far have we sent WAL already? This is also advertised in
  * MyWalSnd->sentPtr.  (Actually, this is the next WAL location to send.)
  */
-static XLogRecPtr sentPtr = {0, 0};
+static XLogRecPtr sentPtr = 0;
 
 /*
  * Buffer for processing reply messages.
@@ -299,8 +299,7 @@ IdentifySystem(void)
 
 	logptr = am_cascading_walsender ? GetStandbyFlushRecPtr() : GetInsertRecPtr();
 
-	snprintf(xpos, sizeof(xpos), "%X/%X",
-			 logptr.xlogid, logptr.xrecoff);
+	snprintf(xpos, sizeof(xpos), "%X/%X", (uint32) (logptr >> 32), (uint32) logptr);
 
 	/* Send a RowDescription message */
 	pq_beginmessage(&buf, 'T');
@@ -612,9 +611,9 @@ ProcessStandbyReplyMessage(void)
 	pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
 
 	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X",
-		 reply.write.xlogid, reply.write.xrecoff,
-		 reply.flush.xlogid, reply.flush.xrecoff,
-		 reply.apply.xlogid, reply.apply.xrecoff);
+		 (uint32) (reply.write << 32), (uint32) reply.write,
+		 (uint32) (reply.flush << 32), (uint32) reply.flush,
+		 (uint32) (reply.apply << 32), (uint32) reply.apply);
 
 	/*
 	 * Update shared state for this WalSender process based on reply data from
@@ -989,7 +988,7 @@ retry:
 		int			segbytes;
 		int			readbytes;
 
-		startoff = recptr.xrecoff % XLogSegSize;
+		startoff = recptr % XLogSegSize;
 
 		if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
 		{
@@ -1155,12 +1154,6 @@ XLogSend(char *msgbuf, bool *caughtup)
 	startptr = sentPtr;
 	endptr = startptr;
 	XLByteAdvance(endptr, MAX_SEND_SIZE);
-	if (endptr.xlogid != startptr.xlogid)
-	{
-		/* Don't cross a logfile boundary within one message */
-		Assert(endptr.xlogid == startptr.xlogid + 1);
-		endptr.xrecoff = 0;
-	}
 
 	/* if we went beyond SendRqstPtr, back off */
 	if (XLByteLE(SendRqstPtr, endptr))
@@ -1171,14 +1164,11 @@ XLogSend(char *msgbuf, bool *caughtup)
 	else
 	{
 		/* round down to page boundary. */
-		endptr.xrecoff -= (endptr.xrecoff % XLOG_BLCKSZ);
+		endptr -= (endptr % XLOG_BLCKSZ);
 		*caughtup = false;
 	}
 
-	if (endptr.xrecoff == 0)
-		nbytes = 0x100000000L - (uint64) startptr.xrecoff;
-	else
-		nbytes = endptr.xrecoff - startptr.xrecoff;
+	nbytes = endptr - startptr;
 	Assert(nbytes <= MAX_SEND_SIZE);
 
 	/*
@@ -1222,7 +1212,7 @@ XLogSend(char *msgbuf, bool *caughtup)
 		char		activitymsg[50];
 
 		snprintf(activitymsg, sizeof(activitymsg), "streaming %X/%X",
-				 sentPtr.xlogid, sentPtr.xrecoff);
+				 (uint32) (sentPtr >> 32), (uint32) sentPtr);
 		set_ps_display(activitymsg, false);
 	}
 
@@ -1564,25 +1554,25 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 			values[1] = CStringGetTextDatum(WalSndGetStateString(state));
 
 			snprintf(location, sizeof(location), "%X/%X",
-					 sentPtr.xlogid, sentPtr.xrecoff);
+					 (uint32) (sentPtr >> 32), (uint32) sentPtr);
 			values[2] = CStringGetTextDatum(location);
 
-			if (write.xlogid == 0 && write.xrecoff == 0)
+			if (write == 0)
 				nulls[3] = true;
 			snprintf(location, sizeof(location), "%X/%X",
-					 write.xlogid, write.xrecoff);
+					 (uint32) (write >> 32), (uint32) write);
 			values[3] = CStringGetTextDatum(location);
 
-			if (flush.xlogid == 0 && flush.xrecoff == 0)
+			if (flush == 0)
 				nulls[4] = true;
 			snprintf(location, sizeof(location), "%X/%X",
-					 flush.xlogid, flush.xrecoff);
+					 (uint32) (flush >> 32), (uint32) flush);
 			values[4] = CStringGetTextDatum(location);
 
-			if (apply.xlogid == 0 && apply.xrecoff == 0)
+			if (apply == 0)
 				nulls[5] = true;
 			snprintf(location, sizeof(location), "%X/%X",
-					 apply.xlogid, apply.xrecoff);
+					 (uint32) (apply >> 32), (uint32) apply);
 			values[5] = CStringGetTextDatum(location);
 
 			values[6] = Int32GetDatum(sync_priority[i]);
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 3a6831c..3e757a7 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -936,7 +936,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		elog(trace_recovery(DEBUG2),
 			 "snapshot of %u running transactions overflowed (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
 			 CurrRunningXacts->xcnt,
-			 recptr.xlogid, recptr.xrecoff,
+			 (uint32) (recptr >> 32), (uint32) recptr,
 			 CurrRunningXacts->oldestRunningXid,
 			 CurrRunningXacts->latestCompletedXid,
 			 CurrRunningXacts->nextXid);
@@ -944,7 +944,7 @@ LogCurrentRunningXacts(RunningTransactions CurrRunningXacts)
 		elog(trace_recovery(DEBUG2),
 			 "snapshot of %u running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
 			 CurrRunningXacts->xcnt,
-			 recptr.xlogid, recptr.xrecoff,
+			 (uint32) (recptr >> 32), (uint32) recptr,
 			 CurrRunningXacts->oldestRunningXid,
 			 CurrRunningXacts->latestCompletedXid,
 			 CurrRunningXacts->nextXid);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 458cd27..662bf53 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -376,8 +376,7 @@ InitProcess(void)
 	MyProc->recoveryConflictPending = false;
 
 	/* Initialize fields for sync rep */
-	MyProc->waitLSN.xlogid = 0;
-	MyProc->waitLSN.xrecoff = 0;
+	MyProc->waitLSN = 0;
 	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
 	SHMQueueElemInit(&(MyProc->syncRepLinks));
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d746616..bda23fa 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -159,6 +159,8 @@ reached_end_position(XLogRecPtr segendpos, uint32 timeline, bool segment_finishe
 		if (r == 1)
 		{
 			char		xlogend[64];
+			uint32		hi,
+						lo;
 
 			MemSet(xlogend, 0, sizeof(xlogend));
 			r = read(bgpipe[0], xlogend, sizeof(xlogend));
@@ -169,12 +171,13 @@ reached_end_position(XLogRecPtr segendpos, uint32 timeline, bool segment_finishe
 				exit(1);
 			}
 
-			if (sscanf(xlogend, "%X/%X", &xlogendptr.xlogid, &xlogendptr.xrecoff) != 2)
+			if (sscanf(xlogend, "%X/%X", &hi, &lo) != 2)
 			{
 				fprintf(stderr, _("%s: could not parse xlog end position \"%s\"\n"),
 						progname, xlogend);
 				exit(1);
 			}
+			xlogendptr = ((uint64) hi) << 32 | lo;
 			has_xlogendptr = 1;
 
 			/*
@@ -204,9 +207,7 @@ reached_end_position(XLogRecPtr segendpos, uint32 timeline, bool segment_finishe
 	 * At this point we have an end pointer, so compare it to the current
 	 * position to figure out if it's time to stop.
 	 */
-	if (segendpos.xlogid > xlogendptr.xlogid ||
-		(segendpos.xlogid == xlogendptr.xlogid &&
-		 segendpos.xrecoff >= xlogendptr.xrecoff))
+	if (segendpos >= xlogendptr)
 		return true;
 
 	/*
@@ -252,20 +253,23 @@ static void
 StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 {
 	logstreamer_param *param;
+	uint32		hi,
+				lo;
 
 	param = xmalloc0(sizeof(logstreamer_param));
 	param->timeline = timeline;
 	param->sysidentifier = sysidentifier;
 
 	/* Convert the starting position */
-	if (sscanf(startpos, "%X/%X", &param->startptr.xlogid, &param->startptr.xrecoff) != 2)
+	if (sscanf(startpos, "%X/%X", &hi, &lo) != 2)
 	{
 		fprintf(stderr, _("%s: invalid format of xlog location: %s\n"),
 				progname, startpos);
 		disconnect_and_exit(1);
 	}
+	param->startptr = ((uint64) hi) << 32 | lo;
 	/* Round off to even segment position */
-	param->startptr.xrecoff -= param->startptr.xrecoff % XLOG_SEG_SIZE;
+	param->startptr -= param->startptr % XLOG_SEG_SIZE;
 
 #ifndef WIN32
 	/* Create our background pipe */
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index 5a7ad81..9bc46ac 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -78,7 +78,9 @@ stop_streaming(XLogRecPtr segendpos, uint32 timeline, bool segment_finished)
 {
 	if (verbose && segment_finished)
 		fprintf(stderr, _("%s: finished segment at %X/%X (timeline %u)\n"),
-				progname, segendpos.xlogid, segendpos.xrecoff, timeline);
+				progname,
+				(uint32) (segendpos >> 32), (uint32) segendpos,
+				timeline);
 
 	if (time_to_abort)
 	{
@@ -212,6 +214,8 @@ StreamLog(void)
 	PGresult   *res;
 	uint32		timeline;
 	XLogRecPtr	startpos;
+	uint32		hi,
+				lo;
 
 	/*
 	 * Connect in replication mode to the server
@@ -239,12 +243,13 @@ StreamLog(void)
 		disconnect_and_exit(1);
 	}
 	timeline = atoi(PQgetvalue(res, 0, 1));
-	if (sscanf(PQgetvalue(res, 0, 2), "%X/%X", &startpos.xlogid, &startpos.xrecoff) != 2)
+	if (sscanf(PQgetvalue(res, 0, 2), "%X/%X", &hi, &lo) != 2)
 	{
 		fprintf(stderr, _("%s: could not parse log start position from value \"%s\"\n"),
 				progname, PQgetvalue(res, 0, 2));
 		disconnect_and_exit(1);
 	}
+	startpos = ((uint64) hi) << 32 | lo;
 	PQclear(res);
 
 	/*
@@ -255,14 +260,16 @@ StreamLog(void)
 	/*
 	 * Always start streaming at the beginning of a segment
 	 */
-	startpos.xrecoff -= startpos.xrecoff % XLOG_SEG_SIZE;
+	startpos -= startpos % XLOG_SEG_SIZE;
 
 	/*
 	 * Start the replication
 	 */
 	if (verbose)
 		fprintf(stderr, _("%s: starting log streaming at %X/%X (timeline %u)\n"),
-				progname, startpos.xlogid, startpos.xrecoff, timeline);
+				progname,
+				(uint32) (startpos >> 32), (uint32) startpos,
+				timeline);
 
 	ReceiveXlogStream(conn, startpos, timeline, NULL, basedir,
 					  stop_streaming,
diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 6cb209b..25e00a0 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -37,8 +37,6 @@
 #define STREAMING_HEADER_SIZE (1+sizeof(WalDataMessageHeader))
 #define STREAMING_KEEPALIVE_SIZE (1+sizeof(PrimaryKeepaliveMessage))
 
-const XLogRecPtr InvalidXLogRecPtr = {0, 0};
-
 /*
  * Open a new WAL file in the specified directory. Store the name
  * (not including the full directory) in namebuf. Assumes there is
@@ -264,7 +262,8 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, char *sysi
 	}
 
 	/* Initiate the replication stream at specified location */
-	snprintf(query, sizeof(query), "START_REPLICATION %X/%X", startpos.xlogid, startpos.xrecoff);
+	snprintf(query, sizeof(query), "START_REPLICATION %X/%X",
+			 (uint32) (startpos >> 32), (uint32) startpos);
 	res = PQexec(conn, query);
 	if (PQresultStatus(res) != PGRES_COPY_BOTH)
 	{
@@ -418,7 +417,7 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, char *sysi
 
 		/* Extract WAL location for this block */
 		memcpy(&blockpos, copybuf + 1, 8);
-		xlogoff = blockpos.xrecoff % XLOG_SEG_SIZE;
+		xlogoff = blockpos % XLOG_SEG_SIZE;
 
 		/*
 		 * Verify that the initial location in the stream matches where we
@@ -490,7 +489,7 @@ ReceiveXlogStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline, char *sysi
 			xlogoff += bytes_to_write;
 
 			/* Did we reach the end of a WAL segment? */
-			if (blockpos.xrecoff % XLOG_SEG_SIZE == 0)
+			if (blockpos % XLOG_SEG_SIZE == 0)
 			{
 				if (!close_walfile(walfile, basedir, current_walfile_name, false))
 					/* Error message written in close_walfile() */
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index c00183a..55d3bd9 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -199,14 +199,14 @@ main(int argc, char *argv[])
 	printf(_("pg_control last modified:             %s\n"),
 		   pgctime_str);
 	printf(_("Latest checkpoint location:           %X/%X\n"),
-		   ControlFile.checkPoint.xlogid,
-		   ControlFile.checkPoint.xrecoff);
+		   (uint32) (ControlFile.checkPoint >> 32),
+		   (uint32) ControlFile.checkPoint);
 	printf(_("Prior checkpoint location:            %X/%X\n"),
-		   ControlFile.prevCheckPoint.xlogid,
-		   ControlFile.prevCheckPoint.xrecoff);
+		   (uint32) (ControlFile.prevCheckPoint >> 32),
+		   (uint32) ControlFile.prevCheckPoint);
 	printf(_("Latest checkpoint's REDO location:    %X/%X\n"),
-		   ControlFile.checkPointCopy.redo.xlogid,
-		   ControlFile.checkPointCopy.redo.xrecoff);
+		   (uint32) (ControlFile.checkPointCopy.redo >> 32),
+		   (uint32) ControlFile.checkPointCopy.redo);
 	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
 		   ControlFile.checkPointCopy.ThisTimeLineID);
 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
@@ -229,14 +229,14 @@ main(int argc, char *argv[])
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Minimum recovery ending location:     %X/%X\n"),
-		   ControlFile.minRecoveryPoint.xlogid,
-		   ControlFile.minRecoveryPoint.xrecoff);
+		   (uint32) (ControlFile.minRecoveryPoint >> 32),
+		   (uint32) ControlFile.minRecoveryPoint);
 	printf(_("Backup start location:                %X/%X\n"),
-		   ControlFile.backupStartPoint.xlogid,
-		   ControlFile.backupStartPoint.xrecoff);
+		   (uint32) (ControlFile.backupStartPoint >> 32),
+		   (uint32) ControlFile.backupStartPoint);
 	printf(_("Backup end location:                  %X/%X\n"),
-		   ControlFile.backupEndPoint.xlogid,
-		   ControlFile.backupEndPoint.xrecoff);
+		   (uint32) (ControlFile.backupEndPoint >> 32),
+		   (uint32) ControlFile.backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile.backupEndRequired ? _("yes") : _("no"));
 	printf(_("Current wal_level setting:            %s\n"),
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 15f2b27..e500b7d 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -463,8 +463,7 @@ GuessControlValues(void)
 
 	ControlFile.system_identifier = sysidentifier;
 
-	ControlFile.checkPointCopy.redo.xlogid = 0;
-	ControlFile.checkPointCopy.redo.xrecoff = SizeOfXLogLongPHD;
+	ControlFile.checkPointCopy.redo = SizeOfXLogLongPHD;
 	ControlFile.checkPointCopy.ThisTimeLineID = 1;
 	ControlFile.checkPointCopy.fullPageWrites = false;
 	ControlFile.checkPointCopy.nextXidEpoch = 0;
@@ -611,14 +610,10 @@ RewriteControlFile(void)
 	ControlFile.state = DB_SHUTDOWNED;
 	ControlFile.time = (pg_time_t) time(NULL);
 	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
-	ControlFile.prevCheckPoint.xlogid = 0;
-	ControlFile.prevCheckPoint.xrecoff = 0;
-	ControlFile.minRecoveryPoint.xlogid = 0;
-	ControlFile.minRecoveryPoint.xrecoff = 0;
-	ControlFile.backupStartPoint.xlogid = 0;
-	ControlFile.backupStartPoint.xrecoff = 0;
-	ControlFile.backupEndPoint.xlogid = 0;
-	ControlFile.backupEndPoint.xrecoff = 0;
+	ControlFile.prevCheckPoint = 0;
+	ControlFile.minRecoveryPoint = 0;
+	ControlFile.backupStartPoint = 0;
+	ControlFile.backupEndPoint = 0;
 	ControlFile.backupEndRequired = false;
 
 	/*
@@ -714,8 +709,7 @@ FindEndOfXLOG(void)
 	 * numbering according to the old xlog seg size.
 	 */
 	segs_per_xlogid = (0x100000000L / ControlFile.xlog_seg_size);
-	newXlogSegNo = ((uint64) ControlFile.checkPointCopy.redo.xlogid) * segs_per_xlogid
-		+ (ControlFile.checkPointCopy.redo.xrecoff / ControlFile.xlog_seg_size);
+	newXlogSegNo = ControlFile.checkPointCopy.redo / ControlFile.xlog_seg_size;
 
 	/*
 	 * Scan the pg_xlog directory to find existing WAL segment files. We
@@ -919,10 +913,7 @@ WriteEmptyXLOG(void)
 	page->xlp_magic = XLOG_PAGE_MAGIC;
 	page->xlp_info = XLP_LONG_HEADER;
 	page->xlp_tli = ControlFile.checkPointCopy.ThisTimeLineID;
-	page->xlp_pageaddr.xlogid =
-		ControlFile.checkPointCopy.redo.xlogid;
-	page->xlp_pageaddr.xrecoff =
-		ControlFile.checkPointCopy.redo.xrecoff - SizeOfXLogLongPHD;
+	page->xlp_pageaddr = ControlFile.checkPointCopy.redo - SizeOfXLogLongPHD;
 	longpage = (XLogLongPageHeader) page;
 	longpage->xlp_sysid = ControlFile.system_identifier;
 	longpage->xlp_seg_size = XLogSegSize;
@@ -930,8 +921,7 @@ WriteEmptyXLOG(void)
 
 	/* Insert the initial checkpoint record */
 	record = (XLogRecord *) ((char *) page + SizeOfXLogLongPHD);
-	record->xl_prev.xlogid = 0;
-	record->xl_prev.xrecoff = 0;
+	record->xl_prev = 0;
 	record->xl_xid = InvalidTransactionId;
 	record->xl_tot_len = SizeOfXLogRecord + sizeof(CheckPoint);
 	record->xl_len = sizeof(CheckPoint);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 9373865..228f6a1 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -139,10 +139,6 @@ extern bool TransactionStartedDuringRecovery(void);
 /* in transam/varsup.c */
 extern PGDLLIMPORT VariableCache ShmemVariableCache;
 
-/* in transam/transam.c */
-extern const XLogRecPtr InvalidXLogRecPtr;
-
-
 /*
  * prototypes for functions in transam/transam.c
  */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index a958856..54208f5 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -113,10 +113,7 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define XLogSegmentsPerXLogId	(0x100000000L / XLOG_SEG_SIZE)
 
 #define XLogSegNoOffsetToRecPtr(segno, offset, dest) \
-	do {	\
-		(dest).xlogid = (segno) / XLogSegmentsPerXLogId;				\
-		(dest).xrecoff = ((segno) % XLogSegmentsPerXLogId) * XLOG_SEG_SIZE + (offset); \
-	} while (0)
+		(dest) = (segno) * XLOG_SEG_SIZE + (offset)
 
 /*
  * Macros for manipulating XLOG pointers
@@ -125,8 +122,8 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 /* Align a record pointer to next page */
 #define NextLogPage(recptr) \
 	do {	\
-		if ((recptr).xrecoff % XLOG_BLCKSZ != 0)	\
-			XLByteAdvance(recptr, (XLOG_BLCKSZ - (recptr).xrecoff % XLOG_BLCKSZ)); \
+		if ((recptr) % XLOG_BLCKSZ != 0)	\
+			XLByteAdvance(recptr, (XLOG_BLCKSZ - (recptr) % XLOG_BLCKSZ)); \
 	} while (0)
 
 /*
@@ -135,14 +132,13 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
  * For XLByteToSeg, do the computation at face value.  For XLByteToPrevSeg,
  * a boundary byte is taken to be in the previous segment.	This is suitable
  * for deciding which segment to write given a pointer to a record end,
- * for example.  (We can assume xrecoff is not zero, since no valid recptr
- * can have that.)
+ * for example.
  */
 #define XLByteToSeg(xlrp, logSegNo)	\
-	logSegNo = ((uint64) (xlrp).xlogid * XLogSegmentsPerXLogId) + (xlrp).xrecoff / XLogSegSize
+	logSegNo = (xlrp) / XLogSegSize
 
 #define XLByteToPrevSeg(xlrp, logSegNo)	\
-	logSegNo = ((uint64) (xlrp).xlogid * XLogSegmentsPerXLogId) + ((xlrp).xrecoff - 1) / XLogSegSize
+	logSegNo = ((xlrp) - 1) / XLogSegSize
 
 /*
  * Is an XLogRecPtr within a particular XLOG segment?
@@ -151,20 +147,15 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
  * a boundary byte is taken to be in the previous segment.
  */
 #define XLByteInSeg(xlrp, logSegNo)	\
-	(((xlrp).xlogid) == (logSegNo) / XLogSegmentsPerXLogId &&			\
-	 ((xlrp).xrecoff / XLogSegSize) == (logSegNo) % XLogSegmentsPerXLogId)
+	(((xlrp) / XLogSegSize) == (logSegNo))
 
 #define XLByteInPrevSeg(xlrp, logSegNo)	\
-	(((xlrp).xrecoff == 0) ?											\
-		(((xlrp).xlogid - 1) == (logSegNo) / XLogSegmentsPerXLogId && \
-		 ((uint32) 0xffffffff) / XLogSegSize == (logSegNo) % XLogSegmentsPerXLogId) : \
-		((xlrp).xlogid) == (logSegNo) / XLogSegmentsPerXLogId &&	\
-		 (((xlrp).xrecoff - 1) / XLogSegSize) == (logSegNo) % XLogSegmentsPerXLogId)
-
-/* Check if an xrecoff value is in a plausible range */
-#define XRecOffIsValid(xrecoff) \
-		((xrecoff) % XLOG_BLCKSZ >= SizeOfXLogShortPHD && \
-		(XLOG_BLCKSZ - (xrecoff) % XLOG_BLCKSZ) >= SizeOfXLogRecord)
+	((((xlrp) - 1) / XLogSegSize) == (logSegNo))
+
+/* Check if an XLogRecPtr value is in a plausible range */
+#define XRecOffIsValid(xlrp) \
+		((xlrp) % XLOG_BLCKSZ >= SizeOfXLogShortPHD && \
+		 (XLOG_BLCKSZ - (xlrp) % XLOG_BLCKSZ) >= SizeOfXLogRecord)
 
 /*
  * The XLog directory and control file (relative to $PGDATA)
diff --git a/src/include/access/xlogdefs.h b/src/include/access/xlogdefs.h
index 6038548..153d0de 100644
--- a/src/include/access/xlogdefs.h
+++ b/src/include/access/xlogdefs.h
@@ -17,55 +17,30 @@
 /*
  * Pointer to a location in the XLOG.  These pointers are 64 bits wide,
  * because we don't want them ever to overflow.
- *
- * NOTE: xrecoff == 0 is used to indicate an invalid pointer.  This is OK
- * because we use page headers in the XLOG, so no XLOG record can start
- * right at the beginning of a file.
- *
- * NOTE: the "log file number" is somewhat misnamed, since the actual files
- * making up the XLOG are much smaller than 4Gb.  Each actual file is an
- * XLogSegSize-byte "segment" of a logical log file having the indicated
- * xlogid.	The log file number and segment number together identify a
- * physical XLOG file.	Segment number and offset within the physical file
- * are computed from xrecoff div and mod XLogSegSize.
  */
-typedef struct XLogRecPtr
-{
-	uint32		xlogid;			/* log file #, 0 based */
-	uint32		xrecoff;		/* byte offset of location in log file */
-} XLogRecPtr;
-
-#define XLogRecPtrIsInvalid(r)	((r).xrecoff == 0)
+typedef uint64 XLogRecPtr;
 
+/*
+ * Zero is used indicate an invalid pointer. Bootstrap skips the first possible
+ * WAL segment, initializing the first WAL page at XLOG_SEG_SIZE, so no XLOG
+ * record can begin at zero.
+ */
+#define InvalidXLogRecPtr	0
+#define XLogRecPtrIsInvalid(r)	((r) == InvalidXLogRecPtr)
 
 /*
  * Macros for comparing XLogRecPtrs
- *
- * Beware of passing expressions with side-effects to these macros,
- * since the arguments may be evaluated multiple times.
  */
-#define XLByteLT(a, b)		\
-			((a).xlogid < (b).xlogid || \
-			 ((a).xlogid == (b).xlogid && (a).xrecoff < (b).xrecoff))
-
-#define XLByteLE(a, b)		\
-			((a).xlogid < (b).xlogid || \
-			 ((a).xlogid == (b).xlogid && (a).xrecoff <= (b).xrecoff))
-
-#define XLByteEQ(a, b)		\
-			((a).xlogid == (b).xlogid && (a).xrecoff == (b).xrecoff)
+#define XLByteLT(a, b)		((a) < (b))
+#define XLByteLE(a, b)		((a) <= (b))
+#define XLByteEQ(a, b)		((a) == (b))
 
 
 /*
  * Macro for advancing a record pointer by the specified number of bytes.
  */
 #define XLByteAdvance(recptr, nbytes)						\
-	do {													\
-		uint32 oldxrecoff = (recptr).xrecoff;				\
-		(recptr).xrecoff += nbytes;							\
-		if ((recptr).xrecoff < oldxrecoff)					\
-			(recptr).xlogid += 1;		/* xrecoff wrapped around */	\
-	} while (0)
+		(recptr) += nbytes									\
 
 /*
  * XLogSegNo - physical log file sequence number.
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 1031e56..a437947 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -21,7 +21,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	922
+#define PG_CONTROL_VERSION	931
 
 /*
  * Body of CheckPoint XLOG records.  This is declared here because we keep
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 1ab64e0..fc3a69b 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -119,10 +119,18 @@ typedef uint16 LocationIndex;
  * On the high end, we can only support pages up to 32KB because lp_off/lp_len
  * are 15 bits.
  */
+
+/* for historical reasons, the LSN is stored as two 32-bit values. */
+typedef struct
+{
+	uint32		xlogid;			/* high bits */
+	uint32		xrecoff;		/* low bits */
+} PageXLogRecPtr;
+
 typedef struct PageHeaderData
 {
 	/* XXX LSN is member of *any* block, not only page-organized ones */
-	XLogRecPtr	pd_lsn;			/* LSN: next byte after last byte of xlog
+	PageXLogRecPtr	pd_lsn;			/* LSN: next byte after last byte of xlog
 								 * record for last change to this page */
 	uint16		pd_tli;			/* least significant bits of the TimeLineID
 								 * containing the LSN */
@@ -314,9 +322,10 @@ typedef PageHeaderData *PageHeader;
  * Additional macros for access to page headers
  */
 #define PageGetLSN(page) \
-	(((PageHeader) (page))->pd_lsn)
+	((uint64) ((PageHeader) (page))->pd_lsn.xlogid << 32 | ((PageHeader) (page))->pd_lsn.xrecoff)
 #define PageSetLSN(page, lsn) \
-	(((PageHeader) (page))->pd_lsn = (lsn))
+	(((PageHeader) (page))->pd_lsn.xlogid = (uint32) ((lsn) >> 32),	\
+	 ((PageHeader) (page))->pd_lsn.xrecoff = (uint32) (lsn))
 
 /* NOTE: only the 16 least significant bits are stored */
 #define PageGetTLI(page) \

#13

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Heikki Linnakangas (#12)

Re: WAL format changes

On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:

On 18.06.2012 21:08, Heikki Linnakangas wrote:

On 18.06.2012 21:00, Robert Haas wrote:

On Thu, Jun 14, 2012 at 5:58 PM, Andres Freund<andres@2ndquadrant.com>

wrote:

1. Use a 64-bit segment number, instead of the log/seg combination.
And don't waste the last segment on each logical 4 GB log file. The
concept of a "logical log file" is now completely gone. XLogRecPtr is
unchanged,
but it should now be understood as a plain 64-bit value, just split
into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

Whats the reason for keeping that awkward split now? There aren't
that many
users of xlogid/xcrecoff and many of those would be better served by
using
helper macros.

I wondered that, too. There may be a good reason for keeping it split
up that way, but we at least oughta think about it a bit.

The page header contains an XLogRecPtr (LSN), so if we change it we'll
have to deal with pg_upgrade. I guess we could still keep XLogRecPtr
around as the on-disk representation, and convert between the 64-bit
integer and XLogRecPtr in PageGetLSN/PageSetLSN. I can try that out -
many xlog calculations would admittedly be simpler if it was an uint64.

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure
if having two representations which just have a constant factor inbetween
really makes sense.

The LSNs on pages are still stored in the old format, to avoid changing
the on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the
control file and WAL are changed, however, so an initdb (or at least
pg_resetxlog) is required.

Sounds sensible.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol.

The replication protocol uses pq_sendint for integers which should take care
of converting to big endian already. I don't think anything but the wal itself
is otherwise transported in a binary fashion? So I don't think there is any
such architecture dependency in the protocol currently?

I don't really see a point in keeping around a backward-compatible
representation just for the sake of running such tools on multiple versions. I
might not be pragmatic enough, but: Why would you want to do that *at the
moment*? Many of the other tools are already version specific, so...
When the protocol starts to be used by more tools, maybe, but imo were not
there yet.

But then its not hard to convert to the old representation for that.

I kept the %X/%X representation in error messages etc. I'm quite used to
that output, so reluctant to change it, although it's a bit silly now
that it represents just 64-bit value. Using UINT64_FORMAT would also
make the messages harder to translate.

No opinion on that. Its easier to see for me whether two values are exactly
the same or very similar with the 64bit representation but its harder to gauge
bigger differences. So ...

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14

Robert Haas

robertmhaas@gmail.com

over 13 years ago

In reply to: Heikki Linnakangas (#12)

Re: WAL format changes

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.

Seems fine.

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in >= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

I kept the %X/%X representation in error messages etc. I'm quite used to
that output, so reluctant to change it, although it's a bit silly now that
it represents just 64-bit value. Using UINT64_FORMAT would also make the
messages harder to translate.

I could go either way on this one, but I have no problem with the way
you did it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 13 years ago

In reply to: Andres Freund (#13)

Re: WAL format changes

On 19.06.2012 18:46, Andres Freund wrote:

On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not sure
if having two representations which just have a constant factor inbetween
really makes sense.

I wasn't planning to, it didn't even occur to me that we might be able
to get rid of XLogSegNo to be honest. There's places that deal whole
segments, rather than with specific byte positions in the WAL, so I
think XLogSegNo makes more sense in that context. Take
XLogArchiveNotifySeg(), for example. It notifies the archiver that a
given segment is ready for archiving, so we pass an XLogSegNo to
identify that segment as an argument. I suppose we could pass an
XLogRecPtr that points to the beginning of the segment instead, but it
doesn't really feel like an improvement to me.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol.

The replication protocol uses pq_sendint for integers which should take care
of converting to big endian already. I don't think anything but the wal itself
is otherwise transported in a binary fashion? So I don't think there is any
such architecture dependency in the protocol currently?

We only use pg_sendint() for the few values exchanged in the handshake
before we start replicating, but once we begin, we just send structs
around. For example, in ProcessStandbyReplyMessage():

static void
ProcessStandbyReplyMessage(void)
{
StandbyReplyMessage reply;

pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
...

After that, we just the fields in the reply struct like in any other struct.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#16

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Heikki Linnakangas (#15)

Re: WAL format changes

Hi,

On Wednesday, June 20, 2012 12:24:54 AM Heikki Linnakangas wrote:

On 19.06.2012 18:46, Andres Freund wrote:

On Tuesday, June 19, 2012 10:14:08 AM Heikki Linnakangas wrote:

Well, that was easier than I thought. Attached is a patch to make
XLogRecPtr a uint64, on top of my other WAL format patches. I think we
should go ahead with this.

Cool. You plan to merge XLogSegNo with XLogRecPtr in that case? I am not
sure if having two representations which just have a constant factor
inbetween really makes sense.

I wasn't planning to, it didn't even occur to me that we might be able
to get rid of XLogSegNo to be honest. There's places that deal whole
segments, rather than with specific byte positions in the WAL, so I
think XLogSegNo makes more sense in that context. Take
XLogArchiveNotifySeg(), for example. It notifies the archiver that a
given segment is ready for archiving, so we pass an XLogSegNo to
identify that segment as an argument. I suppose we could pass an
XLogRecPtr that points to the beginning of the segment instead, but it
doesn't really feel like an improvement to me.

I am not sure its a win either, was just wondering because they now are that
similar.

Should we keep the old representation in the replication protocol
messages? That would make it simpler to write a client that works with
different server versions (like pg_receivexlog). Or, while we're at it,
perhaps we should mandate network-byte order for all the integer and
XLogRecPtr fields in the replication protocol.

The replication protocol uses pq_sendint for integers which should take
care of converting to big endian already. I don't think anything but the
wal itself is otherwise transported in a binary fashion? So I don't
think there is any such architecture dependency in the protocol
currently?

We only use pg_sendint() for the few values exchanged in the handshake
before we start replicating, but once we begin, we just send structs

around. For example, in ProcessStandbyReplyMessage():

static void
ProcessStandbyReplyMessage(void)
{

StandbyReplyMessage reply;

pq_copymsgbytes(&reply_message, (char *) &reply,
sizeof(StandbyReplyMessage));

...

After that, we just the fields in the reply struct like in any other
struct.

Yes, forgot that, true. I guess the best fix would be to actually send normal
messages instead of CopyData ones? Much more to type though...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17

Magnus Hagander

magnus@hagander.net

over 13 years ago

In reply to: Robert Haas (#14)

Re: WAL format changes

On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.

Seems fine.

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in >= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocl *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#18

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Magnus Hagander (#17)

Re: WAL format changes

On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.

Seems fine.

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in >= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocl *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot handle correctly the WAL location "FF"
(which was skipped in 9.2 or before). For example, pg_receivexlog calls
XLByteAdvance() which always skips "FF". So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with the server in 9.3 which
might send "FF". No?

Regards,

--
Fujii Masao

#19

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 13 years ago

In reply to: Fujii Masao (#18)

Re: WAL format changes

On 20.06.2012 20:43, Fujii Masao wrote:

On Wed, Jun 20, 2012 at 8:19 PM, Magnus Hagander<magnus@hagander.net> wrote:

On Tue, Jun 19, 2012 at 5:57 PM, Robert Haas<robertmhaas@gmail.com> wrote:

On Tue, Jun 19, 2012 at 4:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Well, that was easier than I thought. Attached is a patch to make XLogRecPtr
a uint64, on top of my other WAL format patches. I think we should go ahead
with this.

+1.

The LSNs on pages are still stored in the old format, to avoid changing the
on-disk format and breaking pg_upgrade. The XLogRecPtrs stored the control
file and WAL are changed, however, so an initdb (or at least pg_resetxlog)
is required.

Seems fine.

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in>= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

If we are looking at breaking it, and we are especially concerned
about something like pg_receivexlog... Is it something we could/should
change in the protocl *now* for 9.2, to make it non-broken in any
released version? As in, can we extract just the protocol change and
backpatch that to 9.2beta?

pg_receivexlog in 9.2 cannot handle correctly the WAL location "FF"
(which was skipped in 9.2 or before). For example, pg_receivexlog calls
XLByteAdvance() which always skips "FF". So even if we change the protocol,
ISTM pg_receivexlog in 9.2 cannot work well with the server in 9.3 which
might send "FF". No?

Yeah, you can't use pg_receivexlog from 9.2 against a 9.3 server. We
can't really promise compatibility when using an older client against a
newer server, but we can try to be backwards-compatible in the other
direction. I'm thinking of using a 9.3 pg_receivexlog against a 9.2 server.

But I guess Robert is right and we shouldn't worry about
backwards-compatibility at this point. Instead, let's try to get the
protocol right, so that we can more easily provide
backwards-compatibility in the future. Like, using a 9.4 pg_receivexlog
against a 9.3 server.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 13 years ago

In reply to: Robert Haas (#14)

Re: WAL format changes

Ok, committed all the WAL format changes now.

On 19.06.2012 18:57, Robert Haas wrote:

Should we keep the old representation in the replication protocol messages?
That would make it simpler to write a client that works with different
server versions (like pg_receivexlog). Or, while we're at it, perhaps we
should mandate network-byte order for all the integer and XLogRecPtr fields
in the replication protocol. That would make it easier to write a client
that works across different architectures, in>= 9.3. The contents of the
WAL would of course be architecture-dependent, but it would be nice if
pg_receivexlog and similar tools could nevertheless be
architecture-independent.

I share Andres' question about how we're doing this already. I think
if we're going to break this, I'd rather do it in 9.3 than 5 years
from now. At this point it's just a minor annoyance, but it'll
probably get worse as people write more tools that understand WAL.

I didn't touch the replication protocol yet, but I think we should do it
some time during 9.3.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#21

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Heikki Linnakangas (#20)

Re: WAL format changes

On 24 June 2012 17:24, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Ok, committed all the WAL format changes now.

Nice!

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#22

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Heikki Linnakangas (#20)

1 attachment(s)

Re: WAL format changes

On Mon, Jun 25, 2012 at 1:24 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Ok, committed all the WAL format changes now.

This breaks pg_resetxlog -l at all. When I ran "pg_resetxlog -l 0x01,0x01,0x01"
in the HEAD, I got the following error message though the same command
successfully completed in 9.1.

pg_resetxlog: invalid argument for option -l
Try "pg_resetxlog --help" for more information.

I think the attached patch needs to be applied.

Regards,

--
Fujii Masao

Attachments:

resetxlog_bugfix_v1.patchapplication/octet-stream; name=resetxlog_bugfix_v1.patchDownload

*** a/src/bin/pg_resetxlog/pg_resetxlog.c
--- b/src/bin/pg_resetxlog/pg_resetxlog.c
***************
*** 86,94 **** main(int argc, char *argv[])
  	Oid			set_oid = 0;
  	MultiXactId set_mxid = 0;
  	MultiXactOffset set_mxoff = (MultiXactOffset) -1;
! 	uint32		minXlogTli = 0;
  	XLogSegNo	minXlogSegNo = 0;
  	char	   *endptr;
  	char	   *DataDir;
  	int			fd;
  	char		path[MAXPGPATH];
--- 86,98 ----
  	Oid			set_oid = 0;
  	MultiXactId set_mxid = 0;
  	MultiXactOffset set_mxoff = (MultiXactOffset) -1;
! 	uint32		minXlogTli = 0,
! 				minXlogId = 0,
! 				minXlogSeg = 0;
  	XLogSegNo	minXlogSegNo = 0;
  	char	   *endptr;
+ 	char	   *endptr2;
+ 	char	   *endptr3;
  	char	   *DataDir;
  	int			fd;
  	char		path[MAXPGPATH];
***************
*** 200,212 **** main(int argc, char *argv[])
  				break;
  
  			case 'l':
! 				if (strspn(optarg, "01234567890ABCDEFabcdef") != 24)
  				{
  					fprintf(stderr, _("%s: invalid argument for option -l\n"), progname);
  					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
  					exit(1);
  				}
! 				XLogFromFileName(optarg, &minXlogTli, &minXlogSegNo);
  				break;
  
  			default:
--- 204,231 ----
  				break;
  
  			case 'l':
! 				minXlogTli = strtoul(optarg, &endptr, 0);
! 				if (endptr == optarg || *endptr != ',')
  				{
  					fprintf(stderr, _("%s: invalid argument for option -l\n"), progname);
  					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
  					exit(1);
  				}
! 				minXlogId = strtoul(endptr + 1, &endptr2, 0);
! 				if (endptr2 == endptr + 1 || *endptr2 != ',')
! 				{
! 					fprintf(stderr, _("%s: invalid argument for option -l\n"), progname);
! 					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
! 					exit(1);
! 				}
! 				minXlogSeg = strtoul(endptr2 + 1, &endptr3, 0);
! 				if (endptr3 == endptr2 + 1 || *endptr3 != '\0')
! 				{
! 					fprintf(stderr, _("%s: invalid argument for option -l\n"), progname);
! 					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
! 					exit(1);
! 				}
! 				minXlogSegNo = (uint64) minXlogId * XLogSegmentsPerXLogId + minXlogSeg;
  				break;
  
  			default:

#23

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Heikki Linnakangas (#20)

Re: WAL format changes

On Mon, Jun 25, 2012 at 1:24 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Ok, committed all the WAL format changes now.

I found the typo.

In walsender.c
-		 reply.write.xlogid, reply.write.xrecoff,
-		 reply.flush.xlogid, reply.flush.xrecoff,
-		 reply.apply.xlogid, reply.apply.xrecoff);
+		 (uint32) (reply.write << 32), (uint32) reply.write,
+		 (uint32) (reply.flush << 32), (uint32) reply.flush,
+		 (uint32) (reply.apply << 32), (uint32) reply.apply);

"<<" should be ">>". The attached patch fixes this typo.

Regards,

--
Fujii Masao

#24

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Fujii Masao (#23)

1 attachment(s)

Re: WAL format changes

On Tue, Jun 26, 2012 at 2:53 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Mon, Jun 25, 2012 at 1:24 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Ok, committed all the WAL format changes now.

I found the typo.
In walsender.c
-                reply.write.xlogid, reply.write.xrecoff,
-                reply.flush.xlogid, reply.flush.xrecoff,
-                reply.apply.xlogid, reply.apply.xrecoff);
+                (uint32) (reply.write << 32), (uint32) reply.write,
+                (uint32) (reply.flush << 32), (uint32) reply.flush,
+                (uint32) (reply.apply << 32), (uint32) reply.apply);
"<<" should be ">>". The attached patch fixes this typo.

Oh, I forgot to attach the patch.. Here is the patch.

Regards,

--
Fujii Masao

Attachments:

walsender_typo_v1.patchapplication/octet-stream; name=walsender_typo_v1.patchDownload

*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 612,620 **** ProcessStandbyReplyMessage(void)
  	pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
  
  	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X",
! 		 (uint32) (reply.write << 32), (uint32) reply.write,
! 		 (uint32) (reply.flush << 32), (uint32) reply.flush,
! 		 (uint32) (reply.apply << 32), (uint32) reply.apply);
  
  	/*
  	 * Update shared state for this WalSender process based on reply data from
--- 612,620 ----
  	pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
  
  	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X",
! 		 (uint32) (reply.write >> 32), (uint32) reply.write,
! 		 (uint32) (reply.flush >> 32), (uint32) reply.flush,
! 		 (uint32) (reply.apply >> 32), (uint32) reply.apply);
  
  	/*
  	 * Update shared state for this WalSender process based on reply data from

#25

Robert Haas

robertmhaas@gmail.com

over 13 years ago

In reply to: Fujii Masao (#24)

Re: WAL format changes

On Mon, Jun 25, 2012 at 1:57 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

"<<" should be ">>". The attached patch fixes this typo.

Oh, I forgot to attach the patch.. Here is the patch.

I committed both of the patches you posted to this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#26

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 13 years ago

In reply to: Robert Haas (#25)

Re: WAL format changes

On 25.06.2012 21:01, Robert Haas wrote:

On Mon, Jun 25, 2012 at 1:57 PM, Fujii Masao<masao.fujii@gmail.com> wrote:

"<<" should be">>". The attached patch fixes this typo.

Oh, I forgot to attach the patch.. Here is the patch.

I committed both of the patches you posted to this thread.

Thanks Robert. I was thinking that "pg_resetxlog -l" would accept a WAL
file name, instead of comma-separated tli, xlogid, segno arguments. The
latter is a bit meaningless now that we don't use the xlogid+segno
combination anywhere else. Alvaro pointed out that pg_upgrade was broken
by the change in pg_resetxlog -n output - I changed that too to print
the "First log segment after reset" information as a WAL file name,
instead of logid+segno. Another option would be to print the 64-bit
segment number, but I think that's worse, because the 64-bit segment
number is harder to associate with a physical WAL file.

So I think we should change pg_resetxlog -l option to take a WAL file
name as argument, and fix pg_upgrade accordingly.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#27

Alvaro Herrera

alvherre@commandprompt.com

over 13 years ago

In reply to: Heikki Linnakangas (#26)

Re: WAL format changes

Excerpts from Heikki Linnakangas's message of lun jun 25 20:09:34 -0400 2012:

On 25.06.2012 21:01, Robert Haas wrote:

On Mon, Jun 25, 2012 at 1:57 PM, Fujii Masao<masao.fujii@gmail.com> wrote:

"<<" should be">>". The attached patch fixes this typo.

Oh, I forgot to attach the patch.. Here is the patch.

I committed both of the patches you posted to this thread.

Thanks Robert. I was thinking that "pg_resetxlog -l" would accept a WAL
file name, instead of comma-separated tli, xlogid, segno arguments. The
latter is a bit meaningless now that we don't use the xlogid+segno
combination anywhere else. Alvaro pointed out that pg_upgrade was broken
by the change in pg_resetxlog -n output - I changed that too to print
the "First log segment after reset" information as a WAL file name,
instead of logid+segno. Another option would be to print the 64-bit
segment number, but I think that's worse, because the 64-bit segment
number is harder to associate with a physical WAL file.

So I think we should change pg_resetxlog -l option to take a WAL file
name as argument, and fix pg_upgrade accordingly.

The only thing pg_upgrade does with the tli/logid/segno combo, AFAICT,
is pass it back to pg_resetxlog -l, so this plan seems reasonable.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#28

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Heikki Linnakangas (#26)

Re: WAL format changes

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

So I think we should change pg_resetxlog -l option to take a WAL file
name as argument, and fix pg_upgrade accordingly.

Seems reasonable I guess. It's really specifying a starting WAL
location, but only to file granularity, so treating the argument as a
file name is sort of a type cheat but seems convenient.

If we do it that way, we'd better validate that the argument is a legal
WAL file name, so as to catch any cases where somebody tries to do it
old-style.

BTW, does pg_resetxlog's logic for setting the default -l value (from
scanning pg_xlog to find the largest existing file name) still work?

regards, tom lane

#29

Amit Kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Tom Lane (#28)

Re: WAL format changes

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

So I think we should change pg_resetxlog -l option to take a WAL file
name as argument, and fix pg_upgrade accordingly.

Seems reasonable I guess. It's really specifying a starting WAL
location, but only to file granularity, so treating the argument as a
file name is sort of a type cheat but seems convenient.

If we do it that way, we'd better validate that the argument is a legal
WAL file name, so as to catch any cases where somebody tries to do it
old-style.

BTW, does pg_resetxlog's logic for setting the default -l value (from
scanning pg_xlog to find the largest existing file name) still work?

It finds the segment number for largest existing file name from pg_xlog and
then compare it with input provided by the
user for -l Option, if input is greater it will use the input to set in
control file.

With Regards,
Amit Kapila.

#30

Peter Eisentraut

peter_e@gmx.net

over 13 years ago

In reply to: Heikki Linnakangas (#1)

Re: WAL format changes

On fre, 2012-06-15 at 00:01 +0300, Heikki Linnakangas wrote:

1. Use a 64-bit segment number, instead of the log/seg combination. And
don't waste the last segment on each logical 4 GB log file. The concept
of a "logical log file" is now completely gone. XLogRecPtr is unchanged,
but it should now be understood as a plain 64-bit value, just split into
two 32-bit integers for historical reasons. On disk, this means that
there will be log files ending in FF, those were skipped before.

A thought on this. There were some concerns that this would silently
break tools that pretend to have detailed knowledge of WAL file
numbering and this previous behavior of skipping the FF files. We could
address this by "fixing" the overall file naming from something like

00000001000008D0000000FD
00000001000008D0000000FE
00000001000008D0000000FF
00000001000008D100000000

00000001000008D0FD000000
00000001000008D0FE000000
00000001000008D0FF000000
00000001000008D100000000

which represents the new true WAL stream numbering as opposed to the old
two-part numbering.

Thus, any tool that thinks it knows how the WAL files are sequenced will
break very obviously, but any tool that just looks for 24 hexadecimal
digits will be fine.

I wonder if any tools in the former category would also break if one
changes XLOG_SEG_SIZE.

#31

Greg Stark

stark@mit.edu

over 13 years ago

In reply to: Heikki Linnakangas (#1)

Re: WAL format changes

On Thu, Jun 14, 2012 at 10:01 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

This has the advantage that you can calculate the CRC for all the other
fields before acquiring WALInsertLock. For xl_prev, you need to know where
exactly the record is inserted, so it's handy that it's the last field
before CRC.

It may be late to mention this but fwiw you don't need to reorder the
fields to do this. CRC has the property that you can easily adjust it
for any changes to the data covered by it. Regardless of where the
xl_prev link is you can calculate the CRC as if xl_prev is 0 and then
once you get the lock "add in" the correct xl_prev. This is an
argument in favour of using CRC over other checksums for which that
would be hard or impossible.

--
greg